Iris Dataset: Exploratory Data Analysis, Data Visualization, and Classification

About the dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set

Iris Dataset: Simple Exploratory Data Analysis (EDA)

Import Modules

In [1]:
import pandas as pd # import pandas for data manipulation

Load Dataset

In [2]:
iris_df = pd.read_csv('./dataset/iris/Iris.csv') # load the CSV file into a DataFrame
iris_df.head() # show the first 5 rows
Out[2]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
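
If the Kaggle CSV is not available locally, the same 150 rows ship with scikit-learn. A minimal sketch, assuming scikit-learn >= 0.23 for as_frame=True; the column names differ from the CSV, so they are renamed here to match:

from sklearn.datasets import load_iris
import pandas as pd

bunch = load_iris(as_frame=True) # Bunch object with a ready-made DataFrame
df = bunch.frame.drop(columns='target') # keep the four measurement columns
df.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
df['Species'] = 'Iris-' + pd.Series(bunch.target_names[bunch.target]) # e.g. 'Iris-setosa'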

Drop Column 'Id'

In [3]:
iris_df.drop(columns='Id', inplace=True) # drop the 'Id' column
# inplace=True modifies the current DataFrame instead of returning a new copy

iris_df.head() # show the first 5 rows
Out[3]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Identify the Shape of the Dataset

In [4]:
iris_df.shape # dataset dimensions as (rows, columns)
Out[4]:
(150, 5)

Get the List of Columns

In [5]:
iris_df.columns # list of columns
Out[5]:
Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

Identify Data Types For Each Column

In [6]:
iris_df.dtypes # data type of each column
Out[6]:
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

Get Basic Dataset Information

In [7]:
iris_df.info() # concise summary of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

Identify Missing Values

In [8]:
iris_df.isna().values.any() # check whether any value in the dataset is missing
Out[8]:
False
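
isna().values.any() collapses the whole frame into one boolean. A per-column count is often more informative; a quick sketch:

iris_df.isna().sum() # number of missing values in each column (all zeros for this dataset)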

Identify Duplicate Entries/Rows

In [9]:
iris_df[iris_df.duplicated(keep=False)] # show all duplicated rows (keep=False marks every copy)
Out[9]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
9 4.9 3.1 1.5 0.1 Iris-setosa
34 4.9 3.1 1.5 0.1 Iris-setosa
37 4.9 3.1 1.5 0.1 Iris-setosa
101 5.8 2.7 5.1 1.9 Iris-virginica
142 5.8 2.7 5.1 1.9 Iris-virginica
In [10]:
iris_df.duplicated().value_counts() # count duplicated vs. unique rows
Out[10]:
False    147
True       3
dtype: int64

Drop Duplicate Entries/Rows

In [11]:
iris_df.drop_duplicates(inplace=True) # drop duplicate rows (the first occurrence is kept)
iris_df.shape
Out[11]:
(147, 5)
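
Note that drop_duplicates() keeps the original row labels, so the index now has gaps where rows 34, 37, and 142 used to be. If a contiguous 0..146 index is preferred (e.g. for the index-based line plots later on), it can be reset; this step is not applied in the runs below:

iris_df.reset_index(drop=True, inplace=True) # renumber rows consecutively; drop=True discards the old index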

Describe the Dataset

In [12]:
iris_df.describe() # summary statistics of the numeric columns
Out[12]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
count 147.000000 147.000000 147.000000 147.000000
mean 5.856463 3.055782 3.780272 1.208844
std 0.829100 0.437009 1.759111 0.757874
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.400000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
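
describe() pools all three species together, which blurs the differences between them. A per-species breakdown, as a sketch:

iris_df.groupby('Species').mean() # mean of each measurement per species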

Correlation Matrix

In [13]:
iris_df.corr() # pairwise correlation between the numeric columns
Out[13]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
SepalLengthCm 1.000000 -0.109321 0.871305 0.817058
SepalWidthCm -0.109321 1.000000 -0.421057 -0.356376
PetalLengthCm 0.871305 -0.421057 1.000000 0.961883
PetalWidthCm 0.817058 -0.356376 0.961883 1.000000
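
A version note: older pandas silently ignores the non-numeric Species column in corr(), which is what the output above reflects; pandas 2.0 and later raises a TypeError instead. A version-safe sketch:

iris_df.select_dtypes(include='number').corr() # restrict to numeric columns before correlating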

Iris Dataset: Data Visualization

Import Modules

In [14]:
import matplotlib.pyplot as plt # import matplotlib as plt for data visualization
import seaborn as sns # import seaborn as sns for data visualization

%matplotlib inline
# render plots inline in the notebook

Heatmap

In [15]:
sns.heatmap(data=iris_df.corr()) # visualization using Heatmap
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x2362af88160>
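
The default heatmap shows colors only. A sketch with the coefficients annotated and the color scale pinned to the full [-1, 1] correlation range (annot, cmap, vmin, and vmax are standard seaborn.heatmap parameters):

sns.heatmap(iris_df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1) # print each coefficient in its cell
plt.tight_layout()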

Bar Plot

In [16]:
iris_df['Species'].value_counts() # count rows per species
Out[16]:
Iris-versicolor    50
Iris-virginica     49
Iris-setosa        48
Name: Species, dtype: int64
In [17]:
iris_df['Species'].value_counts().plot.bar() # visualization using Bar plot
plt.tight_layout() # adjust spacing so labels fit inside the figure
plt.show()
In [18]:
sns.countplot(data=iris_df, x='Species') # bar plot with seaborn (color-coded bars)
plt.tight_layout()

Pie Chart

In [19]:
iris_df['Species'].value_counts().plot.pie(autopct='%1.1f%%', labels=None, legend=True) # pie chart with percentage labels
plt.tight_layout()

Line Plot

In [20]:
fig,ax = plt.subplots(nrows=2, ncols=2, figsize=(8,8)) # visualization using Line Plot

iris_df['SepalLengthCm'].plot.line(ax=ax[0][0]) # visualize sepal length
ax[0][0].set_title('Sepal Length')

iris_df['SepalWidthCm'].plot.line(ax=ax[0][1]) # visualize sepal width
ax[0][1].set_title('Sepal Width')

iris_df.PetalLengthCm.plot.line(ax=ax[1][0]) # visualize petal length
ax[1][0].set_title('Petal Length')

iris_df.PetalWidthCm.plot.line(ax=ax[1][1]) # visualize petal width
ax[1][1].set_title('Petal Width')
Out[20]:
Text(0.5, 1.0, 'Petal Width')
In [21]:
iris_df.plot() # all four features as line plots in a single figure
plt.tight_layout()

Histogram

In [22]:
iris_df.hist(figsize=(6,6), bins=10) # visualization using Histogram; bins=10 sets the number of bins
plt.tight_layout()

Boxplot

In [23]:
iris_df.boxplot() # visualization using Box Plot (quartiles, min, max, outliers)
plt.tight_layout()
In [24]:
iris_df.boxplot(by="Species", figsize=(8,8)) # box plots grouped by species
plt.tight_layout()

Scatter Plot

In [25]:
sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm', data=iris_df, hue='Species') # visualization using Scatter Plot
plt.tight_layout()

Pair Plot

In [26]:
sns.pairplot(iris_df, hue='Species', markers='+') # visualization using Pair Plot
plt.tight_layout()

Violin Plot

In [27]:
sns.violinplot(data=iris_df, y='Species', x='SepalLengthCm', inner='quartile') # visualization using Violin Plot
plt.tight_layout()

Iris Dataset: Classification Models

Import Modules

In [28]:
from sklearn.model_selection import train_test_split # split the dataset into training and testing sets
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report # evaluate model performance

Dataset: Features & Class Label

In [29]:
X = iris_df.drop(columns='Species') # put features into variable X
X.head() # show the first 5 rows of X
Out[29]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
In [30]:
y = iris_df['Species'] # put the class label (target) into variable y
y.head() # show the first 5 rows of y
Out[30]:
0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Species, dtype: object

Split the Dataset Into A Training Set and A Testing Set

In [31]:
# split dataset into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=10)

print('training dataset')
print(X_train.shape)
print(y_train.shape)
print()
print('testing dataset:')
print(X_test.shape)
print(y_test.shape)
training dataset
(88, 4)
(88,)

testing dataset:
(59, 4)
(59,)
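
With a purely random split the three species need not be equally represented in the test set (here the split happens to give 18/21/20). A variant worth knowing, shown as a sketch and not used in the runs below so the printed results stay as they are: stratify=y preserves the class proportions in both splits.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=10, stratify=y) # stratified split keeps class ratios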

K Nearest Neighbors

In [32]:
from sklearn.neighbors import KNeighborsClassifier # using KNN as classifier
In [33]:
k_range = list(range(1,26)) # candidate k values 1-25 (range excludes the endpoint 26)
scores = []
for k in k_range:
    model_knn = KNeighborsClassifier(n_neighbors=k) # configure the algorithm
    model_knn.fit(X_train, y_train) # train the model/classifier
    y_pred = model_knn.predict(X_test) # predict on the test set
    scores.append(accuracy_score(y_test, y_pred)) # record test accuracy
In [34]:
plt.plot(k_range, scores) # x-axis: number of neighbors k; y-axis: accuracy
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for Values of k of k-Nearest-Neighbors')
plt.tight_layout()
plt.show()
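
The curve above is scored on a single test set, so it is fairly noisy. A steadier way to choose k, sketched here with 5-fold cross-validation on the training set only (so the test set stays untouched):

from sklearn.model_selection import cross_val_score

cv_scores = []
for k in k_range:
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores.append(cross_val_score(model, X_train, y_train, cv=5).mean()) # mean accuracy over 5 folds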
In [35]:
model_knn = KNeighborsClassifier(n_neighbors=3) # configure the algorithm with 3 neighbors
model_knn.fit(X_train,y_train) # train the model/classifier
y_pred = model_knn.predict(X_test) # predict on the test set
Accuracy Score
In [36]:
print(accuracy_score(y_test, y_pred)) # evaluate accuracy on the test set
0.9322033898305084
Confusion Matrix
In [37]:
print(confusion_matrix(y_test, y_pred)) # confusion matrix of true vs. predicted labels
[[18  0  0]
 [ 0 19  2]
 [ 0  2 18]]
Classification Report
In [38]:
print(classification_report(y_test, y_pred)) # per-class precision, recall, and F1
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.90      0.90      0.90        21
 Iris-virginica       0.90      0.90      0.90        20

       accuracy                           0.93        59
      macro avg       0.93      0.93      0.93        59
   weighted avg       0.93      0.93      0.93        59

Logistic Regression

In [39]:
from sklearn.linear_model import LogisticRegression # import Logistic Regression as classifier
In [40]:
model_logreg = LogisticRegression(solver='lbfgs', multi_class='auto') # configure the classifier
model_logreg.fit(X_train,y_train) # train on the training set
y_pred = model_logreg.predict(X_test) # predict on the test set
Accuracy Score
In [41]:
print(accuracy_score(y_test, y_pred))
0.9322033898305084
Confusion Matrix
In [42]:
print(confusion_matrix(y_test, y_pred))
[[18  0  0]
 [ 0 20  1]
 [ 0  3 17]]
Classification Report
In [43]:
print(classification_report(y_test, y_pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.87      0.95      0.91        21
 Iris-virginica       0.94      0.85      0.89        20

       accuracy                           0.93        59
      macro avg       0.94      0.93      0.93        59
   weighted avg       0.93      0.93      0.93        59
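
On some scikit-learn versions the lbfgs solver can hit its default iteration limit on unscaled features and emit a convergence warning. A common remedy, sketched here rather than used above, is to standardize the features in a pipeline and raise max_iter:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model_logreg_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)) # scale, then classify
model_logreg_scaled.fit(X_train, y_train)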

Support Vector Classifier

In [44]:
from sklearn.svm import SVC # import SVC as classifier
In [45]:
model_svc = SVC(gamma='scale') # configure the classifier ('scale' derives gamma from the data)
model_svc.fit(X_train,y_train) # train on the training set
y_pred = model_svc.predict(X_test) # predict on the test set
Accuracy Score
In [46]:
print(accuracy_score(y_test, y_pred))
0.9661016949152542
Confusion Matrix
In [47]:
print(confusion_matrix(y_test, y_pred))
[[18  0  0]
 [ 0 21  0]
 [ 0  2 18]]
Classification Report
In [48]:
print(classification_report(y_test, y_pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.91      1.00      0.95        21
 Iris-virginica       1.00      0.90      0.95        20

       accuracy                           0.97        59
      macro avg       0.97      0.97      0.97        59
   weighted avg       0.97      0.97      0.97        59

Decision Tree Classifier

In [49]:
from sklearn.tree import DecisionTreeClassifier # import Decision Tree Classifier as classifier
In [50]:
model_dt = DecisionTreeClassifier() # configure the classifier with default settings
model_dt.fit(X_train,y_train) # train on the training set
y_pred = model_dt.predict(X_test) # predict on the test set
Accuracy Score
In [51]:
print(accuracy_score(y_test, y_pred))
0.9491525423728814
Confusion Matrix
In [52]:
print(confusion_matrix(y_test, y_pred))
[[18  0  0]
 [ 0 20  1]
 [ 0  2 18]]
Classification Report
In [53]:
print(classification_report(y_test, y_pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.91      0.95      0.93        21
 Iris-virginica       0.95      0.90      0.92        20

       accuracy                           0.95        59
      macro avg       0.95      0.95      0.95        59
   weighted avg       0.95      0.95      0.95        59
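
A fitted tree can also be drawn, which makes the learned split thresholds inspectable. A sketch, assuming scikit-learn >= 0.21 for plot_tree:

from sklearn import tree

plt.figure(figsize=(12, 8))
tree.plot_tree(model_dt, feature_names=list(X.columns), class_names=list(model_dt.classes_), filled=True) # draw the tree
plt.show()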

Random Forest Classifier

In [54]:
from sklearn.ensemble import RandomForestClassifier # import Random Forest Classifier as classifier
In [55]:
model_rf = RandomForestClassifier(n_estimators=100) # n_estimators=100 builds 100 trees
model_rf.fit(X_train,y_train) # train on the training set
y_pred = model_rf.predict(X_test) # store predictions in y_pred so the metrics below evaluate the random forest
Accuracy Score
In [56]:
print(accuracy_score(y_test, y_pred))
0.9491525423728814
Confusion Matrix
In [57]:
print(confusion_matrix(y_test, y_pred))
[[18  0  0]
 [ 0 20  1]
 [ 0  2 18]]
Classification Report
In [58]:
print(classification_report(y_test, y_pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.91      0.95      0.93        21
 Iris-virginica       0.95      0.90      0.92        20

       accuracy                           0.95        59
      macro avg       0.95      0.95      0.95        59
   weighted avg       0.95      0.95      0.95        59
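
Random forests also expose per-feature importances, which show which measurements drive the predictions; a quick sketch (on Iris the petal measurements usually dominate):

pd.Series(model_rf.feature_importances_, index=X.columns).sort_values().plot.barh() # importance per feature
plt.tight_layout()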

Accuracy Comparison For Various Models

In [59]:
models = [model_knn, model_logreg, model_svc, model_dt, model_rf] # compare the models trained above
accuracy_scores = []
for model in models:
    y_pred = model.predict(X_test) # make prediction
    accuracy = accuracy_score(y_test, y_pred) # accuracy score
    accuracy_scores.append(accuracy)
    
print(accuracy_scores)
[0.9322033898305084, 0.9322033898305084, 0.9661016949152542, 0.9491525423728814, 0.9491525423728814]
In [60]:
plt.bar(['KNN', 'LogReg', 'SVC', 'DT', 'RF'], accuracy_scores) # compare the accuracy of the 5 models
plt.ylim(0.90,1.01)
plt.title('Accuracy Comparison For Various Models', fontsize=15, color='r')
plt.xlabel('Models', fontsize=18, color='g')
plt.ylabel('Accuracy Score', fontsize=18, color='g')
plt.tight_layout()
plt.show()
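
All five scores above come from the same single 40% hold-out, so small differences between models are not meaningful. A steadier comparison, sketched with 5-fold cross-validation over the full dataset:

from sklearn.model_selection import cross_val_score

for name, model in zip(['KNN', 'LogReg', 'SVC', 'DT', 'RF'], models):
    scores = cross_val_score(model, X, y, cv=5) # five accuracy scores per model
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')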

Study References

(In Indonesian)

kaggle 01 | Belajar Exploratory Data Analysis, Visualisasi Data, Klasifikasi | Machine Learning (English: Learning Exploratory Data Analysis, Data Visualization, Classification)