Iris Dataset: Exploratory Data Analysis, Data Visualization, and Classification

About the dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set

Iris Dataset: Simple Exploratory Data Analysis (EDA)

Import Modules

In [1]:
import pandas as pd # import pandas for data manipulation

Load Dataset

In [2]:
iris_df = pd.read_csv('./dataset/iris/Iris.csv') # load the CSV file into a DataFrame
iris_df.head() # show the first 5 rows
Out[2]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
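
If the Kaggle CSV is not available locally, the same 150 rows ship with scikit-learn. A minimal sketch, assuming scikit-learn >= 0.23 for as_frame=True; the column names differ from the CSV, so they are renamed here to match:

from sklearn.datasets import load_iris
import pandas as pd

bunch = load_iris(as_frame=True) # Bunch object with a ready-made DataFrame
df = bunch.frame.drop(columns='target') # keep the four measurement columns
df.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
df['Species'] = 'Iris-' + pd.Series(bunch.target_names[bunch.target]) # e.g. 'Iris-setosa'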

Drop Column 'Id'

In [3]:
iris_df.drop(columns='Id', inplace=True) # drop the 'Id' column
# inplace=True modifies the current DataFrame instead of returning a new copy

iris_df.head() # show the first 5 rows
Out[3]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Identify the Shape of the Dataset

In [4]:
iris_df.shape # dataset dimensions as (rows, columns)
Out[4]:
(150, 5)

Get the List of Columns

In [5]:
iris_df.columns # list of columns
Out[5]:
Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

Identify Data Types For Each Column

In [6]:
iris_df.dtypes # data type of each column
Out[6]:
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

Get Basic Dataset Information

In [7]:
iris_df.info() # concise summary of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

Identify Missing Values

In [8]:
iris_df.isna().values.any() # check whether any value in the dataset is missing
Out[8]:
False
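
isna().values.any() collapses the whole frame into one boolean. A per-column count is often more informative; a quick sketch:

iris_df.isna().sum() # number of missing values in each column (all zeros for this dataset)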

Identify Duplicate Entries/Rows

In [9]:
iris_df[iris_df.duplicated(keep=False)] # show all duplicated rows (keep=False marks every copy)
Out[9]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
9 4.9 3.1 1.5 0.1 Iris-setosa
34 4.9 3.1 1.5 0.1 Iris-setosa
37 4.9 3.1 1.5 0.1 Iris-setosa
101 5.8 2.7 5.1 1.9 Iris-virginica
142 5.8 2.7 5.1 1.9 Iris-virginica
In [10]:
iris_df.duplicated().value_counts() # count duplicated vs. unique rows
Out[10]:
False    147
True       3
dtype: int64

Drop Duplicate Entries/Rows

In [11]:
iris_df.drop_duplicates(inplace=True) # drop duplicate rows (the first occurrence is kept)
iris_df.shape
Out[11]:
(147, 5)
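
Note that drop_duplicates() keeps the original row labels, so the index now has gaps where rows 34, 37, and 142 used to be. If a contiguous 0..146 index is preferred (e.g. for the index-based line plots later on), it can be reset; this step is not applied in the runs below:

iris_df.reset_index(drop=True, inplace=True) # renumber rows consecutively; drop=True discards the old index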

Describe the Dataset

In [12]:
iris_df.describe() # summary statistics of the numeric columns
Out[12]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
count 147.000000 147.000000 147.000000 147.000000
mean 5.856463 3.055782 3.780272 1.208844
std 0.829100 0.437009 1.759111 0.757874
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.400000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
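
describe() pools all three species together, which blurs the differences between them. A per-species breakdown, as a sketch:

iris_df.groupby('Species').mean() # mean of each measurement per species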

Correlation Matrix

In [13]:
iris_df.corr() # pairwise correlation between the numeric columns
Out[13]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
SepalLengthCm 1.000000 -0.109321 0.871305 0.817058
SepalWidthCm -0.109321 1.000000 -0.421057 -0.356376
PetalLengthCm 0.871305 -0.421057 1.000000 0.961883
PetalWidthCm 0.817058 -0.356376 0.961883 1.000000
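
A version note: older pandas silently ignores the non-numeric Species column in corr(), which is what the output above reflects; pandas 2.0 and later raises a TypeError instead. A version-safe sketch:

iris_df.select_dtypes(include='number').corr() # restrict to numeric columns before correlating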

Iris Dataset: Data Visualization

Import Modules

In [14]:
import matplotlib.pyplot as plt # import matplotlib as plt for data visualization
import seaborn as sns # import seaborn as sns for data visualization

%matplotlib inline
# render plots inline in the notebook

Heatmap

In [15]:
sns.heatmap(data=iris_df.corr()) # visualization using Heatmap
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x2362af88160>
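
The default heatmap shows colors only. A sketch with the coefficients annotated and the color scale pinned to the full [-1, 1] correlation range (annot, cmap, vmin, and vmax are standard seaborn.heatmap parameters):

sns.heatmap(iris_df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1) # print each coefficient in its cell
plt.tight_layout()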

Bar Plot

In [16]:
iris_df['Species'].value_counts() # count rows per species
Out[16]:
Iris-versicolor    50
Iris-virginica     49
Iris-setosa        48
Name: Species, dtype: int64
In [17]:
iris_df['Species'].value_counts().plot.bar() # visualization using Bar plot
plt.tight_layout() # adjust spacing so labels fit inside the figure
plt.show()
In [18]:
sns.countplot(data=iris_df, x='Species') # bar plot with seaborn (color-coded bars)
plt.tight_layout()

Pie Chart

In [19]:
iris_df['Species'].value_counts().plot.pie(autopct='%1.1f%%', labels=None, legend=True) # pie chart with percentage labels
plt.tight_layout()

Line Plot

In [20]:
fig,ax = plt.subplots(nrows=2, ncols=2, figsize=(8,8)) # visualization using Line Plot

iris_df['SepalLengthCm'].plot.line(ax=ax[0][0]) # visualize sepal length
ax[0][0].set_title('Sepal Length')

iris_df['SepalWidthCm'].plot.line(ax=ax[0][1]) # visualize sepal width
ax[0][1].set_title('Sepal Width')

iris_df.PetalLengthCm.plot.line(ax=ax[1][0]) # visualize petal length
ax[1][0].set_title('Petal Length')

iris_df.PetalWidthCm.plot.line(ax=ax[1][1]) # visualize petal width
ax[1][1].set_title('Petal Width')
Out[20]:
Text(0.5, 1.0, 'Petal Width')
In [21]:
iris_df.plot() # all four features as line plots in a single figure
plt.tight_layout()

Histogram

In [22]:
iris_df.hist(figsize=(6,6), bins=10) # visualization using Histogram; bins=10 sets the number of bins
plt.tight_layout()

Boxplot

In [23]:
iris_df.boxplot() # visualization using Box Plot (quartiles, min, max, outliers)
plt.tight_layout()
In [24]:
iris_df.boxplot(by="Species", figsize=(8,8)) # box plots grouped by species
plt.tight_layout()

Scatter Plot

In [25]:
sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm', data=iris_df, hue='Species') # visualization using Scatter Plot
plt.tight_layout()

Pair Plot

In [26]:
sns.pairplot(iris_df, hue='Species', markers='+') # visualization using Pair Plot
plt.tight_layout()

Violin Plot

In [27]:
sns.violinplot(data=iris_df, y='Species', x='SepalLengthCm', inner='quartile') # visualization using Violin Plot
plt.tight_layout()

Iris Dataset: Classification Models

Import Modules

In [28]:
from sklearn.model_selection import train_test_split # split the dataset into training and testing sets
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report # evaluate model performance

Dataset: Features & Class Label

In [29]:
X = iris_df.drop(columns='Species') # put features into variable X
X.head() # show the first 5 rows of X
Out[29]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
In [30]:
y = iris_df['Species'] # put the class label (target) into variable y
y.head() # show the first 5 rows of y
Out[30]:
0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Species, dtype: object

Split the Dataset Into A Training Set and A Testing Set

In [31]:
# split dataset into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=10)

print('training dataset')
print(X_train.shape)
print(y_train.shape)
print()
print('testing dataset:')
print(X_test.shape)
print(y_test.shape)
training dataset
(88, 4)
(88,)

testing dataset:
(59, 4)
(59,)
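
With a purely random split the three species need not be equally represented in the test set (here the split happens to give 18/21/20). A variant worth knowing, shown as a sketch and not used in the runs below so the printed results stay as they are: stratify=y preserves the class proportions in both splits.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=10, stratify=y) # stratified split keeps class ratios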

K Nearest Neighbors

In [32]:
from sklearn.neighbors import KNeighborsClassifier # using KNN as classifier
In [33]:
k_range = list(range(1,26)) # candidate k values 1-25 (range excludes the endpoint 26)
scores = []
for k in k_range:
    model_knn = KNeighborsClassifier(n_neighbors=k) # configure the algorithm
    model_knn.fit(X_train, y_train) # train the model/classifier
    y_pred = model_knn.predict(X_test) # predict on the test set
    scores.append(accuracy_score(y_test, y_pred)) # record test accuracy
In [34]:
plt.plot(k_range, scores) # x-axis: number of neighbors k; y-axis: accuracy
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for Values of k of k-Nearest-Neighbors')
plt.tight_layout()
plt.show()
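
The curve above is scored on a single test set, so it is fairly noisy. A steadier way to choose k, sketched here with 5-fold cross-validation on the training set only (so the test set stays untouched):

from sklearn.model_selection import cross_val_score

cv_scores = []
for k in k_range:
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores.append(cross_val_score(model, X_train, y_train, cv=5).mean()) # mean accuracy over 5 folds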
In [35]:
model_knn = KNeighborsClassifier(n_neighbors=3) # configure the algorithm with 3 neighbors
model_knn.fit(X_train,y_train) # train the model/classifier
y_pred = model_knn.predict(X_test) # predict on the test set
Accuracy Score
In [36]:
print(accuracy_score(y_test, y_pred)) # evaluate accuracy on the test set
0.9322033898305084
Confusion Matrix
In [37]:
print(confusion_matrix(y_test, y_pred)) # confusion matrix of true vs. predicted labels
[[18  0  0]
 [ 0 19  2]
 [ 0  2 18]]
Classification Report
In [38]:
print(classification_report(y_test, y_pred)) # per-class precision, recall, and F1
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.90      0.90      0.90        21
 Iris-virginica       0.90      0.90      0.90        20

       accuracy                           0.93        59
      macro avg       0.93      0.93      0.93        59
   weighted avg       0.93      0.93      0.93        59

Logistic Regression

In [39]:
from sklearn.linear_model import LogisticRegression # import Logistic Regression as classifier
In [40]:
model_logreg = LogisticRegression(solver='lbfgs', multi_class='auto') # configure the classifier
model_logreg.fit(X_train,y_train) # train on the training set
y_pred = model_logreg.predict(X_test) # predict on the test set
Accuracy Score
In [41]:
print(accuracy_score(y_test, y_pred))
0.9322033898305084
Confusion Matrix
In [42]:
print(confusion_matrix(y_test, y_pred))
[[18  0  0]
 [ 0 20  1]
 [ 0  3 17]]
Classification Report
In [43]:
print(classification_report(y_test, y_pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.87      0.95      0.91        21
 Iris-virginica       0.94      0.85      0.89        20

       accuracy                           0.93        59
      macro avg       0.94      0.93      0.93        59
   weighted avg       0.93      0.93      0.93        59
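
On some scikit-learn versions the lbfgs solver can hit its default iteration limit on unscaled features and emit a convergence warning. A common remedy, sketched here rather than used above, is to standardize the features in a pipeline and raise max_iter:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model_logreg_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)) # scale, then classify
model_logreg_scaled.fit(X_train, y_train)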

Support Vector Classifier

In [44]:
from sklearn.svm import SVC # import SVC as classifier
In [45]:
model_svc = SVC(gamma='scale') # configure the classifier ('scale' derives gamma from the data)
model_svc.fit(X_train,y_train) # train on the training set
y_pred = model_svc.predict(X_test) # predict on the test set
Accuracy Score
In [46]:
print(accuracy_score(y_test, y_pred))
0.9661016949152542
Confusion Matrix
In [47]:
print(confusion_matrix(y_test, y_pred))
[[18  0  0]
 [ 0 21  0]
 [ 0  2 18]]
Classification Report
In [48]:
print(classification_report(y_test, y_pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.91      1.00      0.95        21
 Iris-virginica       1.00      0.90      0.95        20

       accuracy                           0.97        59
      macro avg       0.97      0.97      0.97        59
   weighted avg       0.97      0.97      0.97        59

Decision Tree Classifier

In [49]:
from sklearn.tree import DecisionTreeClassifier # import Decision Tree Classifier as classifier
In [50]:
model_dt = DecisionTreeClassifier() # configure the classifier with default settings
model_dt.fit(X_train,y_train) # train on the training set
y_pred = model_dt.predict(X_test) # predict on the test set
Accuracy Score
In [51]:
print(accuracy_score(y_test, y_pred))
0.9491525423728814
Confusion Matrix
In [52]:
print(confusion_matrix(y_test, y_pred))
[[18  0  0]
 [ 0 20  1]
 [ 0  2 18]]
Classification Report
In [53]:
print(classification_report(y_test, y_pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.91      0.95      0.93        21
 Iris-virginica       0.95      0.90      0.92        20

       accuracy                           0.95        59
      macro avg       0.95      0.95      0.95        59
   weighted avg       0.95      0.95      0.95        59
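
A fitted tree can also be drawn, which makes the learned split thresholds inspectable. A sketch, assuming scikit-learn >= 0.21 for plot_tree:

from sklearn import tree

plt.figure(figsize=(12, 8))
tree.plot_tree(model_dt, feature_names=list(X.columns), class_names=list(model_dt.classes_), filled=True) # draw the tree
plt.show()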

Random Forest Classifier

In [54]:
from sklearn.ensemble import RandomForestClassifier # import Random Forest Classifier as classifier
In [55]:
model_rf = RandomForestClassifier(n_estimators=100) # n_estimators=100 builds 100 trees
model_rf.fit(X_train,y_train) # train on the training set
y_pred = model_rf.predict(X_test) # store predictions in y_pred so the metrics below evaluate the random forest
Accuracy Score
In [56]:
print(accuracy_score(y_test, y_pred))
0.9491525423728814
Confusion Matrix
In [57]:
print(confusion_matrix(y_test, y_pred))
[[18  0  0]
 [ 0 20  1]
 [ 0  2 18]]
Classification Report
In [58]:
print(classification_report(y_test, y_pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.91      0.95      0.93        21
 Iris-virginica       0.95      0.90      0.92        20

       accuracy                           0.95        59
      macro avg       0.95      0.95      0.95        59
   weighted avg       0.95      0.95      0.95        59
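
Random forests also expose per-feature importances, which show which measurements drive the predictions; a quick sketch (on Iris the petal measurements usually dominate):

pd.Series(model_rf.feature_importances_, index=X.columns).sort_values().plot.barh() # importance per feature
plt.tight_layout()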

Accuracy Comparison For Various Models

In [59]:
models = [model_knn, model_logreg, model_svc, model_dt, model_rf] # compare the models trained above
accuracy_scores = []
for model in models:
    y_pred = model.predict(X_test) # make prediction
    accuracy = accuracy_score(y_test, y_pred) # accuracy score
    accuracy_scores.append(accuracy)
    
print(accuracy_scores)
[0.9322033898305084, 0.9322033898305084, 0.9661016949152542, 0.9491525423728814, 0.9491525423728814]
In [60]:
plt.bar(['KNN', 'LogReg', 'SVC', 'DT', 'RF'], accuracy_scores) # compare the accuracy of the 5 models
plt.ylim(0.90,1.01)
plt.title('Accuracy Comparison For Various Models', fontsize=15, color='r')
plt.xlabel('Models', fontsize=18, color='g')
plt.ylabel('Accuracy Score', fontsize=18, color='g')
plt.tight_layout()
plt.show()
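
All five scores above come from the same single 40% hold-out, so small differences between models are not meaningful. A steadier comparison, sketched with 5-fold cross-validation over the full dataset:

from sklearn.model_selection import cross_val_score

for name, model in zip(['KNN', 'LogReg', 'SVC', 'DT', 'RF'], models):
    scores = cross_val_score(model, X, y, cv=5) # five accuracy scores per model
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')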

Study References

(In Indonesian)

kaggle 01 | Belajar Exploratory Data Analysis, Visualisasi Data, Klasifikasi | Machine Learning (English: Learning Exploratory Data Analysis, Data Visualization, Classification)