Dataset: Melbourne Housing

About the dataset: Melbourne Housing Market

Importing Modules

In [1]:
import pandas as pd # import the pandas module as pd

Load Dataset

In [2]:
housing_df = pd.read_csv('./dataset/melb_data.csv') # load the CSV file into a DataFrame
housing_df.head() # show the first 5 rows of data
Out[2]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
0 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 3/12/2016 2.5 3067.0 ... 1.0 1.0 202.0 NaN NaN Yarra -37.7996 144.9984 Northern Metropolitan 4019.0
1 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 4/02/2016 2.5 3067.0 ... 1.0 0.0 156.0 79.0 1900.0 Yarra -37.8079 144.9934 Northern Metropolitan 4019.0
2 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 4/03/2017 2.5 3067.0 ... 2.0 0.0 134.0 150.0 1900.0 Yarra -37.8093 144.9944 Northern Metropolitan 4019.0
3 Abbotsford 40 Federation La 3 h 850000.0 PI Biggin 4/03/2017 2.5 3067.0 ... 2.0 1.0 94.0 NaN NaN Yarra -37.7969 144.9969 Northern Metropolitan 4019.0
4 Abbotsford 55a Park St 4 h 1600000.0 VB Nelson 4/06/2016 2.5 3067.0 ... 1.0 2.0 120.0 142.0 2014.0 Yarra -37.8072 144.9941 Northern Metropolitan 4019.0

5 rows × 21 columns

Identify the Shape of the Dataset

In [3]:
housing_df.shape # show the dataset dimensions as (rows, columns)
Out[3]:
(13580, 21)

Get the List of Columns

In [4]:
housing_df.columns # list of columns
Out[4]:
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

Identify Data Types For Each Column

In [5]:
housing_df.dtypes # datatype for every column
Out[5]:
Suburb            object
Address           object
Rooms              int64
Type              object
Price            float64
Method            object
SellerG           object
Date              object
Distance         float64
Postcode         float64
Bedroom2         float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea     float64
YearBuilt        float64
CouncilArea       object
Lattitude        float64
Longtitude       float64
Regionname        object
Propertycount    float64
dtype: object

Get Basic Dataset Information

In [6]:
housing_df.info() # summary information about the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longtitude     13580 non-null  float64
 19  Regionname     13580 non-null  object 
 20  Propertycount  13580 non-null  float64
dtypes: float64(12), int64(1), object(8)
memory usage: 2.2+ MB

Identify Missing Values

In [7]:
housing_df.isna().values.any() # check whether the dataset contains any missing (NaN) values
Out[7]:
True
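
To see which columns contribute the missing values, a per-column count helps; a minimal sketch (not part of the original notebook):

housing_df.isna().sum().sort_values(ascending=False) # count missing values per column, largest first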

Identify Duplicate Entries/Rows

In [8]:
housing_df[housing_df.duplicated(keep=False)] # show every row that appears more than once
Out[8]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount

0 rows × 21 columns

In [9]:
housing_df.duplicated().value_counts() # count duplicated vs. unique rows
Out[9]:
False    13580
dtype: int64

Describe the Dataset (summary)

In [10]:
housing_df.describe() # summary statistics of the numeric columns
Out[10]:
Rooms Price Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Lattitude Longtitude Propertycount
count 13580.000000 1.358000e+04 13580.000000 13580.000000 13580.000000 13580.000000 13518.000000 13580.000000 7130.000000 8205.000000 13580.000000 13580.000000 13580.000000
mean 2.937997 1.075684e+06 10.137776 3105.301915 2.914728 1.534242 1.610075 558.416127 151.967650 1964.684217 -37.809203 144.995216 7454.417378
std 0.955748 6.393107e+05 5.868725 90.676964 0.965921 0.691712 0.962634 3990.669241 541.014538 37.273762 0.079260 0.103916 4378.581772
min 1.000000 8.500000e+04 0.000000 3000.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1196.000000 -38.182550 144.431810 249.000000
25% 2.000000 6.500000e+05 6.100000 3044.000000 2.000000 1.000000 1.000000 177.000000 93.000000 1940.000000 -37.856822 144.929600 4380.000000
50% 3.000000 9.030000e+05 9.200000 3084.000000 3.000000 1.000000 2.000000 440.000000 126.000000 1970.000000 -37.802355 145.000100 6555.000000
75% 3.000000 1.330000e+06 13.000000 3148.000000 3.000000 2.000000 2.000000 651.000000 174.000000 1999.000000 -37.756400 145.058305 10331.000000
max 10.000000 9.000000e+06 48.100000 3977.000000 20.000000 8.000000 10.000000 433014.000000 44515.000000 2018.000000 -37.408530 145.526350 21650.000000

Correlation Matrix

In [11]:
housing_df.corr() # pairwise correlation between numeric columns
Out[11]:
Rooms Price Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Lattitude Longtitude Propertycount
Rooms 1.000000 0.496634 0.294203 0.055303 0.944190 0.592934 0.408483 0.025678 0.124127 -0.065413 0.015948 0.100771 -0.081530
Price 0.496634 1.000000 -0.162522 0.107867 0.475951 0.467038 0.238979 0.037507 0.090981 -0.323617 -0.212934 0.203656 -0.042153
Distance 0.294203 -0.162522 1.000000 0.431514 0.295927 0.127155 0.262994 0.025004 0.099481 0.246379 -0.130723 0.239425 -0.054910
Postcode 0.055303 0.107867 0.431514 1.000000 0.060584 0.113664 0.050289 0.024558 0.055475 0.032863 -0.406104 0.445357 0.062304
Bedroom2 0.944190 0.475951 0.295927 0.060584 1.000000 0.584685 0.405325 0.025646 0.122319 -0.053319 0.015925 0.102238 -0.081350
Bathroom 0.592934 0.467038 0.127155 0.113664 0.584685 1.000000 0.322246 0.037130 0.111933 0.152702 -0.070594 0.118971 -0.052201
Car 0.408483 0.238979 0.262994 0.050289 0.405325 0.322246 1.000000 0.026770 0.096101 0.104515 -0.001963 0.063395 -0.024295
Landsize 0.025678 0.037507 0.025004 0.024558 0.025646 0.037130 0.026770 1.000000 0.500485 0.036451 0.009695 0.010833 -0.006854
BuildingArea 0.124127 0.090981 0.099481 0.055475 0.122319 0.111933 0.096101 0.500485 1.000000 0.019665 0.043420 -0.023810 -0.028840
YearBuilt -0.065413 -0.323617 0.246379 0.032863 -0.053319 0.152702 0.104515 0.036451 0.019665 1.000000 0.060445 -0.003470 0.006361
Lattitude 0.015948 -0.212934 -0.130723 -0.406104 0.015925 -0.070594 -0.001963 0.009695 0.043420 0.060445 1.000000 -0.357634 0.047086
Longtitude 0.100771 0.203656 0.239425 0.445357 0.102238 0.118971 0.063395 0.010833 -0.023810 -0.003470 -0.357634 1.000000 0.065988
Propertycount -0.081530 -0.042153 -0.054910 0.062304 -0.081350 -0.052201 -0.024295 -0.006854 -0.028840 0.006361 0.047086 0.065988 1.000000
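
Since the later sections predict Price, it can be handy to pull out only the correlations with Price, sorted by strength. A minimal sketch, not run in the original notebook (on pandas 2.0+ you may need to pass numeric_only=True to corr()):

housing_df.corr()['Price'].sort_values(ascending=False) # correlation of every numeric column with Price, strongest first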

Data Visualization

Import Modules

In [12]:
import matplotlib.pyplot as plt # import matplotlib.pyplot as plt for data visualization
import seaborn as sns # import seaborn as sns for data visualization

%matplotlib inline
# render plot output inline in the notebook

Heatmap

In [13]:
sns.heatmap(data=housing_df.corr()) # visualization using Heatmap
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bea2832460>
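
For a more readable heatmap, a larger figure and cell annotations can help. A sketch under the same imports as above (annot, fmt and cmap are standard seaborn.heatmap parameters; on pandas 2.0+ corr() may need numeric_only=True):

plt.figure(figsize=(10, 8)) # larger canvas so the axis labels fit
sns.heatmap(housing_df.corr(), annot=True, fmt='.2f', cmap='coolwarm') # write each correlation value inside its cell
plt.tight_layout()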

Bar Plot

In [14]:
housing_df['Rooms'].value_counts() # count listings for each number of Rooms
Out[14]:
3     5881
2     3648
4     2688
1      681
5      596
6       67
7       10
8        8
10       1
Name: Rooms, dtype: int64
In [15]:
sns.countplot(data=housing_df, x='Rooms') # visualization using Bar plot with seaborn (colorized)
plt.tight_layout()

Pie Chart

In [16]:
housing_df['Rooms'].value_counts().plot.pie(autopct='%1.1f%%', labels=None, legend=True) # pie chart with percentage labels and a legend
plt.tight_layout()

Scatter Plot

In [17]:
sns.scatterplot(x='Bedroom2', y='Car', data=housing_df, hue='Rooms') # visualization using Scatter Plot
plt.tight_layout()

Case: Find the Largest Land Size

In [18]:
housing_df.describe().loc['max', 'Landsize'] # look up the 'max' row first, then the 'Landsize' column
Out[18]:
433014.0
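
describe() only reports the maximum value itself; to see the full listing that holds it, idxmax can be used. A minimal sketch, not part of the original notebook:

housing_df.loc[housing_df['Landsize'].idxmax()] # full row of the property with the largest land size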

Data Cleaning

In [19]:
housing_df = housing_df.dropna() # drop rows containing missing (NaN) values
In [20]:
housing_df.shape
Out[20]:
(6196, 21)
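
Dropping every row with a missing value discards more than half of the dataset (13580 → 6196 rows). An alternative, not used in this notebook, is imputation; a sketch with scikit-learn's SimpleImputer, applied to a fresh copy of the data before any rows are dropped (raw_df and numeric_cols are names introduced here for illustration):

from sklearn.impute import SimpleImputer

raw_df = pd.read_csv('./dataset/melb_data.csv') # reload the original data, before dropna
numeric_cols = raw_df.select_dtypes(include='number').columns # impute only the numeric columns
imputer = SimpleImputer(strategy='median') # replace each NaN with the column median
raw_df[numeric_cols] = imputer.fit_transform(raw_df[numeric_cols])
raw_df.isna().sum().sum() # any remaining NaNs sit only in non-numeric columns such as CouncilArea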

Choosing Prediction Target

In [21]:
y = housing_df['Price'] # choose Price feature as prediction target then show it
y
Out[21]:
1        1035000.0
2        1465000.0
4        1600000.0
6        1876000.0
7        1636000.0
           ...    
12205     601000.0
12206    1050000.0
12207     385000.0
12209     560000.0
12212    2450000.0
Name: Price, Length: 6196, dtype: float64
In [22]:
y.describe()
Out[22]:
count    6.196000e+03
mean     1.068828e+06
std      6.751564e+05
min      1.310000e+05
25%      6.200000e+05
50%      8.800000e+05
75%      1.325000e+06
max      9.000000e+06
Name: Price, dtype: float64

Feature Selection

In [23]:
features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude'] # features used to predict the target
X = housing_df[features] # select Rooms, Bathroom, Landsize, Lattitude and Longtitude
X
Out[23]:
Rooms Bathroom Landsize Lattitude Longtitude
1 2 1.0 156.0 -37.80790 144.99340
2 3 2.0 134.0 -37.80930 144.99440
4 4 1.0 120.0 -37.80720 144.99410
6 3 2.0 245.0 -37.80240 144.99930
7 2 1.0 256.0 -37.80600 144.99540
... ... ... ... ... ...
12205 3 2.0 972.0 -37.51232 145.13282
12206 3 1.0 179.0 -37.86558 144.90474
12207 1 1.0 0.0 -37.85588 144.89936
12209 2 1.0 0.0 -37.85581 144.99025
12212 6 3.0 1087.0 -37.81038 144.89389

6196 rows × 5 columns

In [24]:
X.describe() # describing features
Out[24]:
Rooms Bathroom Landsize Lattitude Longtitude
count 6196.000000 6196.000000 6196.000000 6196.000000 6196.000000
mean 2.931407 1.576340 471.006940 -37.807904 144.990201
std 0.971079 0.711362 897.449881 0.075850 0.099165
min 1.000000 1.000000 0.000000 -38.164920 144.542370
25% 2.000000 1.000000 152.000000 -37.855438 144.926198
50% 3.000000 1.000000 373.000000 -37.802250 144.995800
75% 4.000000 2.000000 628.000000 -37.758200 145.052700
max 8.000000 8.000000 37000.000000 -37.457090 145.526350

Model Building

DecisionTreeRegressor

In [25]:
from sklearn.tree import DecisionTreeRegressor # import Decision Tree Regressor as Regressor

Model Configuration

In [26]:
housing_model = DecisionTreeRegressor(random_state=1)

Training Model

In [27]:
housing_model.fit(X, y)
Out[27]:
DecisionTreeRegressor(random_state=1)

Prediction

In [28]:
housing_model.predict(X.head()) # predict prices for the first 5 rows based on their features
Out[28]:
array([1035000., 1465000., 1600000., 1876000., 1636000.])
In [29]:
y.head() # compare with original value
Out[29]:
1    1035000.0
2    1465000.0
4    1600000.0
6    1876000.0
7    1636000.0
Name: Price, dtype: float64
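
A quick side-by-side view makes the comparison easier; a minimal sketch, not part of the original notebook:

pd.DataFrame({'actual': y.head(), 'predicted': housing_model.predict(X.head())}) # actual vs. predicted price for the first 5 rows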

Model Evaluation

Mean Absolute Error

In [30]:
from sklearn.metrics import mean_absolute_error # Evaluation Metric with MAE
In [31]:
y_hat = housing_model.predict(X) # predict house prices from the features in X
mean_absolute_error(y, y_hat) # compare the actual values with the predictions
Out[31]:
1115.7467183128902

This MAE is measured on the same data the model was trained on, so it is overly optimistic; a held-out test set gives a fairer estimate.

Train/Test Split Module

In [32]:
from sklearn.model_selection import train_test_split # utility to split the dataset into training and testing sets

Split Dataset Into Training and Testing Set

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

Config and Train Model

In [34]:
housing_model = DecisionTreeRegressor(random_state=1)
housing_model.fit(X_train, y_train)
Out[34]:
DecisionTreeRegressor(random_state=1)

Model Evaluation

In [35]:
y_hat = housing_model.predict(X_test) # predict on the test set
mean_absolute_error(y_test, y_hat) # MAE between y_test and y_hat
Out[35]:
251688.7630729503

Model Optimization

Function for Evaluation

In [36]:
def get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_hat)
    return mae

Create Decision Tree Models with Different max_leaf_nodes Values

Compare the MAE for each parameter value to choose the best one among them.
In [37]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    leaf_mae = get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test)
    print(f'Max leaf nodes: {max_leaf_nodes} \t Mean Absolute Error: {int(leaf_mae)}')
Max leaf nodes: 5 	 Mean Absolute Error: 369673
Max leaf nodes: 50 	 Mean Absolute Error: 266644
Max leaf nodes: 500 	 Mean Absolute Error: 243613
Max leaf nodes: 5000 	 Mean Absolute Error: 256227
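
The same comparison can be done programmatically by picking the value with the smallest MAE; a sketch reusing get_mae from above (candidate_nodes and scores are names introduced here for illustration):

candidate_nodes = [5, 50, 500, 5000]
scores = {n: get_mae(n, X_train, X_test, y_train, y_test) for n in candidate_nodes} # MAE for each candidate
best_leaf_nodes = min(scores, key=scores.get) # candidate with the lowest MAE
print(f'Best max_leaf_nodes: {best_leaf_nodes}')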

Random Forest Regressor Exploration

Importing RandomForestRegressor

In [38]:
from sklearn.ensemble import RandomForestRegressor # import Random Forest Regressor as Regressor
In [39]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=1) # n_estimators=100 means the ensemble uses 100 decision trees
rf_model.fit(X_train, y_train) # train the model on the training set
y_hat = rf_model.predict(X_test) # predict on the test set
print(f'Mean Absolute Error: {int(mean_absolute_error(y_test, y_hat))}') # MAE between y_hat & y_test
Mean Absolute Error: 190414
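
A single train/test split can be noisy; cross-validation gives a more stable estimate. A minimal sketch with scikit-learn's cross_val_score, not part of the original notebook (the neg_mean_absolute_error scorer returns negated values, hence the leading minus sign):

from sklearn.model_selection import cross_val_score

rf_model = RandomForestRegressor(n_estimators=100, random_state=1)
cv_scores = -cross_val_score(rf_model, X, y, cv=5, scoring='neg_mean_absolute_error') # 5-fold cross-validated MAE
print(f'Cross-validated MAE: {cv_scores.mean():.0f} +/- {cv_scores.std():.0f}')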

Study References

(In Indonesian)

kaggle 03 | Belajar Machine Learning Dasar dengan Python | Machine Learning Pemula Bahasa Indonesia