Dataset: Melbourne Housing

About the dataset: Melbourne Housing Market

Importing Modules

In [1]:
import pandas as pd # import the pandas module as pd

Load Dataset

In [2]:
housing_df = pd.read_csv('./dataset/melb_data.csv') # load the CSV file into a DataFrame
housing_df.head() # show the first 5 rows of data
Out[2]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
0 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 3/12/2016 2.5 3067.0 ... 1.0 1.0 202.0 NaN NaN Yarra -37.7996 144.9984 Northern Metropolitan 4019.0
1 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 4/02/2016 2.5 3067.0 ... 1.0 0.0 156.0 79.0 1900.0 Yarra -37.8079 144.9934 Northern Metropolitan 4019.0
2 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 4/03/2017 2.5 3067.0 ... 2.0 0.0 134.0 150.0 1900.0 Yarra -37.8093 144.9944 Northern Metropolitan 4019.0
3 Abbotsford 40 Federation La 3 h 850000.0 PI Biggin 4/03/2017 2.5 3067.0 ... 2.0 1.0 94.0 NaN NaN Yarra -37.7969 144.9969 Northern Metropolitan 4019.0
4 Abbotsford 55a Park St 4 h 1600000.0 VB Nelson 4/06/2016 2.5 3067.0 ... 1.0 2.0 120.0 142.0 2014.0 Yarra -37.8072 144.9941 Northern Metropolitan 4019.0

5 rows × 21 columns

Identify the Shape of the Dataset

In [3]:
housing_df.shape # show the dataset dimensions as (rows, columns)
Out[3]:
(13580, 21)

Get the List of Columns

In [4]:
housing_df.columns # list of columns
Out[4]:
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

Identify Data Types For Each Column

In [5]:
housing_df.dtypes # datatype for every column
Out[5]:
Suburb            object
Address           object
Rooms              int64
Type              object
Price            float64
Method            object
SellerG           object
Date              object
Distance         float64
Postcode         float64
Bedroom2         float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea     float64
YearBuilt        float64
CouncilArea       object
Lattitude        float64
Longtitude       float64
Regionname        object
Propertycount    float64
dtype: object

Get Basic Dataset Information

In [6]:
housing_df.info() # summary information about the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longtitude     13580 non-null  float64
 19  Regionname     13580 non-null  object 
 20  Propertycount  13580 non-null  float64
dtypes: float64(12), int64(1), object(8)
memory usage: 2.2+ MB

Identify Missing Values

In [7]:
housing_df.isna().values.any() # check whether the dataset contains any missing (NaN) values
Out[7]:
True
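
To see which columns contribute the missing values, a per-column count helps; a minimal sketch (not part of the original notebook):

housing_df.isna().sum().sort_values(ascending=False) # count missing values per column, largest first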

Identify Duplicate Entries/Rows

In [8]:
housing_df[housing_df.duplicated(keep=False)] # show every row that appears more than once
Out[8]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount

0 rows × 21 columns

In [9]:
housing_df.duplicated().value_counts() # count duplicated vs. unique rows
Out[9]:
False    13580
dtype: int64

Describe the Dataset (summary)

In [10]:
housing_df.describe() # summary statistics of the numeric columns
Out[10]:
Rooms Price Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Lattitude Longtitude Propertycount
count 13580.000000 1.358000e+04 13580.000000 13580.000000 13580.000000 13580.000000 13518.000000 13580.000000 7130.000000 8205.000000 13580.000000 13580.000000 13580.000000
mean 2.937997 1.075684e+06 10.137776 3105.301915 2.914728 1.534242 1.610075 558.416127 151.967650 1964.684217 -37.809203 144.995216 7454.417378
std 0.955748 6.393107e+05 5.868725 90.676964 0.965921 0.691712 0.962634 3990.669241 541.014538 37.273762 0.079260 0.103916 4378.581772
min 1.000000 8.500000e+04 0.000000 3000.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1196.000000 -38.182550 144.431810 249.000000
25% 2.000000 6.500000e+05 6.100000 3044.000000 2.000000 1.000000 1.000000 177.000000 93.000000 1940.000000 -37.856822 144.929600 4380.000000
50% 3.000000 9.030000e+05 9.200000 3084.000000 3.000000 1.000000 2.000000 440.000000 126.000000 1970.000000 -37.802355 145.000100 6555.000000
75% 3.000000 1.330000e+06 13.000000 3148.000000 3.000000 2.000000 2.000000 651.000000 174.000000 1999.000000 -37.756400 145.058305 10331.000000
max 10.000000 9.000000e+06 48.100000 3977.000000 20.000000 8.000000 10.000000 433014.000000 44515.000000 2018.000000 -37.408530 145.526350 21650.000000

Correlation Matrix

In [11]:
housing_df.corr() # pairwise correlation between numeric columns
Out[11]:
Rooms Price Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Lattitude Longtitude Propertycount
Rooms 1.000000 0.496634 0.294203 0.055303 0.944190 0.592934 0.408483 0.025678 0.124127 -0.065413 0.015948 0.100771 -0.081530
Price 0.496634 1.000000 -0.162522 0.107867 0.475951 0.467038 0.238979 0.037507 0.090981 -0.323617 -0.212934 0.203656 -0.042153
Distance 0.294203 -0.162522 1.000000 0.431514 0.295927 0.127155 0.262994 0.025004 0.099481 0.246379 -0.130723 0.239425 -0.054910
Postcode 0.055303 0.107867 0.431514 1.000000 0.060584 0.113664 0.050289 0.024558 0.055475 0.032863 -0.406104 0.445357 0.062304
Bedroom2 0.944190 0.475951 0.295927 0.060584 1.000000 0.584685 0.405325 0.025646 0.122319 -0.053319 0.015925 0.102238 -0.081350
Bathroom 0.592934 0.467038 0.127155 0.113664 0.584685 1.000000 0.322246 0.037130 0.111933 0.152702 -0.070594 0.118971 -0.052201
Car 0.408483 0.238979 0.262994 0.050289 0.405325 0.322246 1.000000 0.026770 0.096101 0.104515 -0.001963 0.063395 -0.024295
Landsize 0.025678 0.037507 0.025004 0.024558 0.025646 0.037130 0.026770 1.000000 0.500485 0.036451 0.009695 0.010833 -0.006854
BuildingArea 0.124127 0.090981 0.099481 0.055475 0.122319 0.111933 0.096101 0.500485 1.000000 0.019665 0.043420 -0.023810 -0.028840
YearBuilt -0.065413 -0.323617 0.246379 0.032863 -0.053319 0.152702 0.104515 0.036451 0.019665 1.000000 0.060445 -0.003470 0.006361
Lattitude 0.015948 -0.212934 -0.130723 -0.406104 0.015925 -0.070594 -0.001963 0.009695 0.043420 0.060445 1.000000 -0.357634 0.047086
Longtitude 0.100771 0.203656 0.239425 0.445357 0.102238 0.118971 0.063395 0.010833 -0.023810 -0.003470 -0.357634 1.000000 0.065988
Propertycount -0.081530 -0.042153 -0.054910 0.062304 -0.081350 -0.052201 -0.024295 -0.006854 -0.028840 0.006361 0.047086 0.065988 1.000000
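
Since the later sections predict Price, it can be handy to pull out only the correlations with Price, sorted by strength. A minimal sketch, not run in the original notebook (on pandas 2.0+ you may need to pass numeric_only=True to corr()):

housing_df.corr()['Price'].sort_values(ascending=False) # correlation of every numeric column with Price, strongest first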

Data Visualization

Import Modules

In [12]:
import matplotlib.pyplot as plt # import matplotlib.pyplot as plt for data visualization
import seaborn as sns # import seaborn as sns for data visualization

%matplotlib inline
# render plot output inline in the notebook

Heatmap

In [13]:
sns.heatmap(data=housing_df.corr()) # visualization using Heatmap
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bea2832460>
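
For a more readable heatmap, a larger figure and cell annotations can help. A sketch under the same imports as above (annot, fmt and cmap are standard seaborn.heatmap parameters; on pandas 2.0+ corr() may need numeric_only=True):

plt.figure(figsize=(10, 8)) # larger canvas so the axis labels fit
sns.heatmap(housing_df.corr(), annot=True, fmt='.2f', cmap='coolwarm') # write each correlation value inside its cell
plt.tight_layout()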

Bar Plot

In [14]:
housing_df['Rooms'].value_counts() # count listings for each number of Rooms
Out[14]:
3     5881
2     3648
4     2688
1      681
5      596
6       67
7       10
8        8
10       1
Name: Rooms, dtype: int64
In [15]:
sns.countplot(data=housing_df, x='Rooms') # visualization using Bar plot with seaborn (colorized)
plt.tight_layout()

Pie Chart

In [16]:
housing_df['Rooms'].value_counts().plot.pie(autopct='%1.1f%%', labels=None, legend=True) # pie chart with percentage labels and a legend
plt.tight_layout()

Scatter Plot

In [17]:
sns.scatterplot(x='Bedroom2', y='Car', data=housing_df, hue='Rooms') # visualization using Scatter Plot
plt.tight_layout()

Case: Find the Largest Land Size

In [18]:
housing_df.describe().loc['max', 'Landsize'] # look up the 'max' row first, then the 'Landsize' column
Out[18]:
433014.0
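
describe() only reports the maximum value itself; to see the full listing that holds it, idxmax can be used. A minimal sketch, not part of the original notebook:

housing_df.loc[housing_df['Landsize'].idxmax()] # full row of the property with the largest land size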

Data Cleaning

In [19]:
housing_df = housing_df.dropna() # drop rows containing missing (NaN) values
In [20]:
housing_df.shape
Out[20]:
(6196, 21)
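
Dropping every row with a missing value discards more than half of the dataset (13580 → 6196 rows). An alternative, not used in this notebook, is imputation; a sketch with scikit-learn's SimpleImputer, applied to a fresh copy of the data before any rows are dropped (raw_df and numeric_cols are names introduced here for illustration):

from sklearn.impute import SimpleImputer

raw_df = pd.read_csv('./dataset/melb_data.csv') # reload the original data, before dropna
numeric_cols = raw_df.select_dtypes(include='number').columns # impute only the numeric columns
imputer = SimpleImputer(strategy='median') # replace each NaN with the column median
raw_df[numeric_cols] = imputer.fit_transform(raw_df[numeric_cols])
raw_df.isna().sum().sum() # any remaining NaNs sit only in non-numeric columns such as CouncilArea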

Choosing Prediction Target

In [21]:
y = housing_df['Price'] # choose Price feature as prediction target then show it
y
Out[21]:
1        1035000.0
2        1465000.0
4        1600000.0
6        1876000.0
7        1636000.0
           ...    
12205     601000.0
12206    1050000.0
12207     385000.0
12209     560000.0
12212    2450000.0
Name: Price, Length: 6196, dtype: float64
In [22]:
y.describe()
Out[22]:
count    6.196000e+03
mean     1.068828e+06
std      6.751564e+05
min      1.310000e+05
25%      6.200000e+05
50%      8.800000e+05
75%      1.325000e+06
max      9.000000e+06
Name: Price, dtype: float64

Feature Selection

In [23]:
features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude'] # features used to predict the target
X = housing_df[features] # select Rooms, Bathroom, Landsize, Lattitude and Longtitude
X
Out[23]:
Rooms Bathroom Landsize Lattitude Longtitude
1 2 1.0 156.0 -37.80790 144.99340
2 3 2.0 134.0 -37.80930 144.99440
4 4 1.0 120.0 -37.80720 144.99410
6 3 2.0 245.0 -37.80240 144.99930
7 2 1.0 256.0 -37.80600 144.99540
... ... ... ... ... ...
12205 3 2.0 972.0 -37.51232 145.13282
12206 3 1.0 179.0 -37.86558 144.90474
12207 1 1.0 0.0 -37.85588 144.89936
12209 2 1.0 0.0 -37.85581 144.99025
12212 6 3.0 1087.0 -37.81038 144.89389

6196 rows × 5 columns

In [24]:
X.describe() # describing features
Out[24]:
Rooms Bathroom Landsize Lattitude Longtitude
count 6196.000000 6196.000000 6196.000000 6196.000000 6196.000000
mean 2.931407 1.576340 471.006940 -37.807904 144.990201
std 0.971079 0.711362 897.449881 0.075850 0.099165
min 1.000000 1.000000 0.000000 -38.164920 144.542370
25% 2.000000 1.000000 152.000000 -37.855438 144.926198
50% 3.000000 1.000000 373.000000 -37.802250 144.995800
75% 4.000000 2.000000 628.000000 -37.758200 145.052700
max 8.000000 8.000000 37000.000000 -37.457090 145.526350

Model Building

DecisionTreeRegressor

In [25]:
from sklearn.tree import DecisionTreeRegressor # import Decision Tree Regressor as Regressor

Model Configuration

In [26]:
housing_model = DecisionTreeRegressor(random_state=1)

Training Model

In [27]:
housing_model.fit(X, y)
Out[27]:
DecisionTreeRegressor(random_state=1)

Prediction

In [28]:
housing_model.predict(X.head()) # predict prices for the first 5 rows based on their features
Out[28]:
array([1035000., 1465000., 1600000., 1876000., 1636000.])
In [29]:
y.head() # compare with original value
Out[29]:
1    1035000.0
2    1465000.0
4    1600000.0
6    1876000.0
7    1636000.0
Name: Price, dtype: float64
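
A quick side-by-side view makes the comparison easier; a minimal sketch, not part of the original notebook:

pd.DataFrame({'actual': y.head(), 'predicted': housing_model.predict(X.head())}) # actual vs. predicted price for the first 5 rows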

Model Evaluation

Mean Absolute Error

In [30]:
from sklearn.metrics import mean_absolute_error # Evaluation Metric with MAE
In [31]:
y_hat = housing_model.predict(X) # predict house prices from the features in X
mean_absolute_error(y, y_hat) # compare the actual values with the predictions
Out[31]:
1115.7467183128902

This MAE is measured on the same data the model was trained on, so it is overly optimistic; a held-out test set gives a fairer estimate.

Train/Test Split Module

In [32]:
from sklearn.model_selection import train_test_split # utility to split the dataset into training and testing sets

Split Dataset Into Training and Testing Set

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

Config and Train Model

In [34]:
housing_model = DecisionTreeRegressor(random_state=1)
housing_model.fit(X_train, y_train)
Out[34]:
DecisionTreeRegressor(random_state=1)

Model Evaluation

In [35]:
y_hat = housing_model.predict(X_test) # predict on the test set
mean_absolute_error(y_test, y_hat) # MAE between y_test and y_hat
Out[35]:
251688.7630729503

Model Optimization

Function for Evaluation

In [36]:
def get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_hat)
    return mae

Create Decision Tree Models with Different max_leaf_nodes Values

Compare the MAE for each parameter value to choose the best one among them.
In [37]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    leaf_mae = get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test)
    print(f'Max leaf nodes: {max_leaf_nodes} \t Mean Absolute Error: {int(leaf_mae)}')
Max leaf nodes: 5 	 Mean Absolute Error: 369673
Max leaf nodes: 50 	 Mean Absolute Error: 266644
Max leaf nodes: 500 	 Mean Absolute Error: 243613
Max leaf nodes: 5000 	 Mean Absolute Error: 256227
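
The same comparison can be done programmatically by picking the value with the smallest MAE; a sketch reusing get_mae from above (candidate_nodes and scores are names introduced here for illustration):

candidate_nodes = [5, 50, 500, 5000]
scores = {n: get_mae(n, X_train, X_test, y_train, y_test) for n in candidate_nodes} # MAE for each candidate
best_leaf_nodes = min(scores, key=scores.get) # candidate with the lowest MAE
print(f'Best max_leaf_nodes: {best_leaf_nodes}')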

Random Forest Regressor Exploration

Importing RandomForestRegressor

In [38]:
from sklearn.ensemble import RandomForestRegressor # import Random Forest Regressor as Regressor
In [39]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=1) # n_estimators=100 means the ensemble uses 100 decision trees
rf_model.fit(X_train, y_train) # train the model on the training set
y_hat = rf_model.predict(X_test) # predict on the test set
print(f'Mean Absolute Error: {int(mean_absolute_error(y_test, y_hat))}') # MAE between y_hat & y_test
Mean Absolute Error: 190414
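
A single train/test split can be noisy; cross-validation gives a more stable estimate. A minimal sketch with scikit-learn's cross_val_score, not part of the original notebook (the neg_mean_absolute_error scorer returns negated values, hence the leading minus sign):

from sklearn.model_selection import cross_val_score

rf_model = RandomForestRegressor(n_estimators=100, random_state=1)
cv_scores = -cross_val_score(rf_model, X, y, cv=5, scoring='neg_mean_absolute_error') # 5-fold cross-validated MAE
print(f'Cross-validated MAE: {cv_scores.mean():.0f} +/- {cv_scores.std():.0f}')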

Study References

(In Indonesian)

kaggle 03 | Belajar Machine Learning Dasar dengan Python | Machine Learning Pemula Bahasa Indonesia