From data source to data analysis, my data science journey to complete my thesis for my Bachelor of Science (Real Estate). [ Part 2]

Joel K
5 min readMay 6, 2021

Welcome back! For this part, I will continue very briefly on data cleaning and then move on to linear regression.

Data Cleaning

The combined data had some abnormalities I had to clean. For example, in the rental data set there may be transactions that do not have property information. This was because the prop_info data set only contained blocks which was sold before (because I needed the floor area data and had to remove flats that were not sold before as they did not have the floor area data). Hence, for the rental data set, I removed the blocks without property information. It was important for both the resale and rental data to have the same HDB blocks as I was comparing the 2 models. Afterwards, I cross check the resale dataset to remove blocks that did not have rental data transactions. There were also a few blocks that the HDB map service had errors retrieving the data which were also omitted from the study. In total, 1,647 HDB blocks were included in the study with 8,021 rental transactions and 4,536 resale transactions.

Figure 8: Final Resale Data Set
Figure 9: Final Rental Data Set

Multiple Linear Regression

After getting the final data set, I did a regression on R with proximity as my testing variable. To ensure the results were robust I tested my data against the linear regression assumptions.

Assumption 1: Normality of errors

I tested this by visualizing the histogram of the errors. If the histogram is normally distributed, assumption 1 is met.

#normality of errorshist(residuals(key_vari_price_eqn_1)) #normality of errors
hist(residuals(key_vari_price_eqn_2)) #normality of errors
hist(residuals(key_vari_price_eqn_3)) #normality of errors
hist(residuals(key_vari_price_eqn_4)) #normality of errors
boxplot(residuals(key_vari_price_eqn_1)) #normality of errors
Figure 10: Histogram of errors. As seen, the residuals are normally distributed.

Assumption 2: Homoscedasticity

Homoscedasticity means the variance of all observation in the data are the same. You can test this by examining the scale-location plot. A horizontal line with equally spread out points should be what you hope to see . For my case, I did not manage to get the horizontal line. My residuals are clustered at the early part of the graph.

Figure 11: Residual spread
#homoscedasity 
plot(key_vari_price_eqn_1)
plot(key_vari_price_eqn_4)
plot(key_vari_price_eqn_7)
plot(key_vari_price_eqn_9)

Assumption 3: Linearity

Test this by drawing a scatterplot of the y factor with each x factor. If variables have a general linear shape, the assumption is met. It is however not realistic to get every factor to be linearly correlated, and most dissertations did not address this assumption, so I skipped this step.

Assumption 4: Multicollinearity

Calculate the generalised variance inflation factor (GVIF) for each variable to find out if there was multicollinearity. Values that are below three are typically regarded as quite ideal.

Figure 12: GVIF values

My values were ideal.

Assumption 5: Independence of errors

The Durbin Watson statistic test can be done to find out if the residuals were independent. A value of two indicates no correlation, below two indicates a positive correlation, and above two indicates a negative correlation. Usually, it is hard to get exact 2. Some sources say a value of 1.5 to 2.5 is ok. My result did not meet those requirements with a value of 1.058.

Figure 13: One of my Durbin Watson test values

To find out more about these assumptions and how to test them, here are the resources that I looked at. As you can see, my data did not meet some of the assumptions. However, I did not have time to statistically modify my data to fit the assumptions. I just note the unmet assumptions as limitations in my thesis.

After testing the assumptions, I went to do the regression. Since my testing variable was split into distance and travel time, I had one model for each variable.

Below is the result for the resale distance model. I will not show all the models as you can get the general idea with one model.

Figure 14: Regression table for Resale Distance Model

I hypothesize that the effect on rent and price varies with proximity, and hence I split the distance and travel time into different bands. I used excel to split the variables into their bands ( 100m, 200m, 300m bands etc).

The result for my distance band model is shown below.

Figure 15: Regression table for Resale Distance Band Model

Do note that the bands are dummy variables (they are used when your variable are not quantitative but qualitative, for eg male and female) hence, I coded the distance bands as “nearest_dist_band_indus_300.f”.

This is my code for the entire regression process.

Interpretation of Findings

If you are interested in the details, you can find my thesis here. But in short, my study found that the effect of proximity of industrial areas on the HDB resale and rental market follows an n-shaped curve. For example, if you see in figure 16, the value of -1.08% means that the rent for HDB flats that are within 0–4 minutes travel time to nearest industrial area is 1.08% cheaper than flats that are more than 24 minutes away from the nearest industrial area (the ‘control’ dataset as we assume that flats so far away from industrial area would not have their values affected by industrial areas). This value rises to 7.3% in the 8–12 minute time band and then falls again to 2.74% in the 20–24 minute time band.

Figure 16: Summary of results for Rental Market Time Band Model
Figure 17: Summary of results for Resale Market Time Band Model

Conclusion

Congratz you have reached the end of the article! If you find yourself quite overwhelmed it is alright, the process is not easy and it took me months of work. I was also further supported with knowledge in basic biz analytics and coding modules which I took up as electives. They really helped me with the data scrapping scripts and data cleaning. If you are interested in polishing up such skills, I would recommend taking BT1101, CS1010s. The modules are a lot heavier than real estate modules, but I believe it will value add to your learning and growth a lot. If you have any queries, post them in the comments, I will try to help as much as possible! And if you found the article useful, I appreciate a clap😊

--

--

Joel K

Interested all things real estate, social work and tech. And maybe a little bit of theatre too.