From data source to data analysis: my data science journey to complete my thesis for my Bachelor of Science (Real Estate). [Part 1]

Joel K
10 min read · May 6, 2021

Introduction

The purpose of this article is twofold: first, as a personal archive of how I completed my thesis, and second, to aid RE juniors who are starting their thesis and are not sure where to begin. I was once like that, and I thought a documentation of a senior's work would be useful.

For my thesis, I chose a simple topic: the impact of proximity to industrial areas on HDB rents and resale prices. My hypothesis was that proximity to industrial areas commands a premium for HDB rents (because the HDB rental market is largely made up of low-to-medium-skilled workers working in factories) and a discount for HDB resale prices (because factories generate more traffic, smoke and smell). The data needed for this study was broadly:

  1. Key attributes of HDB resale/rent transactions (age of HDB block, number of rooms, proximity to CBD) to act as control variables.
  2. Resale prices and rents
  3. Proximity of HDB blocks to industrial areas as the testing variable. Proximity in my study was defined as:

a. The shortest distance from an HDB block to the closest industrial block

b. The fastest travel time from an HDB block to the closest industrial block

Just a disclaimer that this documentation is not a complete reflection of my work process, as new data came up along the way and my thesis direction was also modified. So while the steps below are a rough representation of my data collection process, I often went back and forth between them.

Tools

R with RStudio — RStudio is a free IDE for R (a programming language). I used this mainly for linear regression. Another option is Stata, which we were taught during RE2801.

Python — This was used to web-scrape data such as HDB rents from the HDB open rental market.

Tableau — I used Tableau for data visualisation. It makes beautiful graphs and allows you to visualise location data easily. Some friends used Excel for their graphs and GIS for visualising location data. Students get a free one-year trial of the software, so do try it out if you like!

You may notice the tools I used are not typical; I deliberately chose these unfamiliar tools so I could hone my skills in them while doing my work. So don't feel too pressured if you feel lost halfway through the article. This is quite normal, as our course doesn't really teach us how to use these tools, and I had to self-learn many of these things.

Methodology

At the start, I attempted a geographically weighted regression (GWR). However, I found it unmanageable because I did not have enough time to understand the statistics well enough to execute it properly. If you have the time, do try it out, as it provides a more robust model than a multiple linear regression by correcting for spatial autocorrelation of variables. If you are interested in GWR, there are plenty of resources online.

I ended up doing a multiple linear regression.

Data Collection

All the data is available from public websites, mainly HDB Map Services, SingStat, OneMap and data.gov.sg. This is the general flow of data collection.

Figure 1: Overview of data collection

In case you are wondering why we need latitude and longitude data, they will be needed to generate proximity data. Now let us jump right in!

Step 1

Postal Codes

There is no official site that provides a collated list of postal codes of all HDB blocks in Singapore; however, the addresses of all HDB blocks are available on data.gov.sg. I retrieved the addresses from there and then used the SingPost website to get the postal codes, running a Python script to iterate through the addresses. Do note that the dataset providing the HDB blocks' addresses also contains useful information about the blocks, such as the year of completion and the town each block is located in. We will use this data in step 3.

For those who have no experience in web scraping, no worries, there are many articles online that can teach you. A few key phrases you can search for are 'web scrape with selenium' and 'web scraper with python'; that was where I started.
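To give a flavour of the selenium approach, here is a minimal sketch. The SingPost URL, the page element locators and the column names of the address file are placeholders I made up for illustration, so you will need to inspect the actual page and adapt them.

```python
# Minimal sketch: look up a postal code for each HDB address with selenium.
# NOTE: the URL and element locators below are placeholders/assumptions;
# inspect the actual page you are scraping and update them accordingly.
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

addresses = pd.read_csv("hdb_property_information.csv")          # from data.gov.sg
addresses["address"] = addresses["blk_no"].astype(str) + " " + addresses["street"]

driver = webdriver.Chrome()
postal_codes = []

for addr in addresses["address"]:
    driver.get("https://www.singpost.com/find-postal-code")      # placeholder URL
    search_box = driver.find_element(By.NAME, "address")         # placeholder locator
    search_box.clear()
    search_box.send_keys(addr)
    search_box.submit()
    time.sleep(1)                                                # be polite to the site
    result = driver.find_element(By.CLASS_NAME, "postal-code")   # placeholder locator
    postal_codes.append(result.text.strip())

driver.quit()
addresses["postal_code"] = postal_codes
addresses.to_csv("hdb_addresses_with_postal.csv", index=False)
```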

Alternatively, there are data science projects online on HDB prices that have made their postal code data publicly available. I used one of these projects to cross-check my data.

After getting all the postal code data, I merged it with the HDB property information dataset, which was the dataset that provided all the addresses of the HDB blocks. Let us call the collated dataset prop_info. Subsequently, I will be merging more columns of data into prop_info.

The dataset was huge, with over 10,000 entries. As I only looked at HDB blocks in the west region, I kept only entries of blocks in the western region. This can be done by filtering the data to keep only HDB blocks located in towns of the western region, for example by converting the data into a table and filtering on the 'Towns' header.

Figure 2: Prop_info dataset with postal codes
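If you prefer to do this merge-and-filter step in code, a minimal pandas sketch could look like the following. The file names, column names (blk_no, street, bldg_contract_town) and the list of western-region towns are assumptions to adapt to your own data.

```python
import pandas as pd

# Merge the scraped postal codes back into the HDB property information dataset.
prop_info = pd.read_csv("hdb_property_information.csv")
postals = pd.read_csv("hdb_addresses_with_postal.csv")

prop_info["address"] = prop_info["blk_no"].astype(str) + " " + prop_info["street"]
prop_info = prop_info.merge(postals[["address", "postal_code"]],
                            on="address", how="left")

# Keep only blocks in western-region towns (edit the list to suit your study area).
western_towns = ["BUKIT BATOK", "BUKIT PANJANG", "CHOA CHU KANG",
                 "CLEMENTI", "JURONG EAST", "JURONG WEST"]
prop_info = prop_info[prop_info["bldg_contract_town"].isin(western_towns)]
```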

Step 2

Latitude and Longitude data

Latitude and longitude data can be obtained from OneMap using postal codes. Because the volume of data is large, I ran a script to query OneMap and generate the latitude and longitude data, using a Python wrapper from GitHub. Alternatively, GIS can be used.

Figure 3: Lat and long data with postal code

After getting the latitude and longitude, I added them into the prop_info dataset.
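Here is a minimal sketch of what such a script could look like, using OneMap's public search endpoint directly rather than the wrapper. The endpoint URL and response fields reflect the API as documented around 2021, and the file and column names are assumptions, so check the current OneMap documentation before using this.

```python
import requests
import pandas as pd

# Query OneMap's public search endpoint with a postal code to get lat/long.
# Endpoint URL and response fields are as I understand the API circa 2021.
SEARCH_URL = "https://developers.onemap.sg/commonapi/search"

def get_lat_lng(postal_code):
    params = {"searchVal": postal_code, "returnGeom": "Y", "getAddrDetails": "Y"}
    resp = requests.get(SEARCH_URL, params=params, timeout=10)
    results = resp.json().get("results", [])
    if not results:
        return None, None
    return float(results[0]["LATITUDE"]), float(results[0]["LONGITUDE"])

# prop_info from the earlier step; pad postal codes to 6 digits before querying.
coords = [get_lat_lng(str(p).zfill(6)) for p in prop_info["postal_code"]]
prop_info["lat"] = [c[0] for c in coords]
prop_info["lng"] = [c[1] for c in coords]
prop_info.to_csv("prop_info_with_coords.csv", index=False)
```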

Step 3

Key Attributes

To analyse real estate prices, we need the attribute values of both the testing variables (in my case, proximity to industrial areas) and the control variables (floor area, proximity to MRT, proximity to shopping centres, etc.) so that the effect of the testing variable can be isolated. There are many possible control variables, and for real estate it is often impossible to find every single one. The advice I was given is to pick the ones highlighted as significant in your literature review.

I took reference from a study by Dr Diao Mi titled 'A Big Data–Based Geographically Weighted Regression Model for Public Housing Prices: A Case Study in Singapore', which found that age, floor area of the housing unit, distance to the nearest park, distance to the CBD and distance to the nearest MRT station are significant factors affecting public housing resale prices. Age is already in the prop_info dataset, and distance to the CBD can be proxied by the HDB town. Hence all I needed to find were the floor area, distance to the nearest park and distance to the nearest MRT. Together with my testing variable, there were a total of four key attributes. We will tackle each attribute one by one.

1) Proximity of HDB block to industrial buildings ( testing variable)

To obtain proximity information such as the distance and travel time, the location of the industrial buildings needs to be identified first.

Location of industrial buildings

URA Space provides a map of land parcels allocated for industrial use. From these allocated spaces, I used Google Maps to identify the industrial blocks sited on the land parcels, since land zoned for industrial use is not necessarily in use yet. As I was only looking at the shortest distance or fastest travel time, I only identified the peripheral industrial blocks. These peripheral blocks' postal codes were collected, and their latitudes and longitudes were then obtained via Google or the OneMap script.

Figure 4: URA Master Plan 2019, western region
Figure 5: Close up on Google Maps of the industrial blocks
Figure 6: Spatial Visualization of Industrial and HDB blocks used in the study on Tableau

Shortest distance from HDB block to the nearest surrounding industrial block

To find the shortest distance, the haversine formula was used, which calculates the straight-line distance between two places given their latitude and longitude coordinates (I will refer to this as the Euclidean distance, ED).

Figure 7: Haversine formula (do not worry about the formula; you can just let the computer calculate it for you)

To identify the shortest ED, the ED between each HDB block and all the surrounding industrial buildings had to be calculated and compared. You can do this in two ways:

  1. Iterate each HDB block against only the peripheral industrial blocks that are likely to give the shortest ED. This can be done by segregating the industrial blocks into regions and only comparing the HDB block against the nearby regions.
  2. Iterate each HDB block against every peripheral industrial block. This takes longer, but if your iteration is fast this is not an issue. I used method 2 for distance, as distance computation is very fast, and method 1 for travel time, since that computation takes longer.

After getting the shortest distance, record the value as a variable for the HDB block. I repeated this for all HDB blocks used in my study, again using a Python script to iterate through all the entries.
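Here is a minimal sketch of method 2 in Python. The haversine_km() helper implements the formula in Figure 7; the industrial-block file and its column names (lat, lng) are assumptions, and the sketch continues from the prop_info dataset built in the earlier steps.

```python
import math
import pandas as pd

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle (straight-line) distance in km between two coordinate pairs."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Brute-force method 2: compare every HDB block against every peripheral
# industrial block and keep the minimum distance.
industrial = pd.read_csv("industrial_blocks.csv")   # hypothetical: postal_code, lat, lng

prop_info["shortest_ed_to_industrial_km"] = [
    min(haversine_km(blk.lat, blk.lng, ind.lat, ind.lng)
        for ind in industrial.itertuples())
    for blk in prop_info.itertuples()
]
```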

Fastest travel time from HDB block to surrounding industrial block

To find the fastest travel time, I used the OneMap API with the Python wrapper. The API can calculate the travel time between two postal codes and accounts for the date and time of travel. To simulate travel time during peak hours, I used the traffic conditions of a Monday at 7.30am.

I used method 1 here, iterating through regions of industrial blocks rather than through all of them.
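For illustration, here is a rough sketch of a travel-time query. The endpoint URL, parameter names, date format and response fields reflect my understanding of the OneMap routing API around 2021 and should be treated as assumptions; check the current OneMap documentation (or the Python wrapper's own docs) before relying on this.

```python
import requests

# Sketch of querying OneMap's routing service for public-transport travel time.
# Endpoint, parameters and response structure are assumptions based on the API
# as documented around 2021 (a registered API token is required).
ROUTE_URL = "https://developers.onemap.sg/privateapi/routingsvc/route"
TOKEN = "your-onemap-api-token"   # obtained from the OneMap developer portal

def pt_travel_time_minutes(start_latlng, end_latlng):
    params = {
        "start": start_latlng,        # e.g. "1.3521,103.7064"
        "end": end_latlng,
        "routeType": "pt",            # public transport
        "mode": "TRANSIT",
        "date": "05-03-2021",         # a Monday; check the expected date format in the docs
        "time": "07:30:00",           # 7.30am peak-hour conditions
        "token": TOKEN,
    }
    resp = requests.get(ROUTE_URL, params=params, timeout=30)
    itineraries = resp.json().get("plan", {}).get("itineraries", [])
    if not itineraries:
        return None
    return itineraries[0]["duration"] / 60   # seconds -> minutes
```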

2) Travel Time to MRT

The travel time to the MRT can similarly be obtained through the OneMap API. A friend gave me a file of MRT station postal codes; you can also obtain them by googling.

3) ED to Park

Similarly, the park ED can be obtained through the haversine formula. I collected park location data from the NParks website. As there are not many parks in the western region, I manually picked the parks instead of using a script. Their latitudes and longitudes can similarly be found on Google. Below is a sketch of the ED calculation using the haversine formula.
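This reuses the haversine_km() helper and prop_info from the industrial-block snippet above; the park file and its columns are assumptions.

```python
import pandas as pd

# ED from each HDB block to the nearest park (park coordinates collected manually).
parks = pd.read_csv("western_parks.csv")   # hypothetical: name, lat, lng

prop_info["ed_to_nearest_park_km"] = [
    min(haversine_km(blk.lat, blk.lng, p.lat, p.lng) for p in parks.itertuples())
    for blk in prop_info.itertuples()
]
```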

4) Floor area

Floor area data was obtained from data.gov.sg. However, it only contains floor area data for HDB flats that have been sold, so some blocks' floor areas were not obtainable. I removed those blocks from my dataset.

Next, the floor area data needs to be merged with prop_info. As the floor area data is organised by transaction and not by block, it is not as easy to merge. Furthermore, every block has different types of flats (2 room, 3 room, 4 room, etc.), each with a different floor area.

Below are the steps I took to obtain the floor area data.

Firstly, in prop_info I created six extra columns, one for each possible flat type (1 room, 2 room, 3 room, 4 room, 5 room, executive).

I iterated the postal codes of prop_info against the resale dataset's postal codes to find a match. When there was a match, the code checked the flat type of the transaction and put that transaction's floor area into the column matching the flat type. This was then repeated for the next transaction.

I do not have the exact code that followed the steps mentioned, as my research process was a lot messier. Roughly, the code first iterates through prop_info to put postal code information into the resale dataset, then iterates through the data a few more times to add the floor area, flat model and lease start date to prop_info.
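As an illustration, here is a pandas sketch of that matching logic. The column names follow the data.gov.sg resale dataset (block, street_name, flat_type, floor_area_sqm) but should be treated as assumptions, and instead of iterating row by row as my script did, this version takes the median floor area per block and flat type.

```python
import pandas as pd

# Load the resale transactions and build an address key matching prop_info's addresses.
resale = pd.read_csv("resale_flat_prices.csv")
resale["address"] = resale["block"].astype(str) + " " + resale["street_name"]

# 1) Copy postal codes from prop_info into the resale dataset via the address.
postal_lookup = dict(zip(prop_info["address"], prop_info["postal_code"]))
resale["postal_code"] = resale["address"].map(postal_lookup)

# 2) For each flat type, record a representative floor area per block in prop_info.
flat_types = ["1 ROOM", "2 ROOM", "3 ROOM", "4 ROOM", "5 ROOM", "EXECUTIVE"]
for ft in flat_types:
    area_by_block = (resale.loc[resale["flat_type"] == ft]
                           .groupby("postal_code")["floor_area_sqm"]
                           .median())
    col = "floor_area_" + ft.lower().replace(" ", "_")
    prop_info[col] = prop_info["postal_code"].map(area_by_block)
```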

Now let’s move on to merging the resale and rental data with prop_info!

Step 4

Resale and Rental Data

HDB resale data can be obtained from data.gov.

HDB rental data was collected from HDB Map Services through a web scraping script.

After getting the resale and rental data, I combined prop_info with the HDB resale and rental datasets. This was done by iterating through the postal code of each resale/rent transaction and finding its match in prop_info. When there was a match, the script copied the whole row of information from prop_info over to the transaction.

My script was in two parts: the first copied over the bulk of the information, and the second cleaned the data for easier analysis.
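Continuing from the earlier sketches (prop_info and resale as built above), a rough sketch of the merge and a couple of cleaning steps might look like this; the rental file name and the columns used for cleaning are assumptions.

```python
import pandas as pd

# Rental transactions scraped from HDB Map Services (hypothetical file name).
rental = pd.read_csv("hdb_rental_transactions.csv")

# Attach each transaction to its block's attributes in prop_info via the postal code.
resale_full = resale.merge(prop_info, on="postal_code", how="inner", suffixes=("", "_blk"))
rental_full = rental.merge(prop_info, on="postal_code", how="inner", suffixes=("", "_blk"))

# Light cleaning ahead of the regression: derive the flat's age at the point of sale
# and drop transactions that are missing key attributes.
resale_full["flat_age"] = (pd.to_datetime(resale_full["month"]).dt.year
                           - resale_full["lease_commence_date"])
resale_full = resale_full.dropna(subset=["floor_area_sqm",
                                         "shortest_ed_to_industrial_km"])
```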

For a recap of the data sources, refer to the data collection overview in Figure 1.

If you have been following along, congrats, you are done with data collection and some data cleaning! In part 2, I will continue briefly with data cleaning and then move on to linear regression.

And if you have found this article useful so far, I would appreciate a clap on the article :)
