Using data science to choose retail investment markets
When a real estate investor decides where to deploy capital, they typically weigh a range of metrics and characteristics for each market of interest. For example, the investor may assess a market’s potential future returns on real estate. They may also consider risk metrics, such as a market’s liquidity, transparency and vacancy levels. Other important metrics include economic growth prospects and the supply pipeline.
With so many variables to examine and so many markets to choose from, how might a property investor decide which geographies to explore in more depth? In this blog, I explore the use of data science and machine learning to make this process a bit easier. I employ a few machine learning techniques[1] to qualify, cluster and categorise markets according to their numerous metrics and characteristics. For this analysis, I look at potential investments in retail mall assets across cities in Asia Pacific.
To begin the analysis, I examine data for the following metrics across 33 cities in Australasia (6), Greater China (13), India (6), Southeast Asia (7) and South Korea (1).
Figure 1: List of metrics examined.
Macroeconomic (Source: Oxford Economics, unless otherwise noted)
Population 5-year growth forecast
 GDP 5-year growth forecast
 Retail sales 5-year growth forecast
 Current median income level (and 5-year growth forecast)
 Pre-COVID international tourism levels (2018) (Source: Euromonitor & other sources)
Real estate investment fundamentals (Source: JLL, unless otherwise noted)
 Market liquidity based on transaction volumes since 2015 (Source: MSCI RCA)
 Global real estate transparency index (2022)
 Current risk-free rate, proxied by 10-year government bond yield (Source: Oxford Economics)
 5-year net rent growth forecast
 Historical 5-year net rent growth volatility
Real estate supply (Source: JLL)
 Current retail space per capita
 Current retail space per retail sales
 Current and forecast 2027 vacancy rate
 5-year supply pipeline
Because there are many variables to look through, it is useful to employ a machine learning technique[2] that combines correlated variables into a smaller number of composite groupings, reducing the dimensionality of the dataset so that the results are ultimately easier to interpret. Another machine learning algorithm[3] is then used to identify clusters of markets that share broadly similar characteristics. The results are shown below.
Figure 2: Using PCA and k-means clustering to analyse retail markets.
Source: JLL, Oxford Economics, Euromonitor, MSCI RCA, and other market sources (data collected as of Feb 2023)
Here, we visualise the results of the analysis using two variable groupings: one focused on variables that indicate economic development levels, and the other focused primarily on macroeconomic and real estate market characteristics. If we categorise the cities into four clusters, we see that major retail hubs, such as Hong Kong and Singapore, fall in the same cluster because they have higher economic development levels along with a stronger international tourism market. Meanwhile, cities from the same country or subregion also tend to cluster together as they share similar characteristics. For example, mainland Chinese markets sit neatly within a single cluster that is distinct from the Indian cities, given that Chinese markets generally have higher economic development levels, lower population growth and higher retail investment market liquidity than their Indian counterparts.
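For readers curious about the mechanics, the two-step workflow can be sketched roughly as follows. This is a minimal illustration rather than the exact pipeline behind Figure 2. It assumes the metrics are standardised first, that the two variable groupings correspond to the first two principal components, and that k-means is run on those two component scores with four clusters. The data is random placeholder data (33 rows by 14 columns), not the actual market data, and the function names are hypothetical.

```python
import numpy as np

def standardise(X):
    """Scale each column to zero mean and unit variance so no metric dominates."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_scores(X, n_components=2):
    """Project already-centred data onto its leading principal components (via SVD)."""
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:n_components]           # loadings: one row per component
    explained = (S ** 2) / np.sum(S ** 2)    # share of variance per component
    return X @ components.T, components, explained[:n_components]

def kmeans(points, k=4, n_iter=100, seed=0):
    """Lloyd's algorithm: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            # keep the old centroid if a cluster happens to be empty
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Placeholder input: 33 "cities" x 14 "metrics" of random numbers (NOT real market data).
rng = np.random.default_rng(42)
X = rng.normal(size=(33, 14))

Z = standardise(X)
scores, loadings, explained = pca_scores(Z, n_components=2)
labels, centroids = kmeans(scores, k=4)

print("Variance share of the two components:", explained.round(3))
print("Cluster label per city:", labels)
```

Each row of `loadings` shows how strongly every input metric contributes to a component, which is how a component can be read as, say, an "economic development" grouping. In practice, a library such as scikit-learn provides `PCA` and `KMeans` classes that cover the same steps.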
This analysis gives investors a clearer view of the characteristics that differentiate retail markets across the region, and of how similar locations cluster together, before they proceed with their investment decisions. While this analysis looks at cities only at a macro market level, it provides another possible way to examine real estate markets and could serve as a helpful first step before a deeper analysis at the asset level.
[1] These techniques include principal component analysis (PCA) and k-means clustering.
[2] PCA is used to combine correlated variables into a smaller number of composite variables, reducing the dimensionality of a dataset.
[3] The k-means clustering technique is used to identify and group observations that share broadly similar characteristics.