This blog is about analyzing the same dataset with and without considering the location dimension in order to quantify the benefit of handling spatial data.

Well, the title is a trap. Actually, this blog is all about using location data in machine learning models. However, instead of preaching that the location dimension is just as important as the time dimension, I would like to show you an end-to-end example of analyzing the same dataset with and without location information. Eventually, we will compare two trained ML models and quantify the impact of location information on our prediction quality.

So, cards on the table! This time it's not just preaching!

You will find the full Jupyter Notebook to replicate the example in my GitHub repository. By going into the details, you will learn how to

  • Upload a CSV containing spatial data (housing sales) to HANA Cloud (see the upload sketch right after this list)

  • Train a regression model on the housing price without spatial features

  • Retrieve and upload OpenStreetMap data for the respective areas

  • Compute spatial features by using database functions

  • Train the same regression model including those spatial features

  • Compare both models

  • Predict the price of a reference house at different locations across the city

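To give you an idea of the first step, here is a minimal upload sketch in Python. The connection parameters are placeholders, and the geo_cols behaviour assumes a recent hana_ml release; the full, working version is in the notebook.

# A minimal sketch, assuming hana_ml >= 2.9; the Kaggle file really does spell
# the coordinate columns 'Lattitude' and 'Longtitude'.
import pandas as pd
from hana_ml import dataframe

conn = dataframe.ConnectionContext(
    address='<hana-cloud-host>', port=443, user='<user>', password='<password>')

df = pd.read_csv('melb_data.csv')
df = df.dropna(subset=['Lattitude', 'Longtitude'])  # keep geo-referenced records only

# The (lon, lat) tuple in geo_cols is turned into a point geometry column on
# upload; we upload in SRS 4326 and transform to the planar SRS 28355 in-database.
dataframe.create_dataframe_from_pandas(
    conn, df, table_name='MELB_HOUSING', force=True,
    geo_cols=[('Longtitude', 'Lattitude')], srid=4326)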

Also, I will use some brand-new features of SAP HANA Spatial, which have been introduced with HANA Cloud and HANA2 SPS05. The following features are covered by the example:

  • Creation of predefined Spatial Reference Systems. HANA Cloud and HANA2 SPS05 carry the definitions of more than 9000 SRS under the hood. Simply issue the following statement to create a predefined SRS:
    CREATE PREDEFINED SPATIAL REFERENCE SYSTEM IDENTIFIED BY <srs-id>


  • Voronoi Tessellation. With HANA Cloud and HANA2 SPS05 we introduced in-database Voronoi tessellation. SAP HANA is the first database to deliver Voronoi cells while preserving the relation between input points and output cells, which enables sophisticated advanced analytics use cases. Check out this video to learn more. Voronoi cells can be calculated either by calling ST_ClusterCell on k-means clusters or by using the ST_VoronoiCell window function on point sets.
    ST_VoronoiCell(<point_column>, <margin>) OVER (<optional_over_parameters>)


  • K-Means Clustering. This clustering method can be used to split points into a fixed number of clusters by minimizing the squared distance of each point to its cluster centroid.
    SELECT COUNT(*), ST_ClusterID(), ST_ClusterCell() FROM table
    GROUP CLUSTER BY point USING KMEANS CLUSTERS 30


  • Hexagonal Clustering. The usefulness of in-database hexagonal clustering has already been discussed in many other blogs. This time, the clustering will not be used as a location proxy for ML models, but rather for map visualizations and for generating a grid of points.
    SELECT COUNT(*), ST_ClusterID(), ST_ClusterCell() FROM table
    GROUP CLUSTER BY point USING HEXAGON X CELLS 5;


  • Last but not least, the spatial "bread and butter":

    • Creation of ST_Geometry columns [doc],

    • transformation between SRS [doc],

    • spatial aggregations [doc],

    • joins using spatial predicates [doc],

    • exporting geometries as text [doc],

    • distance calculations with LineStrings and Polygons [doc],

    • calculation of centroids [doc] and more




About the Data


The dataset for this example can be downloaded from Kaggle. It contains almost 35,000 housing sales from the years 2016, 2017 and 2018 in the City of Melbourne, Australia. Unfortunately, only around 27,000 records are geo-referenced with proper longitude and latitude information. Our analysis will be based on those records.

We will use the data of 2016 and 2017 to train our model (86% of records). The model will then be applied to the data of 2018 (14% of records) for testing the prediction quality.


Year   #Records
2016   6397
2017   12019
2018   2577

The initial dataset already includes a few columns which act as a proxy for geographic location: suburb, address, distance (to the city center), postcode, councilarea, regionname and propertycount (within the neighbourhood). We drop all of those columns to train a model which relies solely on the attributes of the house itself. Afterwards, we will use only the latitude and longitude information to generate similar (or even better) spatial features on the database level.


Input data without geo-spatial fields (except lat/lon)


These are the features that we will consider in the following:

  • id - identifier for the record

  • type - type of property

    • h - house, cottage, villa, semi, terrace

    • u - unit, duplex

    • t - townhouse



  • rooms - number of rooms

  • bedroom2 - number of bedrooms

  • bathroom - number of bathrooms

  • car - number of car spots

  • landsize - land size in sqm

  • buildingarea - building size in sqm

  • yearbuilt - construction year

  • shape - lat/lon information

  • price - the sales price in Australian Dollars


Our database representation of latitude and longitude will be a column of type ST_GEOMETRY(28355). SRS 28355 is a planar spatial reference system which is suitable for Australia. This SRS is not installed by default on SAP HANA. However, with the latest release, the database already knows the definition of the SRS and we simply need to issue the following statement to make it usable:
CREATE PREDEFINED SPATIAL REFERENCE SYSTEM IDENTIFIED BY 28355

With this new feature, prior approaches such as the metadata installer tool or creating spatial reference systems by their full definition are obsolete for EPSG spatial reference systems. The units of measure used by the predefined SRS also come out of the box. No need to install them separately.

(Note that it is still possible to create spatial reference systems by their definition. It is just not required when using the predefined SRS.)

ML Model without Location Information


As a reference, we will first train an XGBoost regression model for the price, incorporating the features listed above (except the field shape, of course). To explain the model and understand the impact of the features/attributes on the price, we leverage the SHAP library.
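Here is a condensed sketch of this baseline step. Column names, hyperparameters and the dummy encoding of type are my assumptions; the notebook linked above contains the actual code.

# Sketch of the non-spatial baseline: year-based split, XGBoost regression, SHAP.
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

data = pd.get_dummies(df, columns=['type'])           # yields type_h / type_t / type_u
features = ['rooms', 'bedroom2', 'bathroom', 'car', 'landsize',
            'buildingarea', 'yearbuilt', 'type_h', 'type_t', 'type_u']

train = data[data['year'] < 2018]                     # 2016/2017 for training
test = data[data['year'] == 2018]                     # 2018 for testing

model = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(train[features], train['price'])

# median absolute error on the hold-out year
mae = np.median(np.abs(model.predict(test[features]) - test['price']))
print(f'Median absolute error (test): ${mae / 1000:.0f}k')

# SHAP beeswarm plot of feature impact (as in the figure below)
explainer = shap.TreeExplainer(model)
shap.summary_plot(explainer.shap_values(test[features]), test[features])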

The median absolute error of our model is $220k on the training data (years 2016 and 2017) and $300k on the test data (year 2018). To put that into relation, note that the median housing price in the dataset is $910k. From this we can draw two conclusions:

  1. We are more than 30% off on our test dataset.

  2. The behaviour of our model on the training and test datasets differs significantly, which means that our model is not very robust on new data.


Still, let's take a look at the model explanation by SHAP and the importance of features:


Impact of non-spatial features


How do we read this graphic? The color corresponds to the value of the individual feature: red dots are records with a high value and blue dots are records with a low value. The x-axis describes the level of influence on the price (the regression target); the vertical line at zero is 'neutral'. The sorting of the features corresponds to their importance for the model. The most important variable in this case is yearbuilt. Let's look at two examples to make this clearer:

  1. yearbuilt. The higher the value of yearbuilt, the lower the price. This means old houses tend to be more expensive. What at first sounds counter-intuitive (at least if it's not a historic building) can possibly be explained by the fact that older houses are typically located in long-established and thus central regions, whereas new houses are often located in development areas in the suburbs. So eventually the spatial dimension may have sneaked into our non-spatial model...

  2. type_t. The fact that a house is a townhouse has only a low impact on the outcome; the values are all centered around zero. Whether a property is a unit (type_u) or a detached house (type_h) is far more important.


Compute Spatial Features


Now, let's see if we can improve the model by enhancing our dataset with geo-spatial features. The geo-spatial features in our case can be divided into three sub-categories:

  1. Based on OSM Point of Interest data

  2. Artificial Region Proxy

  3. Areas of Interest


OpenStreetMap Point of Interest Data


To understand the surroundings and nature of the different neighbourhoods, we first retrieve and upload POI data from OpenStreetMap using the OSMnx library. To determine the query range, we check which area is covered by our dataset using ST_ConvexHullAggr.
SELECT ST_ConvexHullAggr(SHAPE) FROM MELB_HOUSING_PREDICT_VIEW


Convex hull of all records in the dataset


Within the polygon above, we search for POIs with the tags amenity, shop or airport. Based on the OSM attributes, we categorize the POIs into the following types: entertainment, health, education, childcare, community, emergency, grocery, shopping, airport.
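A sketch of this download step with OSMnx could look as follows. The exact tag selection is an assumption (airports, for instance, are tagged via the OSM key aeroway), and the categorization dictionary is only an illustrative excerpt.

# Sketch of the POI download; hull_wkt is the convex hull from the query above,
# exported as WKT in SRS 4326.
import osmnx as ox
from shapely import wkt

hull = wkt.loads(hull_wkt)

# features_from_polygon in OSMnx >= 1.3 (geometries_from_polygon in older releases)
pois = ox.features_from_polygon(
    hull, tags={'amenity': True, 'shop': True, 'aeroway': 'aerodrome'})

# illustrative excerpt of the categorization into POI types
category = {'bar': 'entertainment', 'restaurant': 'entertainment',
            'hospital': 'health', 'school': 'education',
            'supermarket': 'grocery', 'mall': 'shopping'}
pois['poi_type'] = pois['amenity'].fillna(pois['shop']).map(category)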


Sample poi data


Overall, we downloaded 50,000 POIs and selected around 15,000 which we found relevant based on the categorization above. We can use mapping libraries, like Folium, to create heatmaps for different POI types.


Folium heatmap of entertainment pois


Based on the POI data, we generate a set of features which describe the distance from each house to the nearest POI of a given type, as well as the density of POIs around the listing.

We can use the following SQL view to efficiently query the nearest POI of a given type.
CREATE OR REPLACE VIEW MELB_HOUSING_CLOSEST_POI (IN POI_TYPE VARCHAR(100)) AS
(
SELECT A.ID, A.SHAPE.ST_DISTANCE(B.SHAPE_PT) AS DIST_POI
FROM MELB_HOUSING_PREDICT_VIEW A
LEFT JOIN
(
SELECT OSMID, SHAPE_PT, ST_VoronoiCell(SHAPE_PT, 10.0) OVER () AS VCELL
FROM MELB_OSM_POI_VIEW
WHERE POI_TYPE = :POI_TYPE
) B ON A.SHAPE.ST_INTERSECTS(B.VCELL) = 1
)

Note that in the inner select statement we first used Voronoi tessellation to generate a Voronoi cell for each POI of a certain type. The Voronoi cell covers the area where every location is closer to the cell's centroid (i.e. the POI) than to any other centroid. If we now intersect the coordinates of a house with the Voronoi cells, we immediately get the closest POI by retrieving the record of the cell's centroid. This is a nice example of using Voronoi cells for a non-visual purpose.

For counting the POIs of a certain type within a given distance (e.g. 1000 meters) around a listing, we use the following SQL view.
CREATE OR REPLACE VIEW MELB_HOUSING_COUNT_POI(IN POI_TYPE VARCHAR(100), IN DISTANCE_METER INTEGER) AS
(
SELECT A.ID, COUNT(*) AS CNT_POI
FROM MELB_HOUSING_PREDICT_VIEW A
JOIN MELB_OSM_POI_VIEW B ON A.SHAPE.ST_WITHINDISTANCE(B.SHAPE,:DISTANCE_METER) = 1
WHERE :POI_TYPE IS NULL OR B.POI_TYPE = :POI_TYPE
GROUP BY A.ID
)
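Both parameterized views can then be queried per POI type, for example from Python via hana_ml. This is a sketch; positional parameter binding for the views is assumed here.

# Pull the per-type distance and density features into pandas (distances are in
# meters of the planar SRS 28355).
dist_grocery = conn.sql(
    "SELECT ID, DIST_POI FROM MELB_HOUSING_CLOSEST_POI('grocery')").collect()

cnt_grocery = conn.sql(
    "SELECT ID, CNT_POI FROM MELB_HOUSING_COUNT_POI('grocery', 1000)").collect()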


Artificial Region Proxy


In our initial dataset we dropped columns such as suburb, which is a geo-spatial identifier for a region. A suburb or district, however, has a political or administrative dimension, and the area covered may change over time. On the other hand, each of us knows that this is important information for housing prices. There are simply expensive districts, where we may sometimes get the impression of paying for an address rather than a house.

We want to avoid this administrative dimension and choose a data-driven approach to infer regions from the data. In similar examples, I have used hexagonal clustering and treated the individual cells as a proxy for a neighbourhood. This time I want to limit the number of regions and at the same time infer a meaningful clustering based on the density of listings.

The following SQL view is used to generate 30 k-means clusters and the Voronoi cells of their centroids. Each area/cell is given a name containing a simple ID (REGION-xx).
CREATE OR REPLACE VIEW MELB_REGION_PROXY AS
(
SELECT
ST_ClusterID() AS CLUSTER_ID,
'REGION-' || ST_ClusterID() AS CLUSTER_NAME,
ST_ClusterCell() AS CLUSTER_CELL
FROM MELB_HOUSING_PREDICT_VIEW
GROUP CLUSTER BY SHAPE USING KMEANS CLUSTERS 30
)
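Assigning each listing to its region is then a simple spatial join against the Voronoi cells, sketched here with the hana_ml connection assumed from above.

# Intersect each house location with the Voronoi cell of its k-means centroid.
regions = conn.sql("""
    SELECT A.ID, R.CLUSTER_NAME AS REGION
    FROM MELB_HOUSING_PREDICT_VIEW A
    JOIN MELB_REGION_PROXY R
      ON A.SHAPE.ST_Intersects(R.CLUSTER_CELL) = 1
""").collect()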


30 regions based on k-means clustering



Areas of Interest


In every city there are certain areas of interest where proximity influences the housing price. For the Melbourne example, we want to use two such 'areas':

  • CBD - the Central Business District of Melbourne

  • Port Phillip - the bay/coastline of Melbourne



CBD and Coastline as Polygon/Linestring


The CBD is in the very center of the city and is itself a large area. If we measured proximity only to a single point in the center, we would risk not correctly reflecting that the "center" of a city is usually not a point, but a larger area. For this reason we calculate the closest distance from each house to the CBD polygon - and to the coastline as well, as this is certainly a driver for housing prices - by using ST_Distance with polygons (CBD) and linestrings (coastline).

The coastline is a linestring with rather high precision. To measure the proximity of a house to the bay, it is sufficient to reduce the number of points of the linestring while preserving its characteristic shape. We can use ST_Simplify to do so. The parameter determines that the new (reduced) coastline should not deviate more than 100 meters from the original one.
SELECT ST_GeomFromText(:bay_poly, 3857).ST_Simplify(100) FROM DUMMY
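As a sketch, the resulting distance features could be computed like this; the WKT strings for the CBD polygon and the coastline as well as the exact SRS handling are my assumptions.

# Closest distance of each listing to the CBD polygon and the simplified
# coastline, computed in-database; results are meters in the planar SRS 28355.
dist_aoi = conn.sql(f"""
    SELECT ID,
           SHAPE.ST_Distance(ST_GeomFromText('{cbd_wkt}', 28355)) AS DIST_CBD,
           SHAPE.ST_Distance(ST_GeomFromText('{bay_wkt}', 3857)
                             .ST_Simplify(100).ST_Transform(28355)) AS DIST_BAY
    FROM MELB_HOUSING_PREDICT_VIEW
""").collect()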

ML Model including Location Information


Now let's bring all the non-spatial and spatial features together to see if we can improve the housing price prediction that we made earlier. We added the following geo-spatial features:

  • Distance to Central Business District

  • Distance to Coastline/Bay

  • Region (identified by id)

  • Closest POI of each type: entertainment, health, education, childcare, community, emergency, grocery, shopping, airport

  • Number of POIs of the same categories within a distance of 1000 meters


To be clear: for the POIs, we generate one feature per type (e.g. dist_entertainment, cnt_entertainment, dist_health, cnt_health, etc.).
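Conceptually, the merged feature set could be assembled like this, reusing the assumed names from the snippets above.

# Merge the database-computed spatial features into the training frame and
# one-hot encode the region proxy before re-training the same XGBoost model.
df_spatial = (data
    .merge(dist_aoi.rename(columns=str.lower), on='id')
    .merge(regions.rename(columns=str.lower), on='id'))
df_spatial = pd.get_dummies(df_spatial, columns=['region'])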

The median absolute error of our new model is $135k on the training data (years 2016 and 2017) and $130k on the test data (year 2018). Conclusions:

  1. We are only 14% off on our test dataset.

  2. The behaviour of our model on the training and test datasets is similar. In fact, it performs slightly better on the test set.


Let's again take a look at the variable contributions.


Impact of spatial and non-spatial features


The impact of spatial features:

  • dist_cbd - the distance to the CBD is the most important feature. The closer to the center, the higher the price.

  • dist_bay - the distance to the bay is the second most important spatial feature. Houses at the bay are more expensive.

  • region - regions 22, 11, 29, 25 and 30 tend to be more expensive, whereas regions 5 and 7 tend to be cheaper.

  • pois - the distance to the airport as well as the density of bars and restaurants have a certain (but smaller) impact.


Comparing both Models


Well, that's easy. Let's put the numbers mentioned above into one table.

                          Non-Spatial   Spatial   Gain
MAE on Test               $300k         $130k     $170k
Error Ratio               33%           14%       19%
MAE Deviation Train/Test  25%           5%        20%

Bottom line: The enhancement with spatial features significantly improved the precision and robustness of our model.

Was this surprising? By no means! Think of what would happen if your ML model or reporting ignored the time dimension of temporal data. Eventually your results would be useless... The same happens when dealing with data that carries a spatial dimension.

Remember that the non-geo-spatial model also had yearbuilt as its most influential factor, which can be explained by the fact that old buildings tend to be in the city center. In that case, yearbuilt acted as a proxy for geo-spatial proximity to the city center (i.e. dist_cbd).

Predicting the Price of a Reference House


We could stop here. We have shown that handling spatial data is worth the effort and that, in fact, the effort is not as high as you might expect if you are using HANA's in-database handling of spatial data. That was what I wanted to show in the first place.

However, having the right model at hand, we may now be curious about the isolated impact of the spatial dimension on the housing price.

Given the exact same house in different areas of the city, how would the price differ?

Before we all go to bed or continue working, let's answer this very last question. By looking at the base data we can construct the "average house".
CREATE OR REPLACE VIEW MELB_HOUSING_PREDICT_VIEW AS
(
SELECT
ST_ClusterID() AS ID,
3 AS ROOMS,
'h' AS TYPE,
3 AS BEDROOM2,
2 AS BATHROOM,
2 AS CAR,
600 AS LANDSIZE,
160 AS BUILDINGAREA,
1965 AS YEARBUILT,
ST_ClusterCell().ST_Centroid() AS SHAPE,
ST_ClusterCell() AS SHAPE_CELL,
2020 AS YEAR
FROM MELB_HOUSING
GROUP CLUSTER BY SHAPE USING HEXAGON X CELLS 75
)

With this view we span a grid of hexagonal cells across the city, 75 cells in x direction. We can use the records above to make a prediction for the centroid of each cell. For the data at hand, that amounts to 1301 price predictions for the exact same house at different locations.

We can use the hexagonal cell of each prediction centroid to visualize the distribution across the city. Since we use the same data preparation as before, the prediction considers all our spatial features (POIs, areas of interest and regions) while fixing the non-spatial features to static values.
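A sketch of this visualization step; model_spatial and the grid feature matrix X_grid are assumed to come from the pipeline above.

# Predict the reference house for every hexagon and color the cells with folium.
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

grid = conn.sql("""
    SELECT ID, SHAPE_CELL.ST_Transform(4326).ST_AsGeoJSON() AS CELL
    FROM MELB_HOUSING_PREDICT_VIEW
""").collect()
grid['PRICE'] = model_spatial.predict(X_grid)

norm = colors.Normalize(grid['PRICE'].min(), grid['PRICE'].max())
m = folium.Map(location=[-37.81, 144.96], zoom_start=10, tiles='cartodbpositron')
for _, row in grid.iterrows():
    hex_color = colors.to_hex(cm.RdYlGn_r(norm(row['PRICE'])))
    folium.GeoJson(
        row['CELL'],
        style_function=lambda f, c=hex_color: {
            'fillColor': c, 'color': c, 'fillOpacity': 0.6, 'weight': 1}
    ).add_to(m)
m.save('reference_house_prices.html')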


Price prediction for a reference house (red=expensive)


In line with what we identified during the model debriefing, we see hotspots in the center of the city as well as along the coastline. We can also see expensive regions which are neither close to the city center nor to the coastline. Depending on the location, the price for our 160sqm reference house varies between 330k and 2.5m Australian Dollars.

If we go down the coastline to the very south, we see a cell where our reference house is worth 1.9 million Australian Dollars.


Coastline prices


I leave it up to you to check the satellite imagery to see whether this makes sense... I did, for sure.

