Enriching point data in GeoAnalytics Engine

SBattersby · 09-12-2023

Point locations are an overwhelmingly common input to spatial big data analytics – we may have points for customers, patients, field samples, GPS points tracking vehicles, and on and on. For people working in big data analytic environments, these points may number in the hundreds of millions or even billions. While these data are a necessary input for our spatial analyses, the raw data may not already have the attributes we need to fuel downstream visualization and predictive analysis.

Across the Esri suite of products, there are numerous built-in ways to add additional context and insights to your data using curated data sources and algorithms from Esri, for instance the ArcGIS GeoEnrichment Service that is part of the ArcGIS Platform, or the Enrich tools in Business Analyst. While these tools aren’t available in ArcGIS GeoAnalytics Engine, in this post, we’ll highlight some of the ways in which you can use your own contextual datasets in GeoAnalytics Engine to add value to point-level data to support big data analytic workflows. Specifically, we’ll explore:

Data enrichment based on shared location – each data point is enriched by the attributes of a polygon that it is inside
Data enrichment based on proximity – each data point is enriched by the attributes of features that are nearby
Data enrichment based on space & time – each data point is enriched with attributes of a polygon, but only when it coexists in both time and space
Data enrichment based on discrete bin systems – each data point is enriched with an ID for a common discrete bin system, such as Uber H3

Data enrichment based on shared location (spatial overlay)

The first example we’ll look at is one of the most basic: augment our point data based on the polygon in which they are located. For instance, if we want to understand the socioeconomic characteristics of customers with the largest annual purchases, we can use a spatial join to assign new attributes to our points, based on the census area in which the customer resides. To do this, we can enrich each of our customer points using the US Census Bureau’s American Community Survey census tract-level data for Neighborhood Socioeconomic Status (NSES) that is available through the ArcGIS Living Atlas. Feature Services from the Living Atlas, or other ArcGIS services, can be read directly by GeoAnalytics Engine to create a data frame for use in analyses.

Points enriched based on the census tract they are within

The simplest way to enrich the customer points with the tract-level NSES data is to use a spatial join to identify the census tract that each point falls inside. To do this in GeoAnalytics Engine we just need two data frames – one for our points (df_customers) and one for our polygons (df_nses), and an overlay operation to define the spatial relationship between the data, like ST_Within.

from geoanalytics.sql import functions as ST
df_customers_enrich = df_customers.join(df_nses, ST.within("geometry_point", "shape"))

Utilizing the ST_Within function in our join results in a new data frame with each of the customer points enriched with all of the NSES data. For example, here is one record:

Attribute table for a single enriched point

As an interesting side note, it is worth mentioning how GeoAnalytics Engine streamlines the process of working with spatial data. Take note of the attributes mentioned above for the “geometry_point” (bottom of the black box at the top), and the “shape” (bottom of the yellow box). The point data for the customers use coordinates of latitude and longitude (in WGS84), while the polygons from the NSES feature service are in web Mercator coordinates (meters). We didn’t have to transform the coordinate system for our data frames to perform the spatial join because GeoAnalytics Engine can project the data on the fly. The spatial intersection is calculated correctly when the coordinate systems are known for all of the inputs.
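To see why the coordinate systems matter, here is a minimal plain-Python sketch, independent of GeoAnalytics Engine, of what has to happen behind the scenes: the point is projected from WGS84 degrees to Web Mercator meters, and only then tested against the polygon. The tract outline, the customer location, and the helper names `to_web_mercator` and `point_in_polygon` are hypothetical and chosen only for illustration; the engine's own implementation will differ.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, the sphere radius used by Web Mercator


def to_web_mercator(lon_deg, lat_deg):
    """Project WGS84 longitude/latitude (degrees) to Web Mercator (meters)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def point_in_polygon(x, y, ring):
    """Ray-casting test: is (x, y) inside the ring [(x0, y0), (x1, y1), ...]?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's y value can cross the horizontal ray
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical "tract": a box around downtown Seattle, stored in Web Mercator meters
tract_mercator = [
    to_web_mercator(lon, lat)
    for lon, lat in [(-122.36, 47.59), (-122.31, 47.59), (-122.31, 47.63), (-122.36, 47.63)]
]

# Hypothetical customer location in WGS84 degrees
customer_lon, customer_lat = -122.335, 47.608

# Comparing raw degrees against meters misses the polygon entirely
naive = point_in_polygon(customer_lon, customer_lat, tract_mercator)

# Projecting the point first puts both inputs in the same coordinate system
projected = point_in_polygon(*to_web_mercator(customer_lon, customer_lat), tract_mercator)

print("without projection:", naive)
print("with projection:", projected)
```

In the sketch the unprojected test fails even though the customer is inside the tract, which is the mistake that on-the-fly projection saves us from.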

Data enrichment based on proximity

While enrichment based on points being contained in polygons is probably the most common approach, there are other great ways to add value to your data using GeoAnalytics Engine. Maybe you don’t need to know about the characteristics of the census tracts for your points, but you really want to know something distance-based, like the attributes of and distance to the closest store, school, transit stop, etc. In this case, we want to enrich based on nearest neighbor.

Let’s look at an example using point of interest data from the Overture Maps Foundation. We will find the three nearest libraries to each school using the Nearest Neighbors tool in GeoAnalytics Engine and enrich our schools’ data with the distance to each of the three nearest libraries. The result of this can help us understand if there are disparities in resources available to the students at different schools, and if there are spatial patterns that would highlight under-served locations.

Using Nearest Neighbors, we can integrate these two sets of points in their respective data frames to find all nearest neighbors like this:

from geoanalytics.tools import NearestNeighbors

libraries_near_schools = NearestNeighbors()\
    .setNumNeighbors(3)\
    .setSearchDistance(1, "mile")\
    .setResultLayout("long")\
    .run(seattle_schools, seattle_libraries)

This will find up to the three closest libraries (setNumNeighbors) within 1 mile (setSearchDistance) of each school, giving us a result in long format like this (multiple records for each school; one for each library that is nearby):

Nearest neighbors results, long format

Or we could return the data in a wide format with a single record for each school and separate columns for each nearby library:

Nearest neighbors results, wide format
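If the difference between the two layouts is unclear, the following plain-Python sketch may help. It is a brute-force illustration of the idea rather than how the Nearest Neighbors tool is implemented, and the coordinates, IDs, and column names (`school`, `library`, `rank`, `miles`) are hypothetical and not the tool's actual output schema.

```python
import math


def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles, assuming a spherical Earth of radius 3958.8 mi."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 3958.8 * math.asin(math.sqrt(a))


def nearest_neighbors_long(targets, candidates, k=3, max_miles=1.0):
    """Long layout: one row per (target, neighbor) pair, ranked by distance."""
    rows = []
    for t_id, (t_lon, t_lat) in targets.items():
        ranked = sorted(
            (haversine_miles(t_lon, t_lat, c_lon, c_lat), c_id)
            for c_id, (c_lon, c_lat) in candidates.items()
        )
        # Keep only candidates inside the search distance, then the k closest
        ranked = [(d, c) for d, c in ranked if d <= max_miles][:k]
        for rank, (dist, c_id) in enumerate(ranked, start=1):
            rows.append({"school": t_id, "library": c_id, "rank": rank, "miles": dist})
    return rows


def to_wide(rows):
    """Wide layout: one record per school, with numbered columns per neighbor."""
    wide = {}
    for r in rows:
        rec = wide.setdefault(r["school"], {"school": r["school"]})
        rec[f"library{r['rank']}"] = r["library"]
        rec[f"miles{r['rank']}"] = r["miles"]
    return list(wide.values())


# Hypothetical (lon, lat) locations
schools = {"S1": (-122.33, 47.61)}
libraries = {
    "L1": (-122.33, 47.615),  # roughly a third of a mile north
    "L2": (-122.34, 47.61),   # roughly half a mile west
    "L3": (-122.33, 47.70),   # several miles away, outside the search distance
}

long_rows = nearest_neighbors_long(schools, libraries, k=3, max_miles=1.0)
wide_rows = to_wide(long_rows)
print(long_rows)
print(wide_rows)
```

Note that a school with fewer than three libraries inside the search distance simply gets fewer rows in the long layout, or fewer populated columns in the wide one.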

We can also look at the results as a map to see the distribution of schools (black) and libraries (red), with connecting lines showing the closest libraries to each school. We can do this in GeoAnalytics Engine using the results from our Nearest Neighbors calculation. Since each result row in the table has the geometry from each input (Schools & Libraries), we can simply connect the points for each using the ST_ShortestLine function:

# plot the schools
plt = seattle_schools.st.plot(**sea_style, color="black", marker_size=10)

# create the connector lines
libraries_near_schools.select(ST.shortest_line("geometry","geometry1"))\
.st.plot(ax=plt, linewidth=1, color="grey")

# plot the libraries
seattle_libraries.st.plot(ax=plt, color="red", marker_size=12)

Nearest neighbors as a map

These values can then be joined back to the original school dataset using the School ID as a common key to enrich the schools with information about their closest libraries:

from pyspark.sql import functions as F
seattle_schools = seattle_schools.join(libraries_near_schools_wide_join, on=(F.col("id") == F.col("SchoolID")))

This results in a table with all of the new distance attributes to enrich our original schools data with library proximity details – here is an example of one record for the Middle College High School and the three closest libraries:

nearest neighbors result joined to schools dataset

We can now use this information in our schools dataset for any of our subsequent analyses, for instance, to identify spatial gaps where students don’t have sufficient local access to library resources.

Data enrichment based on spatial and temporal relationships

Spatial relationships are a primary consideration when working with data in GeoAnalytics Engine; it is, after all, designed to deliver spatial analysis to your big data. But data exists in both space and time, and sometimes it is important to restrict our matches to features that overlap in both location and time. For instance, we might want to find all of the locations where a set of migrating vultures have traveled near an active weather watch or warning to understand how migration patterns are impacted by changes in weather and climate. Using this information, we can enrich the vulture migration dataset so that each migration observation point gets the attributes of the weather watch or warning that it was near.

For this example, we’ll use the Movebank Vultures Acopian Center USA GPS (2003-2021) data and five years of data from the National Weather Service watch, warning, and advisory events made available from the Iowa State University Iowa Environmental Mesonet. The vulture dataset contains almost 2 million records with location and time tracked and recorded for 78 individual vultures. The NWS watch, warning, and advisory data for 2010-2014 has 2.1 million records.

vulture migration, storm watches and warnings

We can calculate the relationship in space and time using a Spatiotemporal Join where we set parameters for the specific relationships that we want. For instance, this will look for any vultures that were within 25 miles of a watch or warning polygon, but only when the watch or warning was active.

from geoanalytics.tools import SpatiotemporalJoin

vulture_watches = SpatiotemporalJoin()\
    .setLeftJoin(left_join=True)\
    .setJoinOneToMany()\
    .setSpatialRelationship(spatial_relationship="NearGeodesic", near_distance=25, near_distance_unit="miles")\
    .setTemporalRelationship("During")\
    .run(target_dataframe=df_vulture, join_dataframe=df_nws_polys)

This identifies 70 times when vultures were near or within an active watch or warning and enriches the vulture dataset with the details for the matching watches and warnings. At this point we have an enriched data table that provides details like this:

vulture migration data enriched with watch and warning data

We can also see the results graphically, with the relevant watches / warnings (red) mapped with the vulture tracks (green points and grey lines) and the times when the vultures were near to an active watch / warning (black points).

vulture migration with weather watch or warnings near in space and time
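The "near in space, during in time" condition behind this join can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Spatiotemporal Join implementation: each alert is reduced to a single representative point instead of a polygon, "during" is assumed to mean the observation time falls inside the alert's active interval, and all records and field names are hypothetical.

```python
import math
from datetime import datetime


def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles, assuming a spherical Earth of radius 3958.8 mi."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 3958.8 * math.asin(math.sqrt(a))


def spatiotemporal_left_join(observations, alerts, near_miles=25.0):
    """Left join, one-to-many: keep every observation, one row per matching alert."""
    joined = []
    for obs in observations:
        matches = [
            a for a in alerts
            # Temporal condition: the alert was active when the bird was observed
            if a["start"] <= obs["time"] <= a["end"]
            # Spatial condition: the alert was within the search distance
            and haversine_miles(obs["lon"], obs["lat"], a["lon"], a["lat"]) <= near_miles
        ]
        if matches:
            joined.extend({**obs, "alert": a["name"]} for a in matches)
        else:
            joined.append({**obs, "alert": None})  # unmatched rows survive a left join
    return joined


# Hypothetical observations for one bird on two consecutive days
observations = [
    {"bird": "V1", "lon": -75.9, "lat": 40.6, "time": datetime(2012, 6, 1, 12, 0)},
    {"bird": "V1", "lon": -75.9, "lat": 40.6, "time": datetime(2012, 6, 2, 12, 0)},
]

# Hypothetical alerts at the same place, active at different times
alerts = [
    {"name": "Flood Watch", "lon": -75.8, "lat": 40.7,
     "start": datetime(2012, 6, 1, 6, 0), "end": datetime(2012, 6, 1, 18, 0)},
    {"name": "Heat Advisory", "lon": -75.8, "lat": 40.7,
     "start": datetime(2012, 7, 1, 6, 0), "end": datetime(2012, 7, 1, 18, 0)},
]

joined = spatiotemporal_left_join(observations, alerts)
for row in joined:
    print(row["bird"], row["time"], row["alert"])
```

Both alerts are close enough in space, but only one was active while the bird was nearby, so only that one is attached. The second observation keeps a row with no alert, which is the left-join behavior.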

Data enrichment using discrete bin systems

Enriching data doesn’t always mean that you are bringing in attributes from one or more geographic data sets using spatial or spatiotemporal joins. Sometimes we just need a simple identifier that can be calculated automatically without needing secondary data sources.

For instance, we may want to enrich our data with a key field to work seamlessly with data from another provider, such as the Placekey product from SafeGraph. Placekey is a standardized identifier for defining physical place locations and is based on the Uber H3 grid system. To use our data in conjunction with Placekey data, we need to enrich it to add an H3 ID – but we don’t need to bring in any special dataset for a spatial join. We can do this simply in GeoAnalytics Engine using the ST_H3Bin function to enrich our data with the H3 ID; we just need to specify the geometry field and the H3 resolution.

As an example, we’ll add an H3 ID to the Seattle places data from Overture Maps Foundation that we looked at when we explored enriching by proximity. Adding an H3 ID for each of those points looks like this:

df_places_seattle = df_places_seattle\
    .withColumn("H3_id", ST.bin_id(ST.h3_bin("geometry", 13)))

A resolution of 13 was used for this example, which has an average hexagon area of about 43.9 m². You may want to pick a resolution with larger or smaller hexagons depending on your need. A chart with the details on the different resolutions is available in the H3 library documentation.
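As a rough guide when choosing a resolution, the average cell area can be approximated without any H3 library at all. The sketch below relies on the fact that H3 has 122 cells at resolution 0 and that each finer resolution multiplies the hexagon count by about seven, giving 2 + 120 × 7^res cells. The Earth radius is an assumed value, and the result is an average over all cells (including the 12 pentagons), so it will differ slightly from the published table.

```python
import math

# Assumed mean (authalic) Earth radius in kilometers
AUTHALIC_RADIUS_KM = 6371.0072


def h3_cell_count(resolution):
    """Number of H3 cells at a resolution (0-15), including the 12 pentagons."""
    return 2 + 120 * 7 ** resolution


def average_cell_area_m2(resolution):
    """Approximate average cell area: Earth's surface area divided by the cell count."""
    earth_area_m2 = 4 * math.pi * (AUTHALIC_RADIUS_KM * 1000) ** 2
    return earth_area_m2 / h3_cell_count(resolution)


# Compare a few candidate resolutions
for res in (9, 11, 13):
    print(res, round(average_cell_area_m2(res), 2))
```

For resolution 13 this gives an average of just under 44 m², in line with the figure quoted above. Each step down in resolution gives cells roughly seven times larger.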

Once the new column has been added to enrich our points with H3 IDs, we have a result that looks like this, and it’s now ready to use with Placekey or any other data that includes H3 index IDs.

One location, enriched with H3 identifier

After your data is enriched

Once our data is enriched using one or more of the methods above, we can move on to the next step in our analytics or visualization workflow. This might include using the newly enriched attributes in a machine learning model, creating index values or other derivatives based on the new attributes, or aggregating the data for visualization or other reporting. Using the techniques above, you now have multiple ways to enrich your data to generate the perfect data for your needs!

Let us know about your interesting analytics and how you are enriching your data using the functionality of GeoAnalytics Engine!