Generating network paths from high volume movement data

SBattersby (Esri Contributor), 08-16-2023 07:10 AM

Working with raw GPS data points can be challenging due to the sheer volume of data and the error inherent in GPS coordinates. ArcGIS GeoAnalytics Engine has a number of tools to help process and convert GPS data points so they can be used more easily in your analytics: for instance, SnapTracks to "snap" GPS points to a road network, ReconstructTracks to connect sequential points into tracks, or FindDwellLocations to identify when a tracked asset has been stationary within set time and distance bounds. While these tools are incredibly useful for many datasets, some questions and data types require more complex methodologies. For example, how can you convert the cloud of GPS points for hundreds or thousands of vehicles into a coherent line representing the central path? We'll explore this problem in the context of identifying shipping lanes, using hundreds of millions of data points from the Automatic Identification System (AIS) for shipping vessels.

[Figure: AIS_point_bin_track.jpg]

 

For our analytics, we will incorporate two new features from the GeoAnalytics Engine 1.2 release that make the process significantly easier: ST_Centerline and ST_BboxIntersects.   

 

About the data 

The AIS data that we are using consists of GPS locations for shipping vessels near the U.S. coast. The data are available from MarineCadastre.gov and include details on location, time, vessel type, speed, length, beam, and draft. For this example, we will start with ~242 million raw data points (one month of data).    

[Figure: AIS_point_data_012023.jpg]

 

To more clearly demonstrate the method for creating a set of single linestrings representing the main shipping lanes, we'll focus on data in one area around Galveston Bay in Texas. At the end, we'll look at performance and scale the analysis back up to demonstrate the efficiency of working in GeoAnalytics Engine.

[Figure: AIS_point_data_Galveston_Bay.jpg]

 

Preparing the data for analysis 

The first step in just about every analytics workflow is to clean and prepare the data for analysis. For this workflow we are going to: 

  • Read in the data 
  • Filter on attributes to include only the cargo- and tanker-class vessels 
  • Create geometry and timestamp fields so that we can identify paths based on location and time 
  • Adjust the coordinate system to an equal area projection so that when we spatially bin the data the bins will be the same area 
  • Spatially filter the data to just the Galveston Bay area 

 
This initial step of reading in the data and preparing it for analysis looks like this: 

 

 

# imports: Spark SQL functions and GeoAnalytics Engine SQL functions
from pyspark.sql import functions as F
from geoanalytics.sql import functions as ST

# read AIS observations
observations = spark.read.parquet(f"{data_dir}/{parquet_file}")

# filter to cargo and tanker class vessels
# (done before the select below, which drops the vesseltype column)
observations = observations.where("vesseltype between 70 and 89")

# create geometry and timestamps
observations = observations\
  .select(
        "MMSI",
        F.to_timestamp("BaseDateTime").alias("timestamp"),
        ST.point("lon", "lat", 4269).alias("geometry")
  )

# transform to an equal area projected coordinate system
world_cylindrical_equal_area = 54034
observations = observations.withColumn("geometry", ST.transform("geometry", world_cylindrical_equal_area))

# filter to the analysis extent (xmin, ymin, xmax, ymax)
bay_extent = (-10618709.0729918, 3107437.29107338, -10519821.7905498, 3153061.02094871)
observations = observations.where(ST.bbox_intersects("geometry", *bay_extent))

 

 

At this point, we have a dataset of ~1.2 million records. We can see the shipping lanes taking shape visually, but for now it's just a lot of individual points coalescing into visual lines.
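If you want to verify the size of the prepared dataset yourself, a quick count on the DataFrame will do it (standard PySpark; note that this triggers execution of the lazy transformations above):

# count the prepared observations (forces the transformations above to run)
n = observations.count()
print(f"{n:,} observations after filtering")  # ~1.2 million for this extent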

[Figure: AIS_point_data_cargo_tanker.jpg]

 

When we zoom in, we can see that it’s still a messy dataset with a lot of variation and noise in the paths. 

[Figure: AIS_point_data_cargo_tanker_zoom.jpg]

 

Remove dwells 

In the detailed image above, we can see a particularly large amount of noise north of what looks like the main shipping lane. This is likely where vessels have anchored for periods of time. We don't want that noise to skew our centerline calculations, so we can use the FindDwellLocations tool to identify those points and remove them from our analysis.

In this case, we are defining a “dwell” as when a vessel hasn’t moved more than 500 meters within 120 minutes.   

 

 

# find dwells 
from geoanalytics.tools import FindDwellLocations 

# "Allpoints" returns every input point, with DwellID populated for dwell points
observations_dwell = FindDwellLocations() \
            .setTrackFields("MMSI") \
            .setDistanceMethod(distance_method="Planar") \
            .setDwellMaxDistance(max_distance=500, max_distance_unit="Meters") \
            .setDwellMinDuration(min_duration=120, min_duration_unit="Minutes") \
            .setOutputType(output_type="Allpoints") \
            .run(dataframe=observations) 

 

 

The dwell locations are identified in red below, and they match nicely with the areas just off the main shipping route where ships would reasonably anchor while waiting.

[Figure: AIS_point_dwell_locations.jpg]
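If you want to inspect the dwell points on their own (the red points in the image above), you can filter on the DwellID field that the Allpoints output adds to each input point; a minimal sketch:

# dwell points have a populated DwellID; points in transit have a null DwellID
dwell_points = observations_dwell.where("DwellID is not Null")
moving_points = observations_dwell.where("DwellID is Null")

print(dwell_points.count(), "dwell points")
print(moving_points.count(), "points in transit")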

 

Convert AIS data into tracks for each vessel 

Now that the dwell points have been identified, we can easily select the remaining points and convert them into tracks for each vessel. To do this, we use the ReconstructTracks tool to connect the GPS points for each individual vessel (identified by the MMSI attribute), limiting the input to points where DwellID is null. This results in 1,172 unique vessel tracks.

 

 

from geoanalytics.tools import ReconstructTracks 

tracks = ReconstructTracks()\
  .setTrackFields("MMSI")\
  .setDistanceMethod("Planar")\
  .setDistanceSplit(1, "kilometer")\
  .run(observations_dwell.where("DwellID is Null"))\
  .localCheckpoint() 
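As a quick sanity check, counting the rows of the output confirms the number of reconstructed tracks (each row is one track geometry per vessel, split wherever consecutive points are more than 1 kilometer apart):

# one row per reconstructed track
print(tracks.count(), "vessel tracks")  # 1,172 for this extent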

 

 


[Figure: AIS_track_data.jpg]

 

Now we have more than 1,100 unique vessel tracks, but we still need to reduce them to a single polyline representing the shipping path. We can do that by:

  • aggregating the tracks into bins 
  • identifying the bins with the greatest traffic through them 
  • converting the path through these high-traffic bins into shipping paths by finding the centerline through the bins  

 

Aggregate the tracks into bins and identify the heaviest traffic bins 

While we can visually pick out the overall pathway most traveled by the vessels, since their tracks coalesce into a single thick black line on the map, we don't have a sense of where within that line the traffic is heaviest. We can identify that by spatially binning the data and counting the number of tracks crossing each bin. In this case, we'll use a set of 50-meter square bins to find the areas with the greatest traffic.

 

 

bin_size = 50 # meters 
bin_buffer = 50 # meters 

# explode each track into the square bins it crosses,
# then count the number of tracks crossing each bin
bins = tracks.withColumn("bin", F.explode(ST.square_bins("track_geometry", bin_size, bin_buffer)))\
  .groupBy("bin").count()\
  .withColumn("bin_geometry", ST.bin_geometry("bin"))\
  .localCheckpoint() 

 

 

That gives us a nice set of data showing frequency of vessel traffic.  

[Figure: AIS_vessel_traffic_all.jpg]

 

Now we can clearly see the most traversed lane in the middle, but there is a lot of noise surrounding it, likely from ships traveling to or from anchor points or veering slightly off the main path. We can remove that noise by filtering down to just the bins with more than, say, 50 vessels traveling through that location. While this threshold is appropriate for this dataset, if you are working with much larger datasets your filter value will likely be higher.
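The filter itself is a one-liner on the binned DataFrame; a minimal sketch (busy_bins is just an illustrative name, and the threshold of 50 is specific to this dataset):

# keep only the heavily traveled bins; tune the threshold to your data volume
busy_bins = bins.where("count > 50")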

Filtering to just the bins with more than 50 vessels traveling through gives us a cleaner view: 

[Figure: AIS_vessel_traffic_simplified.jpg]

 

But these are just square bins; how do we turn them into a shipping lane?

 

Union and generalize the bins to create travel regions 

To convert the bins into a polygon showing the most traveled region, we simply union them together into a single polygon. Then, to make it less 'blocky', we generalize the result to smooth out the edges.

 

 

unioned = bins.where("count > 50")\ 
  .groupBy().agg(ST.aggr_union("bin_geometry").alias("unioned"))\ 
  .withColumn("unioned", ST.generalize("unioned", 50))\ 
  .localCheckpoint() 

 

 

 

[Figure: AIS_unioned_traffic.jpg]

 

Pull out the centerline for the travel regions 

Once we have that unioned polygon, we can quickly pull out the final shipping lane by taking its centerline.

 

 

# use the centerline tool to create the "route" polyline 
centerline = unioned.withColumn("centerline", ST.centerline("unioned")).drop("unioned")\ 
  .localCheckpoint() 

 

 

That leaves us with a nice single polyline representation of the common pathways! 

[Figure: AIS_centerline.jpg]

 

Here is the final result for the whole Galveston Bay region, from raw GPS data points to streamlined shipping lane polylines! 

[Figure: AIS_points_to_centerline.jpg]

 

Performance 

On the Galveston Bay dataset for cargo and tanker vessels (~1.2 million records), the entire workflow took 38.6 seconds in a local Spark cluster on a mid-range laptop!

For the entire dataset (~52 million records when reduced to cargo and tanker vessels), the workflow took 4 minutes and 4 seconds in the same local Spark cluster. Using a larger bin size of 300 meters, which is more appropriate for small-scale mapping (a large geographic area) than the 50-meter bins used for the Galveston Bay subset, reduces the processing time to about 90 seconds. Even better performance can be achieved by moving this workflow to a larger cloud computing cluster that is optimized for data processing at scale. With a more performant cluster, this workflow can easily scale up to hundreds of millions, or billions, of data points!
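If you want to compare run times for different bin sizes yourself, a simple wall-clock timer around the binning-through-centerline steps is enough. Here is a sketch that reuses the steps from above with only the Python standard library added; note that the count threshold of 50 would likely need retuning for other bin sizes and datasets, and scaling the generalize tolerance with the bin size is a modeling choice:

import time

for bin_size in (50, 300):
    start = time.perf_counter()
    result = tracks.withColumn("bin", F.explode(ST.square_bins("track_geometry", bin_size, bin_size)))\
        .groupBy("bin").count()\
        .withColumn("bin_geometry", ST.bin_geometry("bin"))\
        .where("count > 50")\
        .groupBy().agg(ST.aggr_union("bin_geometry").alias("unioned"))\
        .withColumn("unioned", ST.generalize("unioned", bin_size))\
        .withColumn("centerline", ST.centerline("unioned"))
    result.collect()  # force execution to measure the end-to-end time
    print(f"{bin_size} m bins: {time.perf_counter() - start:.1f} s")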

While the example we walked through demonstrated a use case for creating shipping lane data, this technique will work for other datasets as well, for instance, to map new roads or pathways through areas that are off the street network (e.g., parking lots). We hope this gives you some ideas for working with high-volume GPS track data in your own workflows. We would love to hear how you might use these tools and this methodology in your analysis, or if you want to know more about other GeoAnalytics Engine tools and functions. Please feel free to provide feedback or ask questions in the comments below.

 

Blog authored by Sarah Battersby, Product Manager for ArcGIS GeoAnalytics Engine, and Michael Park, Development Lead for GeoAnalytics Engine.