Some basic exploration into the generic point in... style questions.
A recent comment about Select By Location taking a long time when point sets were large got me thinking. The background on this topic generally leave one staring a reams of code with little to no explanation on the logic. I will continue with that trend .
First the code. I use python and numpy to do a lot of the work and we will use arcpy to communicate with ArcMap or ArcGIS Pro.
I created a featureclass in a file geodatabase of points, and a polygon representing a selection polygon. The following script was used to generate the data prior to testing the code. The sample was small for now, 10,000 points and one polygon (square). More on scaling at a later time.
Explore.
"""
:Script: point_in_rect
:Author: Dan.Patterson@carleton.ca
:Purpose:
: To determine whether points are within the extents of polygons.
:
:References:
: http://stackoverflow.com/questions/30481577/
: assign-numpy-array-of-points-to-a-2d-square-grid
: http://stackoverflow.com/questions/33051244/
: numpy-filter-points-within-bounding-box
: https://stackoverflow.com/questions/33051244/
: numpy-filter-points-within-bounding-box/33051576
:
"""
import numpy as np
import arcpy
np.set_printoptions(edgeitems=5, linewidth=80, precision=4,
suppress=True, threshold=20)
def contains(pnts, ext, in_out=True):
"""Performs a logical_and to find points within a box/ext
:Requires:
:--------
: pnts - an array of points
: ext - the extent of the rectangle being tested as an array of the
: left bottom (LB) and upper right (RT) coordinates
: in_out - boolean, whether to return inside and outside points
:
:Notes:
:-----
: comp - np.logical_and( great-eq LB, less RT)
: inside - np.where(np.prod(comp, axis=1) == 1)
: case - comp returns [True, False] so you take the product
: idx_in - indices derived using where since case will be 0 or 1
: inside - slice the pnts using idx_in
"""
outside = None
LB, RT = ext
comp = np.logical_and((LB <= pnts), (pnts <= RT))
case = comp[:, 0] * comp[:, 1]
idx_in = np.where(case)[0]
inside = pnts[idx_in]
if in_out:
idx_out = np.where(~case)[0]
outside = pnts[idx_out]
return inside, outside
if __name__ == "__main__":
"""make some points for testing, create and extent,
"""
ext = np.array([[342000, 5022000], [343000, 5023000]])
in_fc = r'C:\Git_Dan\a_Data\testdata.gdb\xy_10k'
SR = arcpy.SpatialReference(2951)
a = arcpy.da.FeatureClassToNumPyArray(in_fc,
['SHAPE@X', 'SHAPE@Y'],
spatial_reference=SR)
a0 = a.view(dtype=np.float).reshape(len(a), 2)
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\Git_Dan\a_Data\testdata.gdb'
arcpy.MakeFeatureLayer_management('xy_10k', 'xy_lyr')
arcpy.management.SelectLayerByLocation("xy_lyr", "INTERSECT", "subpoly",
None, "NEW_SELECTION", "NOT_INVERT")
matchcount = int(arcpy.GetCount_management('xy_lyr')[0])
The results are shown in the figure below.
In this example, the script was run as it was, then the IPython magic %timeit was used to time just the portions that didn't require data preparation.
In both cases, 401 points were returned from a featureclass of 10,000 points. The timing results are below.
Timing Results
(1) Numpy
Timing contains with the array created from the points.
%timeit contains(a0, ext, in_out=False)
195 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
(2) ArcGIS Pro and SelectByLocation
Just timing the final SelectLayerByLocation
%timeit arcpy.management.SelectLayerByLocation("xy_lyr", "INTERSECT", "subpoly", None, "NEW_SELECTION", "NOT_INVERT")
308 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(3) arcpy in ArcGIS Pro
In this incarnation, I use the same files and extract the geometry from them and perform a point 'within' polygon. The following shows the appropriate additions.
def src_geom(pnt_fc, poly_fc):
""" a search cursor approach
"""
pnts = [p[0] for p in arcpy.da.SearchCursor(pnt_fc, "SHAPE@")]
polys = [pl[0] for pl in arcpy.da.SearchCursor(poly_fc, "SHAPE@")]
poly = polys[0]
is_in = [pnt.within(poly) for pnt in pnts]
return pnts, poly, is_in
if __name__ == "__main__":
"""searchcursor, arcpy point within polygon approach
"""
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\Git_Dan\a_Data\testdata.gdb'
pnt_fc = r'C:\Git_Dan\a_Data\testdata.gdb\xy_10k'
poly_fc = r'C:\Git_Dan\a_Data\testdata.gdb\subpoly'
pnts, poly, is_in = src_geom(pnt_fc, poly_fc)
pnts_in = sum(is_in)
The result yielded the same number of points ( ie pnts_in = 401)
%timeit src_geom(pnt_fc, poly_fc)
472 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(4) MatPlotLib and Path.contains_point
MatPlotLib's 'path' module can be used to implement point in polygon testing. Since it is available within the Anaconda suite, then it is included here for comparison purposes.
from matplotlib.path import Path
pip = Path.contains_point
def _extent(ext):
"""construct the extent rectangle from the extent points which are the
: lower left and upper right points [LB, RT]
"""
LB, RT = ext
L, B = LB
R, T = RT
box = [LB, [L, T], RT, [R, B], LB]
ext_rect = np.array(box)
return ext_rect
def containsmpl(pnts, ext):
"""matplotlib incarnation
"""
poly = _extent(ext)
poly = matplotlib.path.Path(poly)
is_in = [pip(poly, p) for p in pnts]
return np.array(is_in)
Using the numpy array objects for the points and the polygon, as inputs, the results are as follows:
%timeit containsmpl(a, ext)
24.7 ms ± 761 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I think it uses the 'crossing number' method of testing which would be better for general polygons, something which has to be added to the pure numpy approach, after the selection of the points in the extent has been completed.
To come...
(4) pandas in ArcGIS
(5) Other methods to assess 'containment'
Results so far... In all cases, the result was 401 points within the one polygon.
Numpy 0.195 ms (195 microseconds)
MatPlotLib 24.7 ms
SelectByLocation 308 ms
arcpy 472 ms