features with identical geometry WKT representation not equal

06-11-2013 08:07 AM
JamieKass
Occasional Contributor
The results of arcpy.FindIdentical() with "Shape" selected as field option show two features in my dataset to be identical in geometry, yet when I run a search cursor through it, and append the WKT of the two geometries into a list, these two strings are not equal (they even have different lengths). I thought the WKT was a representation of the geometry in string form. The JSON representation is also unequal.
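For reference, here is roughly the check I'm doing (the feature class path and the two OIDs flagged by FindIdentical are placeholders):

import arcpy

fc = r"C:\temp\test.gdb\buffers"  # placeholder path
oidPair = (10, 11)                # placeholder OIDs that FindIdentical reported as identical

wktByOid = {}
with arcpy.da.SearchCursor(fc, ["OID@", "SHAPE@WKT"]) as cursor:
    for oid, wkt in cursor:
        if oid in oidPair:
            wktByOid[oid] = wkt

# the strings have different lengths and compare unequal
print(len(wktByOid[oidPair[0]]), len(wktByOid[oidPair[1]]))
print(wktByOid[oidPair[0]] == wktByOid[oidPair[1]])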
7 Replies
ChrisSnyder
Regular Contributor III
I don't have any experience with the WKT values, but how different are the values (you mention length)?

I have had issues with geometries that include "true curve" features that should be identical but in fact have ever-so-slightly different area/perimeter/centroid values - I am talking about a difference of something like 0.00001 map units, but enough to throw my script logic out of whack.
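In my case I ended up comparing the geometry properties with a small tolerance instead of a straight == - a rough sketch (the tolerance value is just an example):

def nearlyIdentical(geom1, geom2, tol=0.0001):
    # treat two arcpy geometries as identical if area and centroid agree within tol map units
    return (abs(geom1.area - geom2.area) <= tol and
            abs(geom1.centroid.X - geom2.centroid.X) <= tol and
            abs(geom1.centroid.Y - geom2.centroid.Y) <= tol)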
JamieKass
Occasional Contributor
The string lengths of the WKT values differ, and although the differences are slight, any difference renders them unequal. The FindIdentical tool evidently has some internal geometry tolerance that allows it to overlook these slight differences, and I wonder if we can simulate that in some way by altering the WKT values. I am trying to explore geometries this way because running FindIdentical on large feature classes (>100,000 features) often crashes or takes forever to run.
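The kind of "altering" I had in mind is just rounding every coordinate in the WKT string before comparing - a rough sketch (the number of decimals is a guess, and it only helps if both strings contain the same number of vertices):

import re

def roundWkt(wkt, ndigits=3):
    # round every numeric token in the WKT so tiny coordinate differences drop out
    fmt = '.{0}f'.format(ndigits)
    return re.sub(r'-?\d+\.\d+',
                  lambda m: format(round(float(m.group(0)), ndigits), fmt),
                  wkt)

# roundWkt(wktA) == roundWkt(wktB) may then be True where the raw strings differ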
ChrisSnyder
Regular Contributor III
Assuming your geometries have some true curves in them, I bet that if you convert your FC to a shapefile, then look at the WKT values, they "should" then be ==.

About a year ago I was in contact with ESRI about this issue (true-curve geometry being slightly off)... it didn't go anywhere, though. ESRI is committed to curve geometry, and there are no plans for an environment setting to keep the GDB geometry densified (which seems to prevent these minute geometry-property differences).
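Something along these lines is what I mean (paths are placeholders) - copying the data out to a shapefile densifies the true curves, since shapefiles can't store them, and then the WKT values should line up:

import arcpy

gdbFC = r"C:\temp\test.gdb\buffers"       # placeholder
shpFC = r"C:\temp\buffers_densified.shp"  # placeholder
arcpy.CopyFeatures_management(gdbFC, shpFC)

wktByOid = {}
with arcpy.da.SearchCursor(shpFC, ["OID@", "SHAPE@WKT"]) as cursor:
    for oid, wkt in cursor:
        wktByOid[oid] = wkt
# now compare the WKT of the two suspect features again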
JamieKass
Occasional Contributor
No, they are still different, although the string length difference has changed from 4 to 1. (???) I am running this on a buffer product, and likely these true curves are the culprit. My question is: how does FindIdentical do it? I just need to replicate that without the bells and whistles that slow it down so much. All I'm really interested in is which shapes are identical to each other after a union operation on buffers.
ChrisSnyder
Regular Contributor III
Not sure, but I bet FindIdentical simply rounds the values - maybe with the xy tolerance value (or some derivation of it) of the FC?
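If you want to test that theory, something like this (untested) would snap the values to the feature class's XY tolerance before comparing:

xyTol = arcpy.Describe(fc).spatialReference.XYTolerance  # fc = your feature class (placeholder)

def snapToTolerance(value, tol):
    # round a coordinate or area value to the nearest multiple of the tolerance
    return round(value / tol) * tol

# e.g. snapToTolerance(shape1.centroid.X, xyTol) == snapToTolerance(shape2.centroid.X, xyTol)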

All I'm really interested in is which shapes are identical to each other after a union operation on buffers.


I assume to figure out overlaps? I do this too... I just round the values of the centroid x/y coordinates and the polygon area. If it helps give you ideas:

#Flatten this puppy!
shatteredFC = fgdbPath + "\\shattered"
arcpy.Union_analysis(dissolveFC, shatteredFC, "ALL", "1 FEET", "GAPS"); showGpMessage()
singlePartFC = fgdbPath + "\\single_part"
arcpy.MultipartToSinglepart_management(shatteredFC, singlePartFC); showGpMessage()
searchRows = arcpy.da.SearchCursor(singlePartFC, ["SHAPE@","*"])
polyIdDict = {}
polyIdValue = 1
decimalTolerance = 2
for searchRow in searchRows:
    shapeFieldValue = searchRow[searchRows.fields.index("SHAPE@")]
    xCentroidValue = round(shapeFieldValue.centroid.X, decimalTolerance)
    yCentroidValue = round(shapeFieldValue.centroid.Y, decimalTolerance)
    areaValue = round(shapeFieldValue.area, decimalTolerance)
    axyValue = (xCentroidValue,yCentroidValue,areaValue)
    catagoryNameValue = searchRow[searchRows.fields.index("CATEGORY")]
    releaseYearValue = searchRow[searchRows.fields.index("RELEASE_YR")]
    retentionPctValue = searchRow[searchRows.fields.index("RETENTION_PCNT")]
    forestedFlagValue = searchRow[searchRows.fields.index("FORESTED")]
    if axyValue not in polyIdDict:
        polyIdDict[axyValue] = polyIdValue, [catagoryNameValue], [releaseYearValue],[retentionPctValue],[forestedFlagValue]
        polyIdValue = polyIdValue + 1
    else:
        polyIdDict[axyValue][1].append(catagoryNameValue)
        polyIdDict[axyValue][2].append(releaseYearValue)
        polyIdDict[axyValue][3].append(retentionPctValue)
        polyIdDict[axyValue][4].append(forestedFlagValue)
del searchRow, searchRows
#Sort the thing the way we want it
for axyValue in polyIdDict:
    polyIdDict[axyValue][1].sort() #catagory name
    polyIdDict[axyValue][2].sort(reverse=True) #release year
    polyIdDict[axyValue][3].sort(reverse=True) #retention percent
    polyIdDict[axyValue][4].sort() #forested
arcpy.AddField_management(singlePartFC, "POLY_ID", "LONG"); showGpMessage()  
arcpy.AddField_management(singlePartFC, "LCL_RSN", "TEXT", "", "", "150"); showGpMessage()
arcpy.AddField_management(singlePartFC, "RELEASE_YR_MAX", "SHORT"); showGpMessage()
arcpy.AddField_management(singlePartFC, "RETENTION_PCNT_MAX", "SHORT"); showGpMessage()
arcpy.AddField_management(singlePartFC, "FORESTED_MAX", "TEXT", "", "", "1"); showGpMessage()
arcpy.AddField_management(singlePartFC, "LCL_RP_FLG", "SHORT"); showGpMessage()
arcpy.AddField_management(singlePartFC, "LCL_UP_FLG", "SHORT"); showGpMessage()
arcpy.AddField_management(singlePartFC, "LCL_US_FLG", "SHORT"); showGpMessage()
arcpy.AddField_management(singlePartFC, "LCL_WT_FLG", "SHORT"); showGpMessage()
updateRows = arcpy.da.UpdateCursor(singlePartFC, ["SHAPE@","*"])
for updateRow in updateRows:
    shapeFieldValue = updateRow[updateRows.fields.index("SHAPE@")]
    xCentroidValue = round(shapeFieldValue.centroid.X, decimalTolerance)
    yCentroidValue = round(shapeFieldValue.centroid.Y, decimalTolerance)
    areaValue = round(shapeFieldValue.area, decimalTolerance)
    axyValue = (xCentroidValue,yCentroidValue,areaValue)
    updateRow[updateRows.fields.index("POLY_ID")] = polyIdDict[axyValue][0]
    updateRow[updateRows.fields.index("LCL_RSN")] = ",".join(i for i in sorted(set(polyIdDict[axyValue][1])))
    updateRow[updateRows.fields.index("RELEASE_YR_MAX")] = polyIdDict[axyValue][2][0]
    updateRow[updateRows.fields.index("RETENTION_PCNT_MAX")] = polyIdDict[axyValue][3][0]
    updateRow[updateRows.fields.index("FORESTED_MAX")] = polyIdDict[axyValue][4][0]
    if "RIPARIAN_AREA" in polyIdDict[axyValue][1] or "RMZ" in polyIdDict[axyValue][1]:
        updateRow[updateRows.fields.index("LCL_RP_FLG")] = 1
    if "WETLAND" in polyIdDict[axyValue][1] or "WMZ" in polyIdDict[axyValue][1]:
        updateRow[updateRows.fields.index("LCL_WT_FLG")] = 1
    if 'SLOPE_STABILITY_ISSUE' in polyIdDict[axyValue][1] or 'SLOPE_STABILITY_POTENTIAL' in polyIdDict[axyValue][1] or 'SLOPE_STABILITY_VERIFIED' in polyIdDict[axyValue][1] or 'UNSTABLE_SLOPES' in polyIdDict[axyValue][1]:
        updateRow[updateRows.fields.index("LCL_US_FLG")] = 1
    #A bit of a work around just for SPS - untl the new schema is completed...
    if 'AREA_REGULATION' in polyIdDict[axyValue][1] and polyIdDict[axyValue][3][0] < 50:
        updateRow[updateRows.fields.index("RETENTION_PCNT_MAX")] = 50
    updateRows.updateRow(updateRow)
del updateRow, updateRows
JamieKass
Occasional Contributor
Yes, I developed something similar that creates a unique string representation of area, centroid X, and centroid Y. It is imperfect, as different datasets may require tweaking of the decimal tolerances, but it works alright the way it is.

def repGeomString(shp):
    return str(round(shp.area)) + '|' + str(round(shp.centroid.X, 2)) + '|' + str(round(shp.centroid.Y, 2))


I call this in a search cursor loop and add it to a dictionary with the OID as the key, much the same way you did. I have run into polygons with mysterious null centroids, but if you account for them somehow, that's the only trip-up. None of this would be necessary if FindIdentical worked on large datasets; I wish the tool's internal tolerances were exposed so we could improve these imperfect geometry estimators. Below is a sample function I wrote that returns a dictionary relating each OID to its overlapping OIDs, based on whether their repGeomString values are equal. Buffering often introduces small changes in shapes that should be identical, so the function gets thrown off sometimes, but it usually works quite well.

def fasterFindID(fc):
    """Records geometrical attributes of all features, then returns a dictionary:
        overlapDict: key = OID; value = identical overlap OIDs. The parameter fc
        is assumed to be the product of an overlap operation like Union or Intersect."""
    # this creates a representation of a geometry string by concatenating various attributes to create a "unique" identifier
    def repGeomString(shp):
        return str(round(shp.area)) + '|' + str(round(shp.centroid.X, 2)) + '|' + str(round(shp.centroid.Y, 2))

    # two dictionaries to keep track of duplicate geometries and their OIDs
    geomDictByOID = {}  # key = OID; value = represented geometry string
    uniqueGeoms = {}  # key = represented geometry string; value = list of OIDs with same geometry

    oidFieldName = arcpy.Describe(fc).oidFieldName

    nullCentroids = []

    # gather geometries, and if centroid null, append to list
    with arcpy.da.SearchCursor(fc, ("OID@", "SHAPE@")) as cursor:
        for row in cursor:
            try:
                # get full k,v pair of OID and geometries
                geomDictByOID[row[0]] = repGeomString(row[1])
                # initialize empty list as value of unique geometry key
                uniqueGeoms[repGeomString(row[1])] = []
            except:
                nullCentroids.append(row[0])

    # if null centroids exist, convert features to point and reread centroids
    # post-conversion, centroids will exist... for some reason
    if len(nullCentroids) > 0:
        # Format to string tuple for SQL query, and remove trailing comma for len = 1
        if len(nullCentroids) > 1:
            nullCentroidsFormat = str(tuple(nullCentroids))
        else:
            nullCentroidsFormat = str(tuple(nullCentroids)).replace(',', '')
        arcpy.MakeFeatureLayer_management(fc, 'nLyr', '"{0}" IN {1}'.format(oidFieldName, nullCentroidsFormat))
        nullCentroidsPts = arcpy.FeatureToPoint_management('nLyr', arcpy.Describe(fc).name + 'NullCentroids')
        # loop through null centroid features and read new centroids
        # use ORIG_FID (added by FeatureToPoint) so the key matches the original polygon OID
        with arcpy.da.SearchCursor(nullCentroidsPts, ("ORIG_FID", "SHAPE@")) as cursor:
            for row in cursor:
                # get full k,v pair of OID and geometries
                geomDictByOID[row[0]] = repGeomString(row[1])
                # initialize empty list as value of unique geometry key
                uniqueGeoms[repGeomString(row[1])] = []

    # for each geometry key in uniqueGeoms, append all OIDs that match
    for oid, g in geomDictByOID.iteritems():
        uniqueGeoms[g].append(oid)

    # create dictionary that relates each oid to its overlaps
    overlapDict = {}
    for oid, g in geomDictByOID.iteritems():
        if len(uniqueGeoms[g]) > 1:
            overlapDict[oid] = [k for k in uniqueGeoms[g] if k != oid]

    # Delete potentially large dicts
    del geomDictByOID
    del uniqueGeoms

    return overlapDict
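Called like this (the path is a placeholder), the result maps each OID in the unioned feature class to the OIDs of the features it overlaps:

overlapDict = fasterFindID(r"C:\temp\test.gdb\buffers_union")
for oid, overlapOids in overlapDict.iteritems():
    print(oid, overlapOids)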