Appending - how to avoid duplicates?

03-27-2017 06:50 AM
TheoFaull
Occasional Contributor III

I have two point shapefiles, both with exactly the same fields. Comparing the two datasets, there are some new records AND some duplicate records.

I want to use the Append tool because I don't want to create a new dataset; I just want to add data to the existing original shapefile.

However, when I append the two shapefiles, the matching records are appended too, leaving lots of duplicate records. How can I tell my script to append only the new records and ignore the duplicates?

29 Replies
BruceHarold
Esri Regular Contributor

You could certainly script that.
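
For instance, a minimal sketch of that idea, assuming both shapefiles share a unique key field (all paths and field names below are hypothetical):

import arcpy

# Hypothetical paths and key field - substitute your own
target = r'C:\data\original_points.shp'
source = r'C:\data\new_points.shp'
key = 'POINT_ID'

# Collect the keys already in the original shapefile
existing = set()
with arcpy.da.SearchCursor(target, [key]) as cursor:
    for row in cursor:
        existing.add(row[0])

# Find the source records whose key is not in the target yet
# (this assumes string keys; drop the quotes for numeric ones)
new_keys = [row[0] for row in arcpy.da.SearchCursor(source, [key])
            if row[0] not in existing]
if new_keys:
    where = '"{0}" IN ({1})'.format(
        key, ', '.join("'{0}'".format(k) for k in new_keys))
    arcpy.MakeFeatureLayer_management(source, 'new_only', where)
    # Append honours the layer's definition query, so only the
    # genuinely new records land in the original shapefile
    arcpy.Append_management('new_only', target, 'TEST')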

TheoFaull
Occasional Contributor III

Bruce Harold

Yes, it's the stopping and starting of the service that I can't figure out.

We have ArcGIS Desktop 10.4 with Python 2.7.10

and

ArcGIS for Server 10.0 with Python 2.6.5 on our separate server machine (we host a lot of mapping data on this server, including a web map interface that the whole company uses). It's old but it works, and we're reluctant to upgrade the software (no time to!).

MitchHolley1
MVP Regular Contributor

You could have a script write a comment to a text field for the records that are duplicates, then append only the records that do not have a comment.

import arcpy

shp1 = r'...path to shapefile1...'
shp2 = r'...path to shapefile2...'

# Collect the key values from the first shapefile
# (the field argument is the common field between both datasets)
shp1keys = set()
with arcpy.da.SearchCursor(shp1, ['field_with_dups']) as cursor:
    for row in cursor:
        shp1keys.add(row[0])

# Flag records in the second shapefile whose key already exists in
# the first, writing 'DUPLICATE' to the 'Comment' field
with arcpy.da.UpdateCursor(shp2, ['field_with_dups', 'Comment']) as cursor:
    for row in cursor:
        if row[0] in shp1keys:
            row[1] = 'DUPLICATE'
            cursor.updateRow(row)
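
The append step could then pick up just the un-flagged records, along these lines (a sketch reusing the placeholder paths above):

# Keep only the records not flagged as duplicates, then append them
arcpy.MakeFeatureLayer_management(shp2, 'non_dups',
                                  "\"Comment\" <> 'DUPLICATE'")
arcpy.Append_management('non_dups', shp1, 'TEST')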
TheoFaull
Occasional Contributor III

Yes, true, but two duplicate records may not always be identical: the newer one may contain updated information. So how could the script decide which record is the newer, updated one, and keep that?
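
(For example, if the data carried a last-edited date, that could be the tiebreaker. A sketch, assuming a hypothetical LAST_EDIT date field alongside a shared POINT_ID key:)

import arcpy

# Hypothetical paths and field names - substitute your own
original = r'C:\data\original_points.shp'
updates = r'C:\data\updated_points.shp'

# Record the last-edited date of every incoming record
incoming = {}
with arcpy.da.SearchCursor(updates, ['POINT_ID', 'LAST_EDIT']) as cursor:
    for key, edited in cursor:
        incoming[key] = edited

# Drop original records that the incoming data supersedes; a
# key-filtered append (as sketched earlier) then brings in the new
# and newer records without creating duplicates
with arcpy.da.UpdateCursor(original, ['POINT_ID', 'LAST_EDIT']) as cursor:
    for key, edited in cursor:
        if key in incoming and incoming[key] > edited:
            cursor.deleteRow()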

JoshuaBixby
MVP Esteemed Contributor

Not to nitpick on semantics, but duplicate and identical are synonyms.  If you are talking about a subset of fields, then which fields do you want to compare to determine whether a record should be updated/overwritten?  You have described your overall workflow goals, but you haven't described the dataset beyond it being points.  The more specifics you share, the more specific the responses.

TheoFaull
Occasional Contributor III

Of course. I haven't gone into details just yet, and I thoroughly appreciate Mitch's response. Anyway, there are 20 fields and 25,000 records. All the data in these records is subject to change, excluding a couple of ID fields.

What I think I really need is to delete the existing shapefile and replace it with the updated version. However, a LOCK file originating from the server (which accesses the shapefile) stops me from deleting the dataset. I need a script that will (a rough sketch follows the list):

1. Stop the Server GIS service

2. Delete Points.shp

3. Create an updated version of Points.shp in the same directory with the same name

4. Start the Server GIS service
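
For steps 1 and 4, ArcGIS Server 10.1 and later expose an Administrator REST API (your 10.0 server predates it, so this would only apply after an upgrade). A sketch against that API, with hypothetical server details, credentials, and service name:

import json
import urllib
import urllib2

# Hypothetical server, credentials, and service - substitute your own
admin = 'http://gisserver.example.com:6080/arcgis/admin'
service = 'Points.MapServer'

def post(url, params):
    """POST to the admin API and return the parsed JSON response."""
    return json.loads(urllib2.urlopen(url, urllib.urlencode(params)).read())

# Get an admin token
token = post(admin + '/generateToken',
             {'username': 'siteadmin', 'password': 'secret',
              'client': 'requestip', 'f': 'json'})['token']

# 1. Stop the service so the lock on Points.shp is released
post('{0}/services/{1}/stop'.format(admin, service),
     {'token': token, 'f': 'json'})

# 2. and 3. Delete and recreate Points.shp here, e.g. with
# arcpy.Delete_management() and arcpy.CopyFeatures_management()

# 4. Start the service again
post('{0}/services/{1}/start'.format(admin, service),
     {'token': token, 'f': 'json'})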

NeilAyres
MVP Alum

Generally, when updating data that is used by a service, we truncate the table and then append the new data. That doesn't seem to upset things. Deleting is a no-no.
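
In arcpy terms the pattern is roughly this (a sketch with hypothetical paths; the Truncate Table tool needs 10.1 or later and non-versioned data):

import arcpy

# Hypothetical paths - the service reads from the target dataset
source = r'C:\staging\nightly.gdb\Points'
target = r'C:\published\live.gdb\Points'

# Empty the published feature class in place; its schema (and the
# service's reference to it) stays intact, so no lock problems
arcpy.TruncateTable_management(target)

# Reload the fresh data; schemas match, so 'TEST' is safe
arcpy.Append_management(source, target, 'TEST')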

TheoFaull
Occasional Contributor III

Is that the 'Truncate Table' tool you use? And does your method work even if there's a LOCK file on the data?

NeilAyres
MVP Alum

Where is the data - fgdb or SDE?

Normally for publishing purposes (and not for editable features), we copy over from an enterprise system to a fgdb every night.

That uses truncate and append. Doesn't affect the services based on these features.

Not an expert on how all this works myself, just aware of the process.