Appending - how to avoid duplicates?

03-27-2017 06:50 AM
TheoFaull
Occasional Contributor III

I have two point shapefiles, both with exactly the same fields. Comparing the two datasets, there are some new records AND some duplicate records.

I want to use the Append tool because I don't want to create a new dataset; I just want to add data to the existing original shapefile.

However, when I append the two shapefiles, the matching records are appended too, leaving lots of duplicate records. How can I tell my script to append only the new records and ignore the duplicates?

29 Replies
BruceHarold
Esri Regular Contributor

You could certainly script that.
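
For instance, a minimal sketch of that idea, assuming both shapefiles share a unique key field (all paths and field names below are hypothetical):

import arcpy

# Hypothetical paths and key field - substitute your own
target = r'C:\data\original_points.shp'
source = r'C:\data\new_points.shp'
key = 'POINT_ID'

# Collect the keys already in the original shapefile
existing = set()
with arcpy.da.SearchCursor(target, [key]) as cursor:
    for row in cursor:
        existing.add(row[0])

# Find the source records whose key is not in the target yet
# (this assumes string keys; drop the quotes for numeric ones)
new_keys = [row[0] for row in arcpy.da.SearchCursor(source, [key])
            if row[0] not in existing]
if new_keys:
    where = '"{0}" IN ({1})'.format(
        key, ', '.join("'{0}'".format(k) for k in new_keys))
    arcpy.MakeFeatureLayer_management(source, 'new_only', where)
    # Append honours the layer's definition query, so only the
    # genuinely new records land in the original shapefile
    arcpy.Append_management('new_only', target, 'TEST')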

TheoFaull
Occasional Contributor III

Bruce Harold

Yes, it's the stopping and starting of the service that I can't figure out.

We have ArcGIS Desktop 10.4 with Python 2.7.10

and

ArcGIS for Server 10.0 with Python 2.6.5 on our separate server machine (we host a lot of mapping data on this server, including a web map interface that the whole company uses). It's old but it works, and we're reluctant to upgrade the software (no time to!).

MitchHolley1
MVP Regular Contributor

You could have a script write a comment to a text field for the records that are duplicates, then append only the records that do not have a comment.

import arcpy

shp1 = r'...path to shapefile1...'
shp2 = r'...path to shapefile2...'

# Collect the key values from the first shapefile
# (the field argument is the common field between both datasets)
shp1keys = set()
with arcpy.da.SearchCursor(shp1, ['field_with_dups']) as cursor:
    for row in cursor:
        shp1keys.add(row[0])

# Flag records in the second shapefile whose key already exists in
# the first, writing 'DUPLICATE' to the 'Comment' field
with arcpy.da.UpdateCursor(shp2, ['field_with_dups', 'Comment']) as cursor:
    for row in cursor:
        if row[0] in shp1keys:
            row[1] = 'DUPLICATE'
            cursor.updateRow(row)
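
The append step could then pick up just the un-flagged records, along these lines (a sketch reusing the placeholder paths above):

# Keep only the records not flagged as duplicates, then append them
arcpy.MakeFeatureLayer_management(shp2, 'non_dups',
                                  "\"Comment\" <> 'DUPLICATE'")
arcpy.Append_management('non_dups', shp1, 'TEST')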
TheoFaull
Occasional Contributor III

Yes, true, but two duplicate records may not always be identical: the newer one may contain updated information. So how could the script decide which record is the newer, updated one, and keep that?
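
(For example, if the data carried a last-edited date, that could be the tiebreaker. A sketch, assuming a hypothetical LAST_EDIT date field alongside a shared POINT_ID key:)

import arcpy

# Hypothetical paths and field names - substitute your own
original = r'C:\data\original_points.shp'
updates = r'C:\data\updated_points.shp'

# Record the last-edited date of every incoming record
incoming = {}
with arcpy.da.SearchCursor(updates, ['POINT_ID', 'LAST_EDIT']) as cursor:
    for key, edited in cursor:
        incoming[key] = edited

# Drop original records that the incoming data supersedes; a
# key-filtered append (as sketched earlier) then brings in the new
# and newer records without creating duplicates
with arcpy.da.UpdateCursor(original, ['POINT_ID', 'LAST_EDIT']) as cursor:
    for key, edited in cursor:
        if key in incoming and incoming[key] > edited:
            cursor.deleteRow()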

JoshuaBixby
MVP Esteemed Contributor

Not to nitpick on semantics, but duplicate and identical are synonyms.  If you are talking about a subset of fields, then which fields do you want to compare to determine whether a record should be updated/overwritten?  You have described your overall workflow goals, but you haven't described the dataset beyond it being points.  The more specifics you share, the more specific the responses.

TheoFaull
Occasional Contributor III

Of course. I haven't gone into details just yet, and I thoroughly appreciate Mitch's response. Anyway, there are 20 fields and 25,000 records. All the data in these records is subject to change, excluding a couple of ID fields.

What I think I really need is to delete the existing shapefile and replace it with the updated version. However, a LOCK file originating from the server (which accesses the shapefile) stops me from deleting the dataset. I need a script that will (a rough sketch follows the list):

1. Stop the Server GIS service

2. Delete Points.shp

3. Create an updated version of Points.shp in the same directory with the same name

4. Start the Server GIS service
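
For steps 1 and 4, ArcGIS Server 10.1 and later expose an Administrator REST API (your 10.0 server predates it, so this would only apply after an upgrade). A sketch against that API, with hypothetical server details, credentials, and service name:

import json
import urllib
import urllib2

# Hypothetical server, credentials, and service - substitute your own
admin = 'http://gisserver.example.com:6080/arcgis/admin'
service = 'Points.MapServer'

def post(url, params):
    """POST to the admin API and return the parsed JSON response."""
    return json.loads(urllib2.urlopen(url, urllib.urlencode(params)).read())

# Get an admin token
token = post(admin + '/generateToken',
             {'username': 'siteadmin', 'password': 'secret',
              'client': 'requestip', 'f': 'json'})['token']

# 1. Stop the service so the lock on Points.shp is released
post('{0}/services/{1}/stop'.format(admin, service),
     {'token': token, 'f': 'json'})

# 2. and 3. Delete and recreate Points.shp here, e.g. with
# arcpy.Delete_management() and arcpy.CopyFeatures_management()

# 4. Start the service again
post('{0}/services/{1}/start'.format(admin, service),
     {'token': token, 'f': 'json'})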

NeilAyres
MVP Alum

Generally, when updating data that is used by a service, we truncate the table and then append the new data. That doesn't seem to upset things. Deleting is a no-no.
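
In arcpy terms the pattern is roughly this (a sketch with hypothetical paths; the Truncate Table tool needs 10.1 or later and non-versioned data):

import arcpy

# Hypothetical paths - the service reads from the target dataset
source = r'C:\staging\nightly.gdb\Points'
target = r'C:\published\live.gdb\Points'

# Empty the published feature class in place; its schema (and the
# service's reference to it) stays intact, so no lock problems
arcpy.TruncateTable_management(target)

# Reload the fresh data; schemas match, so 'TEST' is safe
arcpy.Append_management(source, target, 'TEST')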

TheoFaull
Occasional Contributor III

Is that the 'Truncate Table' tool you use? And does your method work even if there's a LOCK file on the data?

NeilAyres
MVP Alum

Where is the data - fgdb or SDE?

Normally for publishing purposes (and not for editable features), we copy over from an enterprise system to a fgdb every night.

That uses truncate and append. Doesn't affect the services based on these features.

Not an expert on how all this works myself, just aware of the process.