Update cursor with joined tables work around w/ dictionaries

MathewCoyle · ‎04-12-2012

This post could easily be called "How I fell in love with dictionaries"

Drawing the idea from this post http://forums.arcgis.com/threads/52511-Cool-cursor-dictionary-constructor-one-liner

I've come up with a solution to a nagging problem that I have been having, and that I believe others have had as well: not being able to reliably use an update cursor on joined tables. I was really happy with my first foray into dictionaries, so I thought I'd share my workaround for anyone looking to optimize some tedious processing with joins. My data was ~900k rows of forest stand data in one table, plus a strata reference table of ~50 rows used to calculate volumes. My previous method (a permanent JoinField, processing, then deleting the joined fields) took approximately 3.5 hours. Temporary joins never worked for me in the manner I needed. Using dictionaries instead of joins reduced that time to under 15 minutes.

This code goes through any table and creates a list of field names for every field other than OID and the key field you want to reference.

Here is the fairly complete code to create the dictionary

    print "Starting function"

    # Define and setup variables, tables, key field etc
    calc_table = arcpy.MakeTableView_management(table_path)
    vol_tab = join_table_path
    strata_tab = "in_memory/temp"
    arcpy.MakeTableView_management(vol_tab, strata_tab)
    joinField = "STRATA"

    # Create list of value fields, leaving out OID field and key/join field
    flistObj = arcpy.ListFields(strata_tab)
    flist = []
    for f in flistObj:
        if f.type != "OID" and f.name != joinField:
            flist.append(f.name)

    # Create empty dict object then populate each key with a sub dict by row using value fields as keys
    strataDict = {}

    for r in arcpy.SearchCursor(strata_tab):
        fieldvaldict = {}
        for field in flist:
            fieldvaldict[field] = r.getValue(field)
        strataDict[r.getValue(joinField)] = fieldvaldict

    del strata_tab, flistObj
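
To make the shape of the result concrete, here is a minimal sketch that builds the same kind of nested dictionary from plain Python tuples instead of a cursor, so it runs without arcpy. Only the STRATA field comes from the code above; the other field names and all values are made up for illustration.

```python
# Stand-in for the strata reference table: field names plus a few rows.
# All field names except STRATA, and all values, are hypothetical.
fields = ["OBJECTID", "STRATA", "SW_STEMS", "PJ_STEMS"]
table_rows = [
    (1, "C1", 120.0, 35.5),
    (2, "D2", 80.0, 0.0),
]

join_field = "STRATA"
# Value fields: everything except the object ID and the join field.
value_fields = [f for f in fields if f not in ("OBJECTID", join_field)]

# Outer key = join value, inner dict = {field name: value} for that row.
strata_dict = {}
for row in table_rows:
    record = dict(zip(fields, row))
    strata_dict[record[join_field]] = dict(
        (f, record[f]) for f in value_fields
    )

print(strata_dict["C1"]["SW_STEMS"])  # 120.0
```

Each reference row becomes one small inner dictionary, so a lookup in the update loop is two key accesses rather than a join.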

In the update cursor you can then reference dictionary entries explicitly, like this:

    rows = arcpy.UpdateCursor(calc_table, "\"%s\" IS NOT NULL" % joinField)
    for row in rows:
        strata = row.getValue(joinField)
        variable = strataDict[strata]["sub_key_field"]
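
One thing to watch for with a direct lookup like this: if the main table contains a strata value that is missing from the reference table, `strataDict[strata]` raises a `KeyError` partway through the update. Below is a hedged sketch of a guarded lookup, using a plain dictionary with made-up values in place of the real tables.

```python
# Hypothetical lookup built from the reference table.
strata_dict = {"C1": {"SW_STEMS": 120.0}, "D2": {"SW_STEMS": 80.0}}

# Hypothetical strata values read from the main table; "X9" has no match.
main_table_strata = ["C1", "X9", "D2"]

matched = []
unmatched = []
for strata in main_table_strata:
    values = strata_dict.get(strata)  # returns None instead of raising KeyError
    if values is None:
        unmatched.append(strata)      # skip the row, or log it for later review
        continue
    matched.append((strata, values["SW_STEMS"]))
```

Whether to skip, log, or stop on an unmatched value depends on the data; the point is only that the choice is made explicitly.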

What I did instead was use a reference list to index into the dictionary, to keep things legible and so I could remember what went where. This may not be necessary for some people, but it helped me conceptually. Without getting into too much detail, here's essentially my update cursor, sans the actual calculations.

    species = [
        ("C","Fb","FB_STEMS"),("C","Sw","SW_STEMS"),("C","Pj","PJ_STEMS"), # 0,1,2
        ("C","Pl","PJ_STEMS"),("C","Lt","LT_STEMS"),("C","Sb","SB_STEMS"), # 3,4,5
        ("D","Bw","BW_STEMS"),("D","Aw","AW_STEMS"),("D","Pb","PB_STEMS")  # 6,7,8
    ]
    sp_fields = [("SP1","SP1_PER"),("SP2","SP2_PER"),("SP3","SP3_PER"),
                 ("SP4","SP4_PER"),("SP5","SP5_PER")]
    print "Beginning updates"
    rows = arcpy.UpdateCursor(calc_table, "\"%s\" IS NOT NULL" % joinField)
    for row in rows:
        strata = row.getValue(joinField)
        for sp, per in sp_fields:
            sp_type = row.getValue(sp)
            spp_f = float(row.getValue(per))
            if spp_f > 0:
                for grp, spec, stem in species:
                    stem_f = strataDict[strata][stem]
                    (...)

Hopefully that didn't get too convoluted. Does anyone else have anything to contribute in terms of optimization?

KimOllivier · ‎04-14-2012

I am also increasingly using dictionaries to update tables instead of a join.
I tried refactoring my clumsy lines to use the one-line list comprehension, but it turned out to be marginally slower.

222565 dictionary count 0:00:46.594000
222565 dictionary count 0:00:48.125000

I note that you do not bother to specify a subset of fields when opening the cursor. If you have a lot of fields it apparently helps a lot to only list the relevant fields for the calculations. Not so easy to generalise I suppose, but it may help with memory management too.
Has anyone done some tests on the 10.1 da module that has rewritten cursors? Maybe we will not need dictionaries after all.
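
On the point about listing only the relevant fields: the 10.1 `arcpy.da` cursors take a field list and return each row as a tuple in that order. Here is a minimal sketch of building the same kind of lookup from such rows. `build_lookup` and the sample field names are hypothetical, and the arcpy call appears only as a comment because it has not been run here.

```python
def build_lookup(rows, field_names, key_field):
    """Build {key value: {field: value}} from an iterable of row tuples.

    rows must yield tuples in the same order as field_names, which is
    how arcpy.da.SearchCursor returns rows for a given field list.
    """
    key_index = field_names.index(key_field)
    lookup = {}
    for row in rows:
        lookup[row[key_index]] = dict(
            (name, value)
            for name, value in zip(field_names, row)
            if name != key_field
        )
    return lookup


# With ArcGIS 10.1+ this might be called as (untested assumption):
#     fields = ["STRATA", "SW_STEMS", "PJ_STEMS"]
#     with arcpy.da.SearchCursor(strata_tab, fields) as cursor:
#         strataDict = build_lookup(cursor, fields, "STRATA")

# Plain tuples standing in for cursor rows, so the sketch runs anywhere.
fields = ["STRATA", "SW_STEMS", "PJ_STEMS"]
sample_rows = [("C1", 120.0, 35.5), ("D2", 80.0, 0.0)]
lookup = build_lookup(sample_rows, fields, "STRATA")
```

Because the cursor only reads the listed fields, the dictionary never holds columns the calculations do not use.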

RaphaelR · ‎05-29-2012

Thanks for this!
I had lots of trouble processing/updating joined tables; it took ages within ArcMap and didn't work at all with update cursors.
With your suggested dictionary route I've managed to get it working, and it really sped things up.

ChrisSnyder · ‎05-29-2012

Related to this post: http://forums.arcgis.com/threads/58348-Large-Dictionary-Compression, I am having troubles when the dictionaries get too big!

Although it's slower, especially for multiple fields, I am finding the ole' "Join and Calc" method is much more memory efficient.

MathewCoyle · ‎05-29-2012

Yes, I can imagine that when you get into storing multi-million-tuple datasets in memory in a 32-bit process, you're going to have a bad time. When I implemented mine, there were only ~50 rows to reference against the main table, which worked out quite well. I have another process with a 1:1 relationship on the 900k row dataset, for which I use a join and export process to run calculations. I hope Esri bites the bullet this decade and converts desktop to a 64-bit application. It's not like datasets or file complexity are shrinking.

Maybe, as a quick fix, Esri could develop some easier-to-use interfaces between desktop and server for submitting large geoprocessing jobs to server post-10.1, which uses 64-bit Python.
http://forums.arcgis.com/threads/54612-arcpy-is-using-32bit-Python-installation-how-about-64bit
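
On the memory side, one way to lower the per-row overhead of a large lookup is to store each row as a tuple and keep a single shared map from field name to tuple position, rather than one inner dictionary per row. This is only a sketch with hypothetical field names and values, and it does not remove the 32-bit ceiling; it just makes each entry smaller.

```python
import sys

fields = ["SW_STEMS", "PJ_STEMS", "SB_STEMS"]  # hypothetical value fields
# Shared map: field name -> position in each row tuple.
field_index = dict((name, i) for i, name in enumerate(fields))

# One dict per row (nested-dict approach) versus one tuple per row.
row_as_dict = {"SW_STEMS": 120.0, "PJ_STEMS": 35.5, "SB_STEMS": 10.0}
row_as_tuple = (120.0, 35.5, 10.0)

lookup = {"C1": row_as_tuple}

# Access by field name goes through the shared index map.
sw = lookup["C1"][field_index["SW_STEMS"]]

# Container overhead only; the float objects themselves are not counted.
dict_size = sys.getsizeof(row_as_dict)
tuple_size = sys.getsizeof(row_as_tuple)
```

The trade-off is readability: `lookup[key][field_index[name]]` is a little clumsier than `lookup[key][name]`.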

BruceBacia · ‎06-01-2012

It's funny I'm reading this post today.....I just switched one of my scripts from a join and select method to a dictionary method and processing time went down from 2 hours to 8 minutes. Long live the dictionary!

BruceBacia · ‎06-01-2012

Here's a neat, pythonesque way of removing unwanted field names. Not sure if it will be faster, but it looks cooler!

    flistObj = arcpy.ListFields(strata_tab)
    flist = [f.name for f in flistObj]
    # Assumes the object ID field is literally named 'OID'; joinField is the variable holding the key field name
    for exclude in ['OID', joinField]:
        flist.remove(exclude)

BruceBacia · ‎06-01-2012

I think this will work, too...even shorter

    flistObj = arcpy.ListFields(strata_tab)
    # Matches on field names, so this assumes the object ID field is literally named 'OID'
    exclude = ['OID', joinField]
    flist = [f.name for f in flistObj if f.name not in exclude]
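
For comparison, here is a sketch of a one-liner that keeps the original post's test on the field type, so it still works when the object ID field is named something like OBJECTID. A `namedtuple` stands in for the objects returned by `arcpy.ListFields` so the example runs without arcpy; the sample fields are made up.

```python
from collections import namedtuple

# Minimal stand-in for the objects returned by arcpy.ListFields;
# real Field objects have more properties than name and type.
Field = namedtuple("Field", ["name", "type"])

flistObj = [
    Field("OBJECTID", "OID"),
    Field("STRATA", "String"),
    Field("SW_STEMS", "Double"),
    Field("PJ_STEMS", "Double"),
]
joinField = "STRATA"

# Filter on the field type for the object ID, and on the variable
# (not the literal string 'joinField') for the join field.
flist = [f.name for f in flistObj
         if f.type != "OID" and f.name != joinField]
```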

PeterWilson · ‎10-05-2012

Hi Mathew

I came across your thread and hope that you are able to assist me to use python dictionaries to accomplish what I'm trying to do. Please note that I'm new to Python and would need some assistance to understand your code if you don't mind and have the time.

I have 7.5 million parcels saved as a feature class. Within the feature class I have a field called "SG_Code". I also have two tables called WARMS (i.e. WARMS_DW760 & WARMS_DW764). They each have fields called "SG_Code" & "TIT_DEED_NUM". I then have two additional tables called RED (i.e. Redistribution) and REST (i.e. Restitution). The RED and REST tables have two fields, "SG_CODE" and "TIT_DEED_NUM".

I need to create a subset feature class of the 7.5 million parcels where I find a match, firstly using the "SG_Code" between the parcels feature class and each WARMS table separately (i.e. WARMS_DW760, then WARMS_DW764). I then need to find a match between the original 7.5 million parcels and the RED and REST tables using the "SG_Code". Finally, for the parcels already matched to WARMS_DW760 and WARMS_DW764, I need to compare their "TIT_DEED_NUM" with the "TIT_DEED_NUM" found in the RED and REST tables to see if I find additional matches, as not all the records in the REST and RED tables have "SG_Codes".

In short, what I'm trying to accomplish is to identify where I find a match between the parcels and WARMS, then a match between the parcels and RED and REST.

I've used Add Joins so far to accomplish this, but it's running forever. I've attached the model that I've built so far, to show more clearly what I'm trying to accomplish.

Regards

ChrisSnyder · ‎10-06-2012

Peter - The basic method is shown here: http://forums.arcgis.com/threads/9555-Modifying-Permanent-Sort-script-by-Chris-Snyder?p=30010&viewfu...
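
As a rough illustration of how the dictionary approach discussed in this thread could apply to Peter's matching problem, here is a sketch using plain Python sets and dictionaries. The sample values are invented, the rows are assumed to have already been read from the tables with search cursors, and the matching rules are one reading of the description above, not a tested solution.

```python
# Hypothetical (SG_Code, TIT_DEED_NUM) pairs read from each table.
warms_rows = [("SG001", "T100"), ("SG002", "T200")]   # both WARMS tables
red_rest_rows = [("SG001", "T900"), (None, "T200")]   # RED and REST

# Hypothetical SG_Code values from the parcels feature class.
parcel_codes = ["SG001", "SG002", "SG003"]

# SG_Code -> set of title deed numbers found in WARMS.
warms_deeds = {}
for sg_code, deed in warms_rows:
    warms_deeds.setdefault(sg_code, set()).add(deed)

# Keys available for matching in RED/REST (some rows lack an SG_Code).
red_rest_codes = set(sg for sg, _ in red_rest_rows if sg is not None)
red_rest_deed_nums = set(d for _, d in red_rest_rows if d is not None)

results = {}
for sg_code in parcel_codes:
    in_warms = sg_code in warms_deeds
    # RED/REST match: directly on SG_Code, or via a WARMS deed number.
    by_code = sg_code in red_rest_codes
    by_deed = bool(warms_deeds.get(sg_code, set()) & red_rest_deed_nums)
    results[sg_code] = (in_warms, by_code or by_deed)
```

Set membership tests are constant time on average, so one pass over the parcels replaces the repeated joins. With 7.5 million parcels, only the much smaller lookup tables need to sit in memory.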