Large Dictionary Compression?

05-23-2012 10:34 AM
ChrisSnyder
Regular Contributor III
I have a simple dictionary like this:

exampleDict[123444556] = (1785, 2234544, 3545456, 165765.47654)

where all the keys are integers and the values are either integers or floats.

My issue is that I need to store/access about 20 million keys at a time, and I am running out of 32-bit memory. I'd rather do this in 32-bit Python since I need (or would like) access to arcpy for its FGDB table reading/writing abilities.

Anyone know of a way to somehow "compress" keys and/or values in a dictionary? I'm looking into the binascii module, and I see lots of methods to compress strings, but not ints or floats. Maybe you can't meaningfully compress these since they are already quite numeric?

Anyone ever do something like this?
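
For illustration, a minimal sketch of one packing approach: storing each value tuple as fixed-width bytes with the struct module instead of as a tuple of Python objects. This assumes every value is exactly three ints and a float; the names are made up. Note the keys stay as Python ints, so per-entry dict overhead remains - this only shrinks the value side.

import struct

# Pack each value into 20 fixed bytes: three 4-byte ints + one 8-byte double.
# (Assumes every value has exactly this shape; names are illustrative.)
fmt = '<iiid'

packed = {}
packed[123444556] = struct.pack(fmt, 1785, 2234544, 3545456, 165765.47654)

# Unpack on access:
a, b, c, d = struct.unpack(fmt, packed[123444556])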
8 Replies
JasonScheirer
Occasional Contributor III
If you're running out of memory, you're sort of out of luck because internally integers are already stored as space-efficiently as possible.

You might want to consider some other key-value store, such as anydbm, or even setting up a Redis server and talking to that from Python.
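
A minimal Python 2-era sketch of the anydbm route. dbm files store only strings, so the key is stringified and the numbers are packed with struct; the file name is a placeholder.

import anydbm
import struct

db = anydbm.open('segments.db', 'c')  # 'c' = create the file if missing

# dbm keys/values must be strings: stringify the key, pack the numbers.
key = str(123444556)
db[key] = struct.pack('<iiid', 1785, 2234544, 3545456, 165765.47654)

i1, i2, i3, f1 = struct.unpack('<iiid', db[key])
db.close()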
ChrisSnyder
Regular Contributor III
Thanks Jason - I'll look into those...
ChrisSnyder
Regular Contributor III
Jason, after looking at that stuff... Hmmm - seems a bit over my head, I think.

But my workaround solution (not working quite 100% yet) is to just:
1. Export the FGDB tables to .txt format (thankfully the .txt versions are < 2 GB!).
2. Call 64-bit python.exe as a subprocess (which actually seems to work; see the sketch below this list).
3. Have that 64-bit python.exe process read the "tables" (.txt files) into dictionaries, do the analysis, and write the results out to .txt format.
4. Back in 32-bit "arcpy-compliant" Python land, read the analysis .txt table back into FGDB table format, and then *** big inhale *** proceed with the rest of the script.
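
A minimal sketch of step 2, assuming a default-style 64-bit install location (both paths below are placeholders):

import subprocess

py64 = r'C:\Python27x64\python.exe'  # assumed 64-bit Python location
script = r'C:\temp\do_analysis.py'   # hypothetical heavy-lifting script

# Launch the 64-bit interpreter from 32-bit Python and wait for it.
ret = subprocess.call([py64, script])
if ret != 0:
    raise RuntimeError('64-bit analysis step failed (exit code %d)' % ret)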

Here's to 64-bit :cool: and the hope that we may have a 64-bit version of ArcGIS some day!
JasonScheirer
Occasional Contributor III
Nice! Glad you got something working. 10.1 Server will be 64-bit out of the box.
KimOllivier
Occasional Contributor III
What about using SQLite inside Python? It might manage the data better, and you can run a SQL query to do the matching instead of a dictionary.
SQLite is built into Python, and there are no 2 GB size limits. Does it load everything into memory?

Attached is an example using SQLite to find duplicates in a large database where python dictionaries overflowed. (Not written by me)
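
For reference (not the attachment itself), a minimal sketch of the idea - SQLite pages data from disk rather than loading it all into memory, and the table/column names here are illustrative:

import sqlite3

conn = sqlite3.connect('segments.sqlite')  # on disk, not RAM-bound
cur = conn.cursor()
cur.execute('CREATE TABLE seg (key INTEGER PRIMARY KEY, '
            'a INTEGER, b INTEGER, c INTEGER, d REAL)')
cur.execute('INSERT INTO seg VALUES (?, ?, ?, ?, ?)',
            (123444556, 1785, 2234544, 3545456, 165765.47654))
conn.commit()

# The kind of duplicate-finding query the attached example performs:
for row in cur.execute('SELECT a, COUNT(*) FROM seg '
                       'GROUP BY a HAVING COUNT(*) > 1'):
    print(row)
conn.close()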
ChrisSnyder
Regular Contributor III
That looks very interesting, Kim, although I don't have much hardcore SQL skill...

I think for my purposes I will stick with my 64-bit Python subprocess solution... I am using these large dictionaries to traverse/trace segments of a stream network, and speed is very critical since there are so many features involved - eventually there will be hundreds of millions of features. I am comfortable writing my own code in Python to emulate fancy SQL-type stuff using dictionaries, and I basically see dictionaries as a great and flexible format for creating my own RDBMS with whatever "custom" features I can dream up. I am amazed at the speed of these hash table-type structures - and it seems that the code you supplied uses some sort of formal SQL hash functionality (of which, sadly, I am totally ignorant!) - very cool.
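
For illustration, a minimal sketch of the kind of dict-based trace described (the data and names are made up; assumes the network has no cycles):

# Map each segment ID to its downstream neighbor, then walk the chain.
downstream = {101: 102, 102: 103, 103: 104}

def trace(seg_id, network):
    path = [seg_id]
    while path[-1] in network:  # follow links until the chain ends
        path.append(network[path[-1]])
    return path

print(trace(101, downstream))  # [101, 102, 103, 104]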
Luke_Pinner
MVP Regular Contributor
You could also take a look at the shelve module - http://docs.python.org/library/shelve.html - it provides a filesystem-based, dict-like class. Though as it's filesystem based, it will probably be slower than your 64-bit Python subprocess method.
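
A minimal sketch of the shelve route (keys must be strings, hence the str() call; the file name is a placeholder):

import shelve

db = shelve.open('segments.shelf')  # dict-like object backed by a file
db[str(123444556)] = (1785, 2234544, 3545456, 165765.47654)  # values are pickled

print(db[str(123444556)])
db.close()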
ChrisSnyder
Regular Contributor III
Decided to finally install and test out the new 64-bit geoprocessing upgrade for 10.1 SP1. Works like a charm (except for the whole 32-bit exceptions thing, but that's okay and understandable... I never liked PGDB anyway!). Note the RAM usage in the attached screenshot (~27 GB max in use). So I can now have my huge Python dictionaries and eat arcpy too. I bet this was Jason S.'s idea - thanks for implementing :).

[Attachment: screenshot showing ~27 GB of RAM in use]