ListFeatureClasses bugged?

04-13-2013 10:23 PM
RyanSchuermann
New Contributor
I'm really stumped...
simple code here....not working....

import arcpy
arcpy.env.workspace = "C:/Users/Ryan/Routes.gdb"
shplist = arcpy.ListFeatureClasses()
print shplist

my Routes.gdb has ~9,000 shapefiles in it. Each shapefile is very small, maybe 3 to 50 line features (a route). The list returns empty...the output is []

If I run this exact code above but point it to a Routes.gdb with only 25 shapefiles, it returns a correct list. This smaller gdb was generated from the exact same code that generated the large gdb, just run against a very small sample of input data.

I know the ~9,000 shapefiles are in Routes.gdb because I can see them in ArcCatalog 10.1...it just takes 30 minutes for them to load in the Catalog Tree, or in the "select input datasets to Merge" window.

I even tried getting a subset of the files in Routes.gdb by using a mask, shplist = arcpy.ListFeatureClasses("Route0*"), but still no go...yet it works with the small gdb.

I thought file geodatabases were supposed to be able to handle large datasets? I could even understand if the merge later failed by hitting the 2GB limit I read about everywhere, but I can't even get there without a file list...the file list should not fail...at least I have yet to see anyone post that it has a limit.

The only thing I can think of is that the list is too long, in which case selecting a subset of ~300 files should be no problem, but that fails too as mentioned above.

I'm going to try to wade through the lag that's encountered when using Merge from ArcCatalog, but is that really the only way to merge large numbers of small shapefiles? I would rather NOT have to leave Python just to do this and then go back to Python, forcing me to run multiple programs instead of one nice, clean, fully functional program.
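Ideally I'd like to stay in arcpy and merge in chunks, something like this rough sketch (the name pattern and chunk size here are made up, and it assumes a name list can be produced somehow, which is exactly the part that's failing):

import arcpy

arcpy.env.workspace = "C:/Users/Ryan/Routes.gdb"

# hypothetical name list - assume it came from ListFeatureClasses() or was
# rebuilt from the naming scheme used when the feature classes were created
fcs = ["Route%05d" % n for n in range(300)]

chunkSize = 50
partials = []
for i in range(0, len(fcs), chunkSize):
    outFc = "merged_%04d" % (i / chunkSize)          # integer division (Python 2)
    arcpy.Merge_management(fcs[i:i + chunkSize], outFc)
    partials.append(outFc)

# one final Merge of the partial results
arcpy.Merge_management(partials, "Routes_merged")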

what am I doing wrong, or could improve on? thanks!

I'm running this on a server-grade system, octo-core 2.66 GHz w/ 20 GB ECC FB-DIMMs, so it's not a wimpy-system issue.


...and YES 🙂 in hindsight (and I've already changed the code to do so) I should have been merging them in small chunks as I generated the files...but it's too late now. This process took over 100 hours to run non-stop (running on a RAMDISK) and I don't have time to run it again.
KimOllivier
Occasional Contributor III
I would think that the designer never expected a file geodatabase would have 9000 featureclasses. So it is likely that a buffer is overflowing.

The obvious question is: Why are you modelling your problem with a separate featureclass per route? I regularly build a route system in a single featureclass with 200,000 separate routes, one for each road name. It performs normally.

How about building the routes differently?

I cannot conceive of any single process that takes longer than a cup of coffee to complete. If it takes longer than that, then interrupt the process and find a better way!
I am confident that you could find a way to complete the same work in a few minutes.

Do you have indexes for key fields?
Are you running geoprocessing tools inside a cursor looping around each feature?
Are you processing single routes when you could run a tool that processes them all in one step?
RyanSchuermann
New Contributor
Kimo, thanks for the reply.

Actually my math was horribly off, it's closer to 20,000 shapefiles in the geodatabase.

Python can easily hold a list much bigger than 20k, so why ListFeatureClasses is not working is really my question. I looked and searched for any limit on the function's return list but couldn't find any information. Considering it's a most basic database I/O call, accessing the shapefiles and returning them in a list, even 20 trillion of them should be no issue.

To address your question, yes, there was absolutely a better way to write the code to avoid 20k files; in actuality I could have achieved the results while generating 0 files. But I already ran it for the past 4 days, and the results I need to calculate frequency are there...so I was really hoping to simply iterate through all of the files and calculate frequency. I didn't add that code at the end, well, because I just didn't realize solving an OD matrix of that size would present an accessibility issue.

I was under the impression that the students requesting the data actually needed the individual streets per route for some analysis, when in fact they really only needed the frequency of each street as it appears in the OD matrix solve, but I didn't 'get that' until the next week.

I gave up trying to access the data in the geodatabase and am spending another 2 days re-solving the 20,000 routes, spatially selecting the streets associated with each route, and pushing each street's unique ID to a list, which is easily analyzed for frequency with Counter and then joined back to the Streets. No saving of the routes/streets is actually necessary 🙂
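For anyone curious, the core of the new approach is roughly this sketch (the paths, field names and layer names are placeholders, not my actual schema):

import arcpy
from collections import Counter

streets = "C:/Users/Ryan/Streets.gdb/Streets"      # placeholder path and field name
arcpy.MakeFeatureLayer_management(streets, "streets_lyr")
freq = Counter()

# inside the per-route loop, after one route has been solved and exported:
routeFc = "C:/Users/Ryan/Routes.gdb/Route00001"    # placeholder for the current route
arcpy.SelectLayerByLocation_management("streets_lyr", "SHARE_A_LINE_SEGMENT_WITH", routeFc)
with arcpy.da.SearchCursor("streets_lyr", ["STREET_ID"]) as rows:
    freq.update(sid for (sid,) in rows)

# after all routes, freq maps STREET_ID -> number of routes using that street,
# which gets written to a table and joined back to the Streets feature class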

But hey, at least now I have a program that creates polylines of streets along OD Matrix solved routes, and frequency tables!




However, I don't know of a better way to acquire the information that is required, because ArcGIS does not do route analysis for an OD matrix (it only provides a cost report), and I want to show/analyze all of the streets that are involved in, say, a 70x300 OD matrix for (a very large city).
KimOllivier
Occasional Contributor III
Kimo, thanks for the reply.


However, I don't know of a better way to acquire the information that is required, because ArcGIS does not do route analysis for an OD matrix (it only provides a cost report), and I want to show/analyze all of the streets that are involved in, say, a 70x300 OD matrix for (a very large city).


You seem to be reinventing the available tools in a script. If you can think of a workflow that processes whole featuresets in one step you can avoid stepping through each feature. Think of the processes like an SQL query. In a relational database you do not have a loop to iterate over each row. The same applies to featureclasses; they are supposed to be "geo-relational".

It takes a while to get out of the "pilot mindset" - think of a solution for one feature, then wrap a loop around that. Instead, think in a more global, generalised way about the relationships between features and find a tool to do it in one step.
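To make that concrete, here is a tiny sketch of the two mindsets for counting how often each street turns up (the table and field names are invented; Frequency needs an Advanced licence, and Statistics with a COUNT statistic is the alternative at lower levels):

import arcpy

streetsUsed = "c:/data/demo.gdb/streets_used"   # invented: one row per street per route

# "pilot mindset": step through every row and count by hand
counts = {}
with arcpy.da.SearchCursor(streetsUsed, ["STREET_ID"]) as rows:
    for (sid,) in rows:
        counts[sid] = counts.get(sid, 0) + 1

# geo-relational mindset: one tool call over the whole table
arcpy.Frequency_analysis(streetsUsed, "c:/data/demo.gdb/street_frequency", ["STREET_ID"])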

If any process takes longer than a cup of coffee, interrupt it and find a better way!
I have never had to run a model that takes longer than drinking that cup of coffee; after all, you have to do many trials.
RyanSchuermann
New Contributor
right....

...back to the originally posted issue. I'm still clueless as to why the arcpy function ListFeatureClasses() returns an empty list when trying to access a file geodatabase containing 20,000 shapefiles.

And again, it's not raising an error; the list is literally empty, not populated.

I'm wondering if it's me, or the function, and if there is a functional limit, what would be some ways around it?
KimOllivier
Occasional Contributor III
Ok, there must be a limit somewhere, and it won't be the length of a python list.

Have you tried making a list of one? That is, instead of a default, put in a wild card that is an exact file name.
arcpy.env.workspace = "path_to_gdb" # essential
print arcpy.env.workspace # just to check

listOfOne = arcpy.ListFeatureClasses("thefirstfile")
print listOfOne
If this works, then maybe you could make a wild card listing of a subset
listOf2k = arcpy.ListFeatureClasses("image1*")
print len(listOf2k)

Second suggestion: if you already know what the names are, just use a python list to process them. No need to get a list at all. E.g. suppose they are image00000 to image19999 (this really hurts because there should not be so many):
for n in range(20000):
    fc = "image%05d" % n        # zero-padded to match names like image00000
    if arcpy.Exists(fc):
        process(fc)
    else:
        print "Oops, guessed wrong", fc


Maybe arcpy.Describe would be a workaround; this has a children property:
desc = arcpy.Describe(gdb)
lstChildren = desc.children
print len(lstChildren), "children in", gdb, "from", lstChildren[0].name, "to", lstChildren[-1].name
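If that works, you could presumably filter the children down to just the featureclasses (a sketch; it assumes each child exposes name and dataType):

fcNames = [child.name for child in desc.children if child.dataType == "FeatureClass"]
print len(fcNames), "featureclasses via Describe, first few:", fcNames[:3]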



Maybe you could split the data into multiple databases of 5000 featureclasses?

It would help if your terminology was precise; computers like that. You cannot have shapefiles in a geodatabase, they are featureclasses. Do you really have a folder of shapefiles or a file geodatabase of featureclasses? Or is it really a personal geodatabase in an Access database?

If they are really shapefiles, that might explain a lot. You could use the glob module to make a list.
import glob
listOfShapefiles = glob.glob("*.shp") # run from the folder, or prefix a path

Perhaps you could install PostGIS, which, being an enterprise-class database, will not have the same limits.

IMHO you haven't got a large geodatabase, just a dysfunctional one. A large database is something with one featureclass greater than 2 GB. Why would you persevere with something that takes 30 minutes to open in ArcCatalog? Surely this calls for a drastic rethink of your schema and workflow?

Of course you can reconfigure your problem to run in the time it takes to drink a cup of coffee.
My suggestion is to abandon your 4 days of processing and work out a better way over a cup of coffee. You must be making a fundamental mistake. I have never run a process for 4 days and would not consider it a solution if it took that long.
RyanSchuermann
New Contributor
Yes..yes. sorry to be so confusing.

I have a file geodatabase.
It contains 20,526 feature classes.
Each feature class is small, containing line features that correspond to a specific Network Analyst solve (origin to destination).

Again, can we please get past the issue of what I should or should not have done programmatically to reduce the number of feature classes? I've already fixed that for future runs 🙂 But what's done is done and I am trying to access the geodatabase, just like someone else who may be dealing with a file geodatabase with 20k+ feature classes...which is the real issue here.

The total size of the gdb is ~20GB

As stated in my first post, setting the env to the gdb and trying to get a list of the feature classes fails. I tried using a mask to isolate (return a list of) 311 of them, and that also failed.

Yes, I could force-create a list of the feature class names, because I created/named them...but...what if the feature class names in the gdb are unknown (or random)? That really isn't an optimal, repeatable solution 🙂 A good idea for this specific situation though!

I'll take a look at Describe, thanks!
ChrisSnyder
Regular Contributor III
Don't know if it would help you in this case, but in the past I have made a "look up" reference table that includes the road segment IDs that each of my routes traverse... that way I can just make a "view" of my routes for summary/analysis purposes on demand instead of keeping the actual spatial data for each route (which after a point becomes impractical). The routes are composed of the original road segments anyway, so it calls into question the need to store redundant spatial data.
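Something like this rough sketch is what I mean - all of the names and the sample data here are made up:

import arcpy

# made-up workspace and a table with one row per (route, road segment) pair
arcpy.CreateFileGDB_management("C:/temp", "routes_demo.gdb")
gdb = "C:/temp/routes_demo.gdb"
arcpy.CreateTable_management(gdb, "route_segments")
arcpy.AddField_management(gdb + "/route_segments", "ROUTE_ID", "LONG")
arcpy.AddField_management(gdb + "/route_segments", "SEGMENT_ID", "LONG")

# pretend output of the OD solve loop: (route id, ids of the road segments it traverses)
solvedRoutes = [(1, [101, 102, 103]), (2, [102, 104])]

with arcpy.da.InsertCursor(gdb + "/route_segments", ["ROUTE_ID", "SEGMENT_ID"]) as cur:
    for routeId, segIds in solvedRoutes:
        for segId in segIds:
            cur.insertRow((routeId, segId))

# segment frequency then falls out of one Frequency/Statistics call on this table,
# and a route "view" is just a query/join - no per-route featureclasses needed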

Per your original bug assertion, here's some code to test where arcpy.ListFeatureClasses breaks... I got up to 5000+ FCs listed before I killed it... I was using the in_memory workspace for speed's sake, though... FGDB is way slower.

import arcpy
arcpy.env.overwriteOutput = True
arcpy.env.workspace = arcpy.env.scratchGDB #I used in_memory instead to test it
loopCount = 1
fcName = "fc_" + str(loopCount)
arcpy.CreateFeatureclass_management(arcpy.env.workspace, fcName, "POINT")
fcListCount = 1
while fcListCount > 0: # stop as soon as ListFeatureClasses() comes back empty
    loopCount = loopCount + 1
    fcName = "fc_" + str(loopCount)
    arcpy.CreateFeatureclass_management(arcpy.env.workspace, fcName, "POINT")
    fcListCount = len(arcpy.ListFeatureClasses())
    print "At loop count = " + str(loopCount) + " there were " + str(fcListCount) + " featureclasses listed..."
KimOllivier
Occasional Contributor III
Yes..yes. sorry to be so confusing.


Again, can we please get past the issue of what I should or should not have done programmatically to reduce the number of feature classes? I've already fixed that for future runs 🙂 But what's done is done and I am trying to access the geodatabase, just like someone else who may be dealing with a file geodatabase with 20k+ feature classes...which is the real issue here.

The total size of the gdb is ~20GB


I am trying to help you solve the original problem, not just get you down from being stuck up the wrong tree. I think you should abandon the 100 hours of effort and reconsider your approach. You have clearly found some limits with your approach. I think it is because the software was not designed to handle a problem split into individual featureclasses.

Here is an example that is close to your problem. It is an OD matrix of 13,000 routes. It was done by loading all the points into one layer and solving for all routes at once. The processing time took less than one hour. The file geodatabase is 112MB.
I didn't even have to write a script. I have access to all the routes with from and to IDs, a unique name for each route and the total length. What else do you need?
RyanSchuermann
New Contributor
Don't know if it would help you in this case, but in the past I have made a "look up" reference table that includes the road segment IDs that each of my routes traverse... that way I can just make a "view" of my routes for summary/analysis purposes on demand instead of keeping the actual spatial data for each route (which after a point becomes impractical). The routes are composed of the original road segments anyway, so it calls into question the need to store redundant spatial data.


...yes, that's exactly what I modified my code to do 🙂 But it still took 48 hours to process 20k routes (one has to consider the size and complexity of my network dataset to understand why it took so long). And again, I could have reduced that to 1/8th by using all of my processor cores, but I have other things to do while it runs.


Per your original bug assertion, here's some code to test where arcpy.ListFeatureClasses breaks... I got up to 5000+ FCs listed before I killed it... I was using the in_memory workspace for speed's sake, though... FGDB is way slower.


I was thinking of doing the same kind of test when I have 'free' time...5k is a far cry from 20k, and I'm not sure if feature class name length impacts anything...so yes, one day I'll make a loop to create feature classes in increments of 5k (a run with short names and another with long names), sic ListFeatureClasses on the gdb...and see if it ever fails. If not...well...then I'll still be clueless as to why it returned an empty list on mine (which was deleted a long time ago).

thanks!