How to directly access rows by objectid

06-01-2012 11:31 AM
TomMoore
New Contributor
I am just starting to get familiar with the FGDB-API, and I have a question about programming technique.

I have a large data set (3 million records), and I am interested in read-only access to features (rows) in a non-sequential manner.  My processing will quite likely traverse all of the rows, but in a non-linear order and some rows will be visited more than once.   To traverse all records I know I can do

  sqlStatement.assign(L"SELECT * FROM mytable");
  if ((hr = geodatabase.ExecuteSQL(sqlStatement, true, attrQueryRows)) != S_OK)

and then process through the EnumRows result object.  However, EnumRows only has a forward-only method. If I need to revisit rows then I will have to do another query (perhaps querying for the row by objectid).  I am thinking that this approach is going to be quite slow since the query will have to be prepared and executed for each row retrieval. 
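For concreteness, the kind of per-row re-query I have in mind would look something like this (a rough sketch; mytable is just the placeholder table from my example above):

  #include <sstream>
  #include <FileGDBAPI.h>
  using namespace FileGDBAPI;

  // Fetch a single row by its OBJECTID with a fresh query each time.
  // ('mytable' is a placeholder table name.)
  fgdbError FetchRowByObjectID(Geodatabase& geodatabase, int objectId, Row& row)
  {
      std::wostringstream sql;
      sql << L"SELECT * FROM mytable WHERE OBJECTID = " << objectId;
      EnumRows rows;
      // Recycling is off so the row stays valid after the enumeration closes.
      fgdbError hr = geodatabase.ExecuteSQL(sql.str(), false, rows);
      if (hr != S_OK)
          return hr;
      hr = rows.Next(row);   // S_OK if the row was found
      rows.Close();
      return hr;
  }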

Is there a better way to do this?  Row caching is probably not a great idea since the table is quite large.

Any suggestions greatly appreciated.
8 Replies
VinceAngelo
Esri Esteemed Contributor
You probably want to write your own cache, storing rows in bunches of 200-1000 -- then your query
preparation time will be drastically reduced.  If you collect and log efficiency statistics, you can scale
memory use to the sweet spot in cache size. 

Overall efficiency might be greatly improved by spatially organizing the records -- use a grid cell
size that permits at least 9 neighboring tiles to be cached at once.

It might take a couple of days to write and validate a cache, but you're likely to gain that back
in savings from only a few hundred passes through the search algorithm.

- V
DavidSousa
New Contributor III
I think you will have to re-query for each objectID.  However, the overhead of preparing a new query each time will probably not be excessive in this case.
VinceAngelo
Esri Esteemed Contributor
So I invested a little hobby-time in this question, to determine the benefit
of caching on runtime performance...

First, I adapted some tools and created a benchmark file geodatabase containing
fifteen tables, in three sizes of five layouts.  The first layout contained a
sequential list, where row 1 pointed to row 2, row 2 to row 3, and so
on until the last row, which pointed back to row 1:
       
        1       2
        2       3
        ...     N
        N       1


The second layout was in reverse order:
       
        1       N
        2       1
        ...     ...
        N       N-1


The third layout was in random order:
       
        1       7
        2       4
        3       6
        4       5
        5       1
        6       2
        7       3


The fourth and fifth layouts were also in random order, but clustered so that
the selection window was limited to twenty-five or nine neighboring cells, to
capture the impact of blocking the data in 200-row chunks.  Each layout was
generated with 10000 rows (s1), 100000 rows (s2), and 1 million rows (s3),
and merged with 40 additional columns to reflect ancillary data in the tables.

Then I whipped up a benchmarking C++ application, which operates in three
modes:

SEQ     Sequential read of table contents, without regard for the "next" row
ID      Simple row seeking, one query per row
RANGE   Blocked row seeking, one query per block of configurable size, with an
        optional number of cached blocks


The cache uses a "least recently used" (LRU) replacement strategy, so that
each block that is used is promoted to the front of the cache, and the last
block's contents are replaced if no cache hit was found.
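The cache itself is nothing exotic -- something along these lines, though this
is a simplified sketch rather than the actual benchmark code (CachedRow stands
in for whatever you keep per row):

#include <cstddef>
#include <list>
#include <map>
#include <utility>
#include <vector>

// Stand-in for whatever per-row data the application caches.
struct CachedRow { int id; };

class BlockCache
{
public:
    BlockCache(int blockSize, size_t maxBlocks)
        : blockSize_(blockSize), maxBlocks_(maxBlocks) {}

    // First id of the block containing 'id' (1-based ids).
    int BlockStart(int id) const
    { return ((id - 1) / blockSize_) * blockSize_ + 1; }

    // Returns the cached block holding 'id', or NULL on a miss;
    // a hit promotes the block to the front of the LRU list.
    const std::vector<CachedRow>* Find(int id)
    {
        Map::iterator it = map_.find(BlockStart(id));
        if (it == map_.end())
            return 0;                                      // miss
        lru_.splice(lru_.begin(), lru_, it->second.first); // promote
        return &it->second.second;
    }

    // Adds a freshly loaded block (call Find first so a block is never
    // inserted twice), evicting the least recently used block when full.
    void Insert(int blockStart, const std::vector<CachedRow>& rows)
    {
        if (maxBlocks_ > 0 && lru_.size() >= maxBlocks_)
        {
            map_.erase(lru_.back());                       // evict LRU
            lru_.pop_back();
        }
        lru_.push_front(blockStart);
        map_[blockStart] = std::make_pair(lru_.begin(), rows);
    }

private:
    typedef std::list<int> Lru;   // block starts, most recent first
    typedef std::map<int,
        std::pair<Lru::iterator, std::vector<CachedRow> > > Map;
    int    blockSize_;
    size_t maxBlocks_;
    Lru    lru_;
    Map    map_;
};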

The benchmarking tool reports the following information:
A - table name (tN_sM)
B - mode
C - number of columns*
D - size of cache blocks (in rows)
E - number of blocks in the cache
F - number of rows read
G - number of row.Get{datatype} executions
H - number of table.Search invocations
I - query time (time spent in table.Search and enumRows.Next)
J - get IO time (time spent in row.Get{datatype})
K - elapsed time (total time to execute)


Then I started benchmarking on my home PC (6x2400MHz AMD, 8GB RAM,
1000GB @ 7200RPM disk running CentOS 5.8), by just measuring the time
it took to read tables of various sizes:

A         B    C   D  E F       G       H       I       J       K
t1_s1    SEQ   3   0  0 10k     30k     1       0.0ms   17.7ms  25.81ms
t1_s2    SEQ   3   0  0 100k    300k    1       0.0ms   162.4ms 171.56ms
t1_s3    SEQ   3   0  0 1m      3m      1       0.0ms   1.6sec  1.58sec


Analysis: So far, so good -- sequential query time is linear (in fact, I got the
same performance from a solid-state disk, so the OS is caching all disk
access).

A         B    C   D  E F       G       H       I       J       K
t1_s1     ID   3   0  0 10k     30k     10k     2.4sec  22.8ms  2.46sec
t2_s1     ID   3   0  0 10k     30k     10k     2.4sec  22.3ms  2.47sec
t3_s1     ID   3   0  0 10k     30k     10k     2.4sec  23.2ms  2.46sec
t4_s1     ID   3   0  0 10k     30k     10k     2.4sec  22.2ms  2.45sec
t5_s1     ID   3   0  0 10k     30k     10k     2.4sec  23.1ms  2.45sec

t1_s2     ID   3   0  0 100k    300k    100k    24.1sec 223.8ms 24.35sec
t2_s2     ID   3   0  0 100k    300k    100k    24.1sec 232.8ms 24.31sec
t3_s2     ID   3   0  0 100k    300k    100k    24.4sec 232.6ms 24.60sec
t4_s2     ID   3   0  0 100k    300k    100k    24.0sec 221.0ms 24.21sec
t5_s2     ID   3   0  0 100k    300k    100k    24.1sec 219.7ms 24.36sec

t1_s3     ID   3   0  0 1m      3m      1m      4.0min  2.3sec  4.02min
t2_s3     ID   3   0  0 1m      3m      1m      4.1min  2.3sec  4.13min
t3_s3     ID   3   0  0 1m      3m      1m      4.0min  2.3sec  4.02min
t4_s3     ID   3   0  0 1m      3m      1m      4.0min  2.2sec  4.05min
t5_s3     ID   3   0  0 1m      3m      1m      4.0min  2.3sec  4.02min


Analysis: Again the execution time is linear (240 seconds = 4 minutes),
but almost all of the time is being spent in query.

So what's the impact of adding a cache?

A         B    C   D  E F       G       H       I       J       K
t1_s1   RANGE  3 200  1 10k     30k     50      754.6ms 16.1ms  778.98ms
t2_s1   RANGE  3 200  1 10.2k   30.6k   51      0.8sec  17.0ms  0.82sec
t3_s1   RANGE  3 200  1 1.96m   5.87m   9.8k    2.6min  3.3sec  2.64min 
t4_s1   RANGE  3 200  1 655k    1.97m   3.3k    49.9sec 1.0sec  50.91sec
t5_s1   RANGE  3 200  1 873k    2.62m   4.4k    1.1min  1.4sec  1.12min

t1_s1   RANGE  3 200  9 10k     30k     50      799.2ms 16.3ms  0.82sec
t2_s1   RANGE  3 200  9 10.0k   30.1k   51      767.8ms 16.0ms  791.88ms
t3_s1   RANGE  3 200  9 1.62m   4.87m   8.1k    2.1min  2.6sec  2.13min 
t4_s1   RANGE  3 200  9 19k     58k     97      1.5sec  32.0ms  1.52sec
t5_s1   RANGE  3 200  9 10k     30k     50      762.1ms 16.4ms  781.71ms

t3_s1   RANGE  3 200 15 1.38m   4.15m   6.9k    1.8min  2.3sec  1.79min
t4_s1   RANGE  3 200 15 10k     30k     50      764.2ms 16.3ms  783.65ms
t5_s1   RANGE  3 200 15 10k     30k     50      770.3ms 15.5ms  788.86ms

t3_s2   RANGE  3 200 15 1.94m   5.81m   96.7k   191min  32.9sec 191.02min
t4_s2   RANGE  3 200 15 100k    300k    500     57.3sec 157.2ms 57.44sec
t5_s2   RANGE  3 200 15 100k    300k    500     57.7sec 160.4ms 57.91sec

t3_s1   RANGE  3 200 25 984.2k  2.95m   4.9k    1.3min  1.7sec  1.31min 
t4_s1   RANGE  3 200 25 10k     30k     50      745.9ms 16.5ms  765.51ms
t5_s1   RANGE  3 200 25 10k     30k     50      780.5ms 16.8ms  0.80sec

t3_s2   RANGE  3 200 25 18.9m   56.8m   94.7k   182min  30.2sec 182.61min
t4_s2   RANGE  3 200 25 100k    300k    500     55.5sec 166.1ms 55.71sec
t5_s2   RANGE  3 200 25 100k    300k    500     56.9sec 163.8ms 57.05sec


Analysis: Here's where it starts to get tricky -- when the data is tightly clustered
(t1/t2), using even one buffer block greatly reduced the query time (2/3 less),
but multiple buffers were not of much benefit (due to increased cache search
time for misses).  When the data was random, small numbers of cache
blocks hurt performance (64x more!), but as the randomness was reduced or
the cache size increased, caching started to deliver the same benefit over
singleton queries.

Summary:
You're looking at something on the order of 12 minutes of geodatabase
I/O time to visit all rows in a 3 million row table once (assuming a relatively
new computer with adequate RAM and fast disk).  That time could be reduced
to ~4 minutes, but only if you do some preprocessing on the data to make sure
it's clustered**, or increase the cache size (block size before number of
blocks) until you get adequate performance (actually, you're probably better
off giving it all the RAM you can, then backing off until it starts to choke).

*I wanted to research the impact of additional columns in the cache, but
ran out of time.

**One possible metric to determine how clustered your data is would be to
walk the From-To pairs, and calculate a mean distance -- the closer to 1/2
the table length, the more fragmented it is.  If you choose a block size of 1/3
the mean distance and nine blocks, you'll probably get good performance.
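In code the metric is trivial (a sketch -- extracting the From-To pairs from
your table is up to you):

#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Mean |to - from| over all From-To pairs; values approaching half the
// table length indicate heavy fragmentation.
double MeanJumpDistance(const std::vector<std::pair<int,int> >& pairs)
{
    if (pairs.empty())
        return 0.0;
    double total = 0.0;
    for (size_t i = 0; i < pairs.size(); ++i)
        total += std::abs(pairs[i].second - pairs[i].first);
    return total / pairs.size();
}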

- V
TomMoore
New Contributor
Wow.  Thanks for the excellent suggestions and analysis!  There are lots of great ideas there. 

I am very impressed by the times that you reported for the sequential access, especially for the s3 dataset.  The 1.58 second time to read 1 million rows is especially fast.  As you say, perhaps the entire table was hot within the OS page cache, so no actual IO was required.  My tables will certainly have more than 3 columns, and I will be reading feature data, so I doubt that I'll have this benefit.

In the ID scan the times seem a bit more in line with what I was expecting.  I guess that most of the 4 minutes of query time was spent just parsing and executing the query, versus the actual IO involved in fetching records.  After all, the actual table should still be within the OS cache.  For my 3 million record query this will add up to 12 minutes.  This is the overhead that I was trying to avoid, but maybe it is just the price that has to be paid.

How did you implement the RANGE blocked row seeking method?  I am unclear about what was happening here.  As I understand it, you have restricted the randomization of pointers to the next record to increase the locality, so that they will not point to records beyond the 3x3 (t4) or 5x5 (t5) window of neighbouring cells.  I imagine that you had updated a column in each row to indicate the cell it belonged to.  When a row was needed and not in the LRU cache, then all rows from that cell were read and added to the cache.  The query used would select all records in the cells of interest.  Is this about right?

For the application that I am interested in, I was planning on doing a couple of sweeps through the data set.  The first sweep would scan all rows.  In this pass I would build an in-memory list of the order in which to visit rows in the second pass.  At this point I could update a block cluster column in the data set with knowledge from the first sweep, and use a block caching technique like you suggest.  However, it is possible that the cost of updating every row might not be worth the savings in access time.

If I didn't want to update the table with my clustering information, is there any way to improve the query to grab multiple records in one go?  For example, if I know I want to process features 1, 47, 68 and 71 together, is there a form of query that I could use to grab these records that would be more efficient than
SELECT * FROM table WHERE OBJECTID = 1 or OBJECTID = 47 or OBJECTID = 68 or OBJECTID = 71

Sorry if this is a dumb question, but I am not an SQL guy, and the efficient methods of dealing with record sets elude me.

Cheers,

Tom
VinceAngelo
Esri Esteemed Contributor
I don't have the source where I am at the moment, but block calculation was done
on the fly (id 87 w/ block size 20 and offset 1 is in the page 81-100):
((87 - 1) / 20) * 20 + 1 = 81
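Or as a function (same integer arithmetic):

// BlockStart(87, 20, 1) == 81
int BlockStart(int id, int blockSize, int offset)
{
    return ((id - offset) / blockSize) * blockSize + offset;
}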


OR queries are generally inefficient in SQL (each gets its own pass through the B-tree,
though there is no real SQL engine here, just the SQL-like FGDB interface). I didn't
compare the performance of a sparse return set, though in theory, the blocking code
I wrote would handle that (I used a bitmap in each blocked page to indicate rows
returned). In general, the idea behind blocking was to read IDs 1-200 upon
encountering a request for ID=1, on the assumption that requests for 47, 68, and 71
would follow.
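The miss path, then, fetches the whole block with one range query -- roughly
(a sketch; 'mytable' is a placeholder, and what you copy out of each Row is
up to you):

#include <sstream>
#include <vector>
#include <FileGDBAPI.h>
using namespace FileGDBAPI;

// Whatever per-row data you keep; only the OID is shown here.
struct CachedRow { int32 id; };

// Load the blockSize-row block containing 'id' in one query, so the
// expected requests for nearby ids become cache hits.
fgdbError LoadBlock(Geodatabase& geodatabase, int id, int blockSize,
                    std::vector<CachedRow>& block)
{
    int start = ((id - 1) / blockSize) * blockSize + 1;
    std::wostringstream sql;
    sql << L"SELECT * FROM mytable WHERE OBJECTID >= " << start
        << L" AND OBJECTID <= " << (start + blockSize - 1);
    EnumRows rows;
    fgdbError hr = geodatabase.ExecuteSQL(sql.str(), true, rows);
    if (hr != S_OK)
        return hr;
    Row row;
    while (rows.Next(row) == S_OK)
    {
        CachedRow cached;
        row.GetOID(cached.id);   // copy out the fields you need here
        block.push_back(cached);
    }
    rows.Close();
    return S_OK;
}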

- V

PS: I don't see any benefit to two passes -- there isn't anything you'd learn in the
first pass (besides OS caching) that would make the second pass through a fragmented
table easier.
TomMoore
New Contributor
One of the things I want to do is geometry-ordered sweeps through data sets.  The first pass will be sequential and I'll capture bounding boxes and some attributes of interest.  In the second pass I'll go through again and compute metrics of interest on neighbourhoods of features, based on the neighbour relationships determined in the first pass.  Doing this in one pass would require keeping a lot of active objects.  My suspicion was that two passes would be better behaved in terms of memory usage.  Two passes certainly will be simpler to implement.  Premature optimization perhaps.

So, what I was hoping to do was to efficiently read in my neighbour sets in the second pass, rather than reading rows one by one in non-sequential order.  It does not sound like there is a way to speed things up without making assumptions about ordering, so I think I'll just give it a try and see what happens.
VinceAngelo
Esri Esteemed Contributor
The solution to spatial fragmentation is defragmenting the table, but there are no simple tools
to do this (I started on an ArcSDE defragmenter, but never got a chance to complete it).  In
theory, it's a simple task -- walk all the index cells in order, querying all features in the cell,
and writing features that don't overlap with a previously processed cell -- but using more RAM
in your cache is likely to be faster (in terms of time to completion of the first processing run).

- V
DavidSousa
New Contributor III
Instead of a WHERE clause with a bunch of OR's in it, I suggest you use an IN predicate.  Instead of this:

    WHERE OBJECTID = 1 or OBJECTID = 47 or OBJECTID = 68 or OBJECTID = 71

Try this:

    WHERE OBJECTID IN (1, 47, 68, 71)

For any indexed field, this will tend to execute more quickly.
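If you are assembling the list in code, a simple helper can build the
predicate (just a sketch):

    #include <sstream>
    #include <string>
    #include <vector>

    // Builds "OBJECTID IN (1, 47, 68, 71)" from a list of ids.
    std::wstring BuildInClause(const std::vector<int>& ids)
    {
        std::wostringstream where;
        where << L"OBJECTID IN (";
        for (size_t i = 0; i < ids.size(); ++i)
        {
            if (i > 0)
                where << L", ";
            where << ids[i];
        }
        where << L")";
        return where.str();
    }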

For range queries, try the BETWEEN predicate.  Instead of this:

    WHERE OBJECTID >= 1 AND OBJECTID <= 47

Try this:

    WHERE OBJECTID BETWEEN 1 AND 47

This is again assuming the existence of an index.