So I invested a little hobby time in this question, to determine the benefit of caching on runtime performance.

First, I adapted some tools and created a benchmark file geodatabase containing fifteen tables, in three sizes of five layouts. The first layout contained a sequential list, where row 1 pointed to row 2, row 2 to row 3, and so on until the last row, which pointed back to row 1:
1    2
2    3
...
N-1  N
N    1
The second layout was in reverse order:
1    N
2    1
...
N-1  N-2
N    N-1
The third layout was in random order:
1 7
2 4
3 6
4 5
5 1
6 2
7 3
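Each of these layouts forms a single cycle, so walking the "next" pointers from any row visits every row exactly once. For reference, a minimal sketch of how the first three layouts can be generated (illustrative C++, not the tools I actually adapted):

    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Print (row, next) pairs for the sequential, reverse, and random
    // layouts. Each is a single cycle over rows 1..n.
    void emitLayouts(int n) {
        for (int i = 1; i <= n; ++i)            // layout 1: sequential
            std::printf("seq %d %d\n", i, i % n + 1);
        for (int i = 1; i <= n; ++i)            // layout 2: reverse
            std::printf("rev %d %d\n", i, i == 1 ? n : i - 1);

        // Layout 3: a shuffled permutation, read as a cycle so the chain
        // still visits every row exactly once.
        std::vector<int> p(n);
        for (int i = 0; i < n; ++i) p[i] = i + 1;
        std::mt19937 rng(42);                   // fixed seed: repeatable
        std::shuffle(p.begin(), p.end(), rng);
        for (int i = 0; i < n; ++i)
            std::printf("rnd %d %d\n", p[i], p[(i + 1) % n]);
    }

    int main() { emitLayouts(7); }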
The fourth and fifth layouts were also in random order, but clustered so that the selection window was limited to twenty-five or nine neighboring cells, to capture the impact of blocking the data in 200-row chunks. Each layout was generated with 10,000 rows (s1), 100,000 rows (s2), and 1 million rows (s3), and merged with 40 additional columns to reflect ancillary data in the tables.

Then I whipped up a benchmarking C++ application, which operates in three modes (each is sketched after this list):

SEQ    Sequential read of table contents, without regard for the "next" row
ID     Simple row seeking, one query per row
RANGE  Blocked row seeking, one query per block of a configurable size, with
       an optional number of cached blocks
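For context, here is roughly what each mode does in FileGDB API terms -- the same table.Search / enumRows.Next / row.Get calls the timing columns below measure. This is a compressed sketch from memory, not the actual benchmark source: the "NEXT" field name is my placeholder, signatures may differ slightly, and error handling is omitted.

    #include <cwchar>
    #include <vector>
    #include <FileGDBAPI.h>
    using namespace FileGDBAPI;

    // SEQ: a single Search over the whole table (H = 1), ignoring the chain.
    void seqMode(Table& table) {
        EnumRows rows;
        table.Search(L"NEXT", L"", true, rows);
        Row row;
        int32 next;
        while (rows.Next(row) == S_OK)
            row.GetInteger(L"NEXT", next);      // G: one Get per row
    }

    // ID: chase the chain one row at a time -- one Search per row (H = F).
    void idMode(Table& table, int32 startId, int rowCount) {
        int32 id = startId;
        for (int i = 0; i < rowCount; ++i) {
            wchar_t where[64];
            swprintf(where, 64, L"OBJECTID = %d", id);
            EnumRows rows;
            table.Search(L"NEXT", where, true, rows);
            Row row;
            if (rows.Next(row) == S_OK)
                row.GetInteger(L"NEXT", id);    // follow the chain
        }
    }

    // RANGE: one Search per block; fetched blocks go into the LRU cache
    // described below, so revisits within a block cost no query.
    void fetchBlock(Table& table, int32 blockStart, int32 blockSize,
                    std::vector<int32>& next) {
        wchar_t where[96];
        swprintf(where, 96, L"OBJECTID >= %d AND OBJECTID < %d",
                 blockStart, blockStart + blockSize);
        EnumRows rows;
        table.Search(L"NEXT", where, true, rows);
        Row row;
        next.assign(blockSize, -1);
        while (rows.Next(row) == S_OK) {
            int32 oid, n;
            row.GetOID(oid);
            row.GetInteger(L"NEXT", n);
            next[oid - blockStart] = n;
        }
    }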
The cache uses a "least recently used" (LRU) replacement strategy, so that each block that is used is promoted to the front of the cache, and the last block's contents are replaced if no cache hit was found (a minimal sketch follows the column list below).

The benchmarking tool reports the following information:

A - table name (tN_sM)
B - mode
C - number of columns*
D - size of cache blocks (in rows)
E - number of blocks in the cache
F - number of rows read
G - number of row.Get{datatype} executions
H - number of table.Search invocations
I - query time (time spent in table.Search and enumRows.Next)
J - get I/O time (time spent in row.Get{datatype})
K - elapsed time (total time to execute)
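The cache itself is generic. Here's a minimal sketch of the LRU block cache as described above (my reconstruction, not the benchmark source); blockSize_ and maxBlocks_ correspond to columns D and E, and note the linear scan, which is what makes misses more expensive as the cache grows:

    #include <cstdint>
    #include <list>
    #include <vector>

    // LRU cache of fixed-size row blocks. A hit promotes the block to the
    // front; a miss (after scanning every block) has the caller fetch the
    // block and insert() it, evicting the back (least recently used) entry.
    class BlockCache {
    public:
        BlockCache(int32_t blockSize, size_t maxBlocks)
            : blockSize_(blockSize), maxBlocks_(maxBlocks) {}

        // Returns the cached "next" value for a 1-based objectId, or -1 on
        // a miss. The scan cost grows with cache size, which is why extra
        // blocks can hurt when the hit rate is low.
        int32_t lookup(int32_t objectId) {
            int32_t key = (objectId - 1) / blockSize_;
            for (auto it = blocks_.begin(); it != blocks_.end(); ++it) {
                if (it->key == key) {
                    blocks_.splice(blocks_.begin(), blocks_, it); // promote
                    return it->rows[(objectId - 1) % blockSize_];
                }
            }
            return -1;  // miss: caller fetches the block, then insert()s it
        }

        // Install a freshly fetched block, evicting the LRU block if full.
        void insert(int32_t key, std::vector<int32_t> rows) {
            if (blocks_.size() == maxBlocks_)
                blocks_.pop_back();
            blocks_.push_front(Block{key, std::move(rows)});
        }

    private:
        struct Block { int32_t key; std::vector<int32_t> rows; };
        int32_t blockSize_;
        size_t maxBlocks_;
        std::list<Block> blocks_;  // front = most recently used
    };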
Then I started benchmarking on my home PC (6x 2400MHz AMD, 8GB RAM, 1000GB @ 7200RPM disk, running CentOS 5.8), by just measuring the time it took to read tables of various sizes:

A B C D E F G H I J K
t1_s1 SEQ 3 0 0 10k 30k 1 0.0ms 17.7ms 25.81ms
t1_s2 SEQ 3 0 0 100k 300k 1 0.0ms 162.4ms 171.56ms
t1_s3 SEQ 3 0 0 1m 3m 1 0.0ms 1.6sec 1.58sec
Analysis: So far, so good -- sequential query time is linear (in fact, I got the same performance from a solid-state disk, so the OS is caching all disk access).

A B C D E F G H I J K
t1_s1 ID 3 0 0 10k 30k 10k 2.4sec 22.8ms 2.46sec
t2_s1 ID 3 0 0 10k 30k 10k 2.4sec 22.3ms 2.47sec
t3_s1 ID 3 0 0 10k 30k 10k 2.4sec 23.2ms 2.46sec
t4_s1 ID 3 0 0 10k 30k 10k 2.4sec 22.2ms 2.45sec
t5_s1 ID 3 0 0 10k 30k 10k 2.4sec 23.1ms 2.45sec
t1_s2 ID 3 0 0 100k 300k 100k 24.1sec 223.8ms 24.35sec
t2_s2 ID 3 0 0 100k 300k 100k 24.1sec 232.8ms 24.31sec
t3_s2 ID 3 0 0 100k 300k 100k 24.4sec 232.6ms 24.60sec
t4_s2 ID 3 0 0 100k 300k 100k 24.0sec 221.0ms 24.21sec
t5_s2 ID 3 0 0 100k 300k 100k 24.1sec 219.7ms 24.36sec
t1_s3 ID 3 0 0 1m 3m 1m 4.0min 2.3sec 4.02min
t2_s3 ID 3 0 0 1m 3m 1m 4.1min 2.3sec 4.13min
t3_s3 ID 3 0 0 1m 3m 1m 4.0min 2.3sec 4.02min
t4_s3 ID 3 0 0 1m 3m 1m 4.0min 2.2sec 4.05min
t5_s3 ID 3 0 0 1m 3m 1m 4.0min 2.3sec 4.02min
Analysis: Again the execution time is linear (240 seconds = 4 minutes), but almost all of the time is being spent in the query. So what's the impact of adding a cache?

A B C D E F G H I J K
t1_s1 RANGE 3 200 1 10k 30k 50 754.6ms 16.1ms 778.98ms
t2_s1 RANGE 3 200 1 10.2k 30.6k 51 0.8sec 17.0ms 0.82sec
t3_s1 RANGE 3 200 1 1.96m 5.87m 9.8k 2.6min 3.3sec 2.64min
t4_s1 RANGE 3 200 1 655k 1.97m 3.3k 49.9sec 1.0sec 50.91sec
t5_s1 RANGE 3 200 1 873k 2.62m 4.4k 1.1min 1.4sec 1.12min
t1_s1 RANGE 3 200 9 10k 30k 50 799.2ms 16.3ms 0.82sec
t2_s1 RANGE 3 200 9 10.0k 30.1k 51 767.8ms 16.0ms 791.88ms
t3_s1 RANGE 3 200 9 1.62m 4.87m 8.1k 2.1min 2.6sec 2.13min
t4_s1 RANGE 3 200 9 19k 58k 97 1.5sec 32.0ms 1.52sec
t5_s1 RANGE 3 200 9 10k 30k 50 762.1ms 16.4ms 781.71ms
t3_s1 RANGE 3 200 15 1.38m 4.15m 6.9k 1.8min 2.3sec 1.79min
t4_s1 RANGE 3 200 15 10k 30k 50 764.2ms 16.3ms 783.65ms
t5_s1 RANGE 3 200 15 10k 30k 50 770.3ms 15.5ms 788.86ms
t3_s2 RANGE 3 200 15 1.94m 5.81m 96.7k 191min 32.9sec 191.02min
t4_s2 RANGE 3 200 15 100k 300k 500 57.3sec 157.2ms 57.44sec
t5_s2 RANGE 3 200 15 100k 300k 500 57.7sec 160.4ms 57.91sec
t3_s1 RANGE 3 200 25 984.2k 2.95m 4.9k 1.3min 1.7sec 1.31min
t4_s1 RANGE 3 200 25 10k 30k 50 745.9ms 16.5ms 765.51ms
t5_s1 RANGE 3 200 25 10k 30k 50 780.5ms 16.8ms 0.80sec
t3_s2 RANGE 3 200 25 18.9m 56.8m 94.7k 182min 30.2sec 182.61min
t4_s2 RANGE 3 200 25 100k 300k 500 55.5sec 166.1ms 55.71sec
t5_s2 RANGE 3 200 25 100k 300k 500 56.9sec 163.8ms 57.05sec
Analysis: Here's where it starts to get tricky -- when the data is tightly clustered (t1/t2), using even one buffer block greatly reduced the query time (2/3 less), but multiple buffers were not of much benefit (due to increased cache search time on misses). When the data was random, small numbers of cache blocks hurt performance (64x slower!), but as the randomness was reduced or the cache size was increased, caching started to deliver the same benefit over singleton queries.

Summary: You're looking at something on the order of 12 minutes of geodatabase I/O time to visit all rows in a 3 million row table once (assuming a relatively new computer with adequate RAM and a fast disk). That time could be reduced to ~4 minutes, but only if you do some preprocessing on the data to make sure it's clustered**, or increase the cache size (block size before number of blocks) until you get adequate performance (actually, you're probably better off giving it all the RAM you can, then backing off until it starts to choke).

*I wanted to research the impact of additional columns in the cache, but ran out of time.

**One possible metric to determine how clustered your data is would be to walk the From-To pairs and calculate a mean distance -- the closer to 1/2 the table length, the more fragmented it is. If you choose a block size of 1/3 the mean distance and nine blocks, you'll probably get good performance.
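To make that concrete, here's a small sketch of that metric (same illustrative C++; the pairs are the 7-row random layout from above):

    #include <cmath>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main() {
        // pairs[i] = the row that row i+1 points to (1-based IDs);
        // these are the From-To pairs of the 7-row random layout.
        std::vector<int> pairs = {7, 4, 6, 5, 1, 2, 3};
        int n = (int)pairs.size();

        double total = 0;
        for (int i = 0; i < n; ++i)
            total += std::abs((i + 1) - pairs[i]);
        double meanDist = total / n;

        // A mean near n/2 indicates heavy fragmentation; near 1, clustering.
        std::printf("mean distance: %.1f (n/2 = %.1f)\n", meanDist, n / 2.0);

        // Heuristic from the footnote: block size ~ 1/3 the mean distance,
        // with nine blocks.
        int blockSize = (int)std::ceil(meanDist / 3.0);
        std::printf("suggested cache: %d-row blocks x 9\n", blockSize);
        return 0;
    }

For the 7-row random layout, the mean distance works out to about 3.4 against n/2 = 3.5 -- nearly maximal fragmentation, which is consistent with how badly t3 responded to small caches.

- V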