Pearson correlation in raster datasets

AlanFernandes · ‎09-27-2010

Dear all,

I need to make a correlation between two raster datasets. I want to use Raster Calculator, but I'm still finding a bit difficult to work it out. I have the formula from statistics books, and I know that I'm required to multiply, and add the grids, right? But it's not so straight forward.
How am I supposed to find N, for instance? Should I just multiply the Columns and Rows from my grid to see how many pixels I have on the dataset?
The formula for the Pearson correlation coefficient is quite long and I don't think I should attempt to do everything in one go. I reckon I should do it by different stages.
Any ideas would be very welcome!

Just a couple more things about the two datasets: one of them is a NDVI that I acquired from LANDSAT 5TM, and the other raster has been originated from an interpolation of points (rainfall gauges). I have used Mask on both grids, therefore they have the same number or rows and columns (same number of pixels?).

Thanks for your time.
Alan.

AlanFernandes · ‎09-27-2010

Bill,

Thank you very much for your reply. I will try what you suggested and I should be coming back here with either some results for discussions (or more questions!) if that's okay.

Just a couple of things regarding the interpolation. My area of study is data scarce and I didn't have a choice. I would have only three points of NDVI where the data was actually collected. I think I will take the risk and still use the interpolation/NDVI analysis - or do you think the results would be worthless?

Regards,
Alan.

AlanFernandes · ‎09-28-2010

Hi again Bill, I hope you'll see this.

I'm trying to get my head around some of your instructions, but with no success. Now I have the indicator grid for the overlapping datasets, but I'm very confused with how to use it in the calculations. I don't think I quite understood the following:

�?�Wherever you see an ordinary unary or binary arithmetic operation in the formula (including, but not limited to, negation, addition, subtraction, multiplication, division, and powers), use the same one in Map Algebra.
�?�Wherever you see a sum over cases compute a zonal sum. If the sum is divided by N it's really an average so just use a zonal average for the whole thing
That went straight over my head - really sorry.
If you have the chance to clarify that - it would be so much appreciated. I will keep on trying in the meantime.

Regards,
Alan.

AlanFernandes · ‎09-28-2010

Thanks a lot for that, Bill. You've been a great help. Very much appreciated.

AlanFernandes · ‎09-30-2010

Bill,

Following the instructions you gave me above, I still have some doubts if what I'm doing is correct.
I followed the steps, but some there's some confusion:

In the formula xi_i = (x_i - m)/s, for instance, the zone grid is x_i, correct? I'm not sure how to obtain that zone grid (and I'm sure this is really simple!). Is this anything to do with the indicator grid I created initially?

Another question is regarded the zonal average of [xi]*[eta] that you mentioned. Do I obtain that by simply multiplying those two grids and dividing it by N?

AlanFernandes · ‎09-30-2010

Hi Bill,

I think I've figured it out now. And yes, it was very simple. But feel free to make any comments. I'm new to all this, so any reassurance is more than welcome.

Many thanks.
Alan.

AlanFernandes · ‎10-03-2010

Bill,

I'm getting very strange results for the correlations, so I'm assuming there's something wrong with the way I'm creating the zone grid, unfortunately. I don't seem to get this one constant value at every cell that you mentioned above.
Is the formula for the standardized values xi_i = (x_i - m)/s the one that should be used to create this grid?
I've used the indicator to produce grids of the means ([mX] and [mY], and standard deviations ([sX] and [sY]). But in the formula above, what will be the x_i? I know this should be a very straight forward operation, but I'm sure what I'm doing is wrong.

Thank you.
Alan.

AlanFernandes · ‎10-06-2010

Bill,
My question remains:
if I�??m computing Z as ( - )/ ~~for grids (the original data), (a constant grid equal to the mean of throughout the zone), and ~~(a constant grid equal to the sd of within the zone),~~~~
by (original data) do you mean:
1) �?? the sum of using the INDICATOR grid (created earlier) in the �??feature zone�?� field on SA Zonal Statistics, VALUE in the SA �??zone field�?�, and the ORIGINAL DATA as the �??input raster�?� field?
If that�??s the case, my result will be a range of values (high to low), which will give me 2 values for a correlation in the end. Note that I will be doing the same thing for the original value, which will also give me �??high to low�?� values.
2) �?? if by (original data) you mean the raw raster data (say, my NDVI data), I will also have �??high and low�?� values.
Basically, my only question is: what exactly is in the formula ( - )/?

Sorry for being such a hard work.

PhilMorefield · ‎11-24-2010

This is a great example of how incorporating R Statistical Software into ArcGIS would be a huge benefit. R has no problem calculating correlations on data that are provided as an "array", i.e. rows and columns. There is no need to reinvent the wheel, although it does make for an interesting excercise.

My advice would be to abandon ArcGIS for the heavy lifting and just download R. It's notoriously ugly to program, but with a few Google searches I bet you could write correlation scripts in a couple of hours. Using Python, R, and rpy2 I've written script tools to automate this procedure. But....

One issue I haven't been able to resolve is the reliability of correlation coefficients when the data are spatially autocorrelated, which may be a problem.

I'll get off my soapbox now, but this is an area in which I would love to see more improvement.

GuyDuke · ‎11-29-2010

In ArcInfo workstation this was as easy as typing the following at the grid prompt.

correlation grid1 grid2

ESRI - why have we lost functions with ArcGIS?