11 Replies Latest reply on Dec 27, 2011 4:21 PM by sarah192837

    Pearson correlation in raster datasets

    alanfer17
      Dear all,

      I need to make a correlation between two raster datasets. I want to use Raster Calculator, but I'm still finding a bit difficult to work it out. I have the formula from statistics books, and I know that I'm required to multiply, and add the grids, right? But it's not so straight forward.
      How am I supposed to find N, for instance? Should I just multiply the Columns and Rows from my grid to see how many pixels I have on the dataset?
      The formula for the Pearson correlation coefficient is quite long and I don't think I should attempt to do everything in one go. I reckon I should do it by different stages.
      Any ideas would be very welcome!

      Just a couple more things about the two datasets: one of them is a NDVI that I acquired from LANDSAT 5TM, and the other raster has been originated from an interpolation of points (rainfall gauges). I have used Mask on both grids, therefore they have the same number or rows and columns (same number of pixels?).

      Thanks for your time.
      Alan.
        • Re: Pearson correlation in raster datasets
          alanfer17
          Bill,

          Thank you very much for your reply. I will try what you suggested and I should be coming back here with either some results for discussions (or more questions!) if that's okay.

          Just a couple of things regarding the interpolation. My area of study is data scarce and I didn't have a choice. I would have only three points of NDVI where the data was actually collected. I think I will take the risk and still use the interpolation/NDVI analysis - or do you think the results would be worthless?

          Regards,
          Alan.
          • Re: Pearson correlation in raster datasets
            alanfer17
            Hi again Bill, I hope you'll see this.

            I'm trying to get my head around some of your instructions, but with no success. Now I have the indicator grid for the overlapping datasets, but I'm very confused with how to use it in the calculations. I don't think I quite understood the following:

            �?�Wherever you see an ordinary unary or binary arithmetic operation in the formula (including, but not limited to, negation, addition, subtraction, multiplication, division, and powers), use the same one in Map Algebra.
            �?�Wherever you see a sum over cases compute a zonal sum. If the sum is divided by N it's really an average so just use a zonal average for the whole thing
            That went straight over my head - really sorry.
            If you have the chance to clarify that - it would be so much appreciated. I will keep on trying in the meantime.

            Regards,
            Alan.
            • Re: Pearson correlation in raster datasets
              alanfer17
              Thanks a lot for that, Bill. You've been a great help. Very much appreciated.
              • Re: Pearson correlation in raster datasets
                alanfer17
                Bill,

                Following the instructions you gave me above, I still have some doubts if what I'm doing is correct.
                I followed the steps, but some there's some confusion:

                In the formula xi_i = (x_i - m)/s, for instance, the zone grid is x_i, correct? I'm not sure how to obtain that zone grid (and I'm sure this is really simple!). Is this anything to do with the indicator grid I created initially?

                Another question is regarded the zonal average of [xi]*[eta] that you mentioned. Do I obtain that by simply multiplying those two grids and dividing it by N?
                • Re: Pearson correlation in raster datasets
                  alanfer17
                  Hi Bill,

                  I think I've figured it out now. And yes, it was very simple. But feel free to make any comments. I'm new to all this, so any reassurance is more than welcome.

                  Many thanks.
                  Alan.
                  • Re: Pearson correlation in raster datasets
                    alanfer17
                    Bill,

                    I'm getting very strange results for the correlations, so I'm assuming there's something wrong with the way I'm creating the zone grid, unfortunately. I don't seem to get this one constant value at every cell that you mentioned above.
                    Is the formula for the standardized values xi_i = (x_i - m)/s the one that should be used to create this grid?
                    I've used the indicator to produce grids of the means ([mX] and [mY], and standard deviations ([sX] and [sY]). But in the formula above, what will be the x_i? I know this should be a very straight forward operation, but I'm sure what I'm doing is wrong.

                    Thank you.
                    Alan.
                    • Re: Pearson correlation in raster datasets
                      alanfer17
                      Bill,
                      My question remains:
                      if I�??m computing Z as ([x] - [m])/[s] for grids [x] (the original data), [m] (a constant grid equal to the mean of [x] throughout the zone), and [s] (a constant grid equal to the sd of [x] within the zone),
                      by [x] (original data) do you mean:
                      1) �?? the sum of [x] using the INDICATOR grid (created earlier) in the �??feature zone�?� field on SA Zonal Statistics, VALUE in the SA �??zone field�?�, and the ORIGINAL DATA [X] as the �??input raster�?� field?
                      If that�??s the case, my result will be a range of values (high to low), which will give me 2 values for a correlation in the end. Note that I will be doing the same thing for the [Y] original value, which will also give me �??high to low�?� values.
                      2) �?? if by [x] (original data) you mean the raw raster data (say, my NDVI data), I will also have �??high and low�?� values.
                      Basically, my only question is: what exactly is [x] in the formula ([x] - [m])/[s]?

                      Sorry for being such a hard work.
                      • Re: Pearson correlation in raster datasets
                        philmorefield
                        This is a great example of how incorporating R Statistical Software into ArcGIS would be a huge benefit. R has no problem calculating correlations on data that are provided as an "array", i.e. rows and columns. There is no need to reinvent the wheel, although it does make for an interesting excercise.

                        My advice would be to abandon ArcGIS for the heavy lifting and just download R. It's notoriously ugly to program, but with a few Google searches I bet you could write correlation scripts in a couple of hours. Using Python, R, and rpy2 I've written script tools to automate this procedure. But....

                        One issue I haven't been able to resolve is the reliability of correlation coefficients when the data are spatially autocorrelated, which may be a problem.

                        I'll get off my soapbox now, but this is an area in which I would love to see more improvement.
                        • Re: Pearson correlation in raster datasets
                          GuyDuke
                          In ArcInfo workstation this was as easy as typing the following at the grid prompt.

                          correlation grid1 grid2

                          ESRI - why have we lost functions with ArcGIS?
                          • Re: Pearson correlation in raster datasets
                            jberryma
                            What is the difference between using the above procedure and using band statistics to calculate the correlation between rasters?  I am hoping to find the correlations between my environmental variables prior to further analysis in maxent.  Any thoughts would be greatly appreciated.
                                Thanks,
                                     Jessica
                            • Re: Pearson correlation in raster datasets
                              sarah192837
                              Like Jessica I am also trying to find which of my environmental layers are highly correlated before using them in Maxent. The correlation matrix that it produces has many values outside of the -1 to 1 range, can anyone tell me what the problem is there?  (I'm using the worldclim 1-19 layers, 30arc seconds, clipped to Europe).