Spearman's rank or Pearson's correlation coefficient?

10-25-2015 02:47 AM
CharlotteJ
New Contributor II

This isn't so much a GIS question, but I'd like to correlate the percentage of the population of census output areas who are children with mean distances to green spaces, and I'm unsure whether to use Spearman's rank or Pearson's correlation coefficient. I understand that Spearman's rank is best used for ordinal variables, which I don't think either of these two are, so perhaps it's better to use Pearson's? But I've also read that Pearson's is better used when the relationship between the variables is linear, which I'm not sure it is in this case, so I'm unsure which option is better. Any advice would be much appreciated.

23 Replies
CharlotteJ
New Contributor II

Sorry I'm not sure that I understand your first point, I decided on the classifications myself by just dividing up the range covered by each of the variables, so that each separate category covered an equal range. I thought this would be the clearest way of showing how access and the distributions of each group vary across the city. Or would you suggest using natural breaks in the data to determine the classifications?
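For readers following along, the equal-range classification described above can be sketched roughly as follows. This is only an illustration: the function names, the number of classes and the percentages are made up, not taken from the study.

```python
def equal_interval_breaks(values, n_classes):
    """Return the upper bound of each class when the data range is split evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    return [lo + width * (i + 1) for i in range(n_classes)]


def classify(value, breaks):
    """Return the 0-based index of the first class whose upper bound covers value."""
    for i, upper in enumerate(breaks):
        if value <= upper:
            return i
    return len(breaks) - 1  # guard against floating-point overshoot at the maximum


# Hypothetical percentages of children per output area (made-up numbers).
pct_children = [4.0, 9.5, 12.0, 18.0, 21.5, 24.0]
breaks = equal_interval_breaks(pct_children, 4)
classes = [classify(v, breaks) for v in pct_children]
```

Natural breaks would instead place the class boundaries at gaps in the data, so the two schemes can assign the same output area to different classes when the values are strongly clustered.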

And yes, dropping the decimals sounds sensible, and possibly using absolute numbers instead of percentages also seems to make sense. Because I suppose that one area may have a lower number of older people than another, but the percentage may be higher. Or perhaps mapping population density may be sensible? Then the area of each of the output areas can also be accounted for, as well as the absolute populations, so may be a better indicator of the distribution of each social group.

And thank you for your other points, I'll take them all on board. I'm intending to write about limitations and uncertainties such as these in my discussion, as I understand that they're all valid points. And regarding your 6th point, I only mapped spaces that I understood were publicly accessible.

DanPatterson_Retired
MVP Emeritus

Your classification is fine...albeit qualitative, and it probably doesn't differ much from natural breaks...of course you could make a statement to that effect should you have investigated it (remember...I am playing devil's advocate).

Population density would be a reasonable compromise and may yield other information. You may find that total population density reveals a different pattern with green space itself.

In reference to the 6th point...it may be worthwhile to remove spaces where people don't live and green space couldn't exist...examples given in my list. This will also affect population density. Consider a 1 km^2 area, 95% covered by government buildings, 2.5% green space and the remaining 2.5% a seniors' residence...your conclusion would be?
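To put rough numbers on the 1 km^2 example, here is a small sketch. The land-use shares come from the example above; the resident count of 200 is purely an assumption for illustration.

```python
# Land-use shares from the example above; the resident count is hypothetical.
area_km2 = 1.0
residential_share_pct = 2.5   # the seniors' residence
residents = 200               # assumed, not from any real output area

# Gross density spreads the residents over the whole output area.
gross_density = residents / area_km2

# Net density only counts the land where people actually live.
residential_km2 = area_km2 * residential_share_pct / 100
net_density = residents / residential_km2

print(f"gross: {gross_density:.0f} per km^2, net: {net_density:.0f} per km^2")
```

With these made-up figures the net density is 40 times the gross density, which is why removing non-residential land before calculating density can change the pattern that gets mapped.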

The nice thing about talking about the limitations up front is that you show that some thought went into the whole project...it won't ever be perfect...in 5 years you will think of something else you could have done...in 10 maybe more...but you have to give it up sometime. Your only obstacle right now is to get a thesis done and a defense (if applicable) completed by warding off the external advisor who asks those innocuous questions that you hadn't thought of. The one that threw me was ... and how does your work fit into the bigger picture of (your program here)????

DanPatterson_Retired
MVP Emeritus

PS

make sure you are aware of the capabilities of the two main toolsets

An overview of the Spatial Statistics toolbox—Help | ArcGIS for Desktop

An overview of the Geostatistical Analyst toolbar and toolbox—Help | ArcGIS for Desktop

so you aren't caught off guard.  you can dismiss an approach by putting the analysis in context of the importance to your discussion

CharlotteJ
New Contributor II

Ok then, and yes perhaps I will change the percentages for each output area to population densities then.

Thank you for your advice, yes as long as I show that I've acknowledged any uncertainties surrounding my study, I think this will be ok.

And in terms of where my work fits into the bigger picture of geography, I suppose that measuring accessibility to services is important for comparing the spatial distribution of demand relative to supply, and for looking at how access can vary spatially. As access to green spaces is generally considered to be associated with improvements in wellbeing, assessing the adequacy and equality of access across cities seems important for investigating whether people have equal opportunities to gain these benefits, regardless of where they live. I can then use my assessments to identify areas which are potentially in greatest need of improvements to access. Hope that addresses the question.

Throughout my study I've measured distances to different types of spaces and calculated the percentages of the population for which different accessibility standards set for green spaces have been met. However, I feel that these measures are all quite simplistic, since my results only really consist of average distances and percentages. This is why I was hoping to bring in some statistical analysis by testing how the distributions of social groups vary with access, and therefore whether green spaces are well located relative to demand. But would you suggest that this isn't really worthwhile, since there clearly won't be strong relationships between the variables? And if I do decide to use Pearson's to measure correlations between the variables, would you mind explaining what would be involved in transforming my data, as you mentioned previously? You suggested logging the percentages rather than using the original values?

DanPatterson_Retired
MVP Emeritus

I think your background support is fine

When you talk about distance, you have been using Euclidean distance (i.e. crow-flies distance) and not network distance (i.e. travel along roads). This can be mentioned, but do not do it, since it opens up a whole can of worms and will mask anything that you would gain.

As for transformation...I was trying to make the point that people will go to elaborate lengths to use parametric statistics (i.e., simplistically, those for data which has a normal distribution) rather than their non-parametric equivalents (i.e. Pearson's versus Spearman's) because they think that parametric statistics are somehow superior...they aren't. So if you can't explain what ... log(%>65) + 0.25 really means, then don't go there. (The answer, by the way, is ... the distribution is totally weird and this is the equation that made it look normal. Other things I have heard ... my advisor told me to normalize it. Or ... not sure, everyone uses Pearson's, don't they?!?)

Finally, you aren't studying a correlation, you are studying an association ... should an examining board start arguing over the semantics, just let them go at it and keep out of the fray; it boils down to causality at times.
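One way to see why the transformation buys nothing for a rank-based statistic: a log transform is monotonic, so it leaves the ranks untouched. The sketch below assumes the natural log and uses made-up percentages; the `ranks` helper is hypothetical and simply assigns average ranks to ties.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


# Hypothetical percentages of residents over 65 in six output areas.
pct_over_65 = [2.0, 3.5, 3.5, 8.0, 15.0, 40.0]
logged = [math.log(p) + 0.25 for p in pct_over_65]

raw_ranks = ranks(pct_over_65)
logged_ranks = ranks(logged)
```

Both lists of ranks come out identical, so any statistic computed from the ranks gives the same answer before and after the transformation.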

CharlotteJ
New Contributor II

Oh good. And I have actually been measuring distances by network distance, using ArcMap's Network Analyst extension. I see, so do you think it would be fine to use Spearman's rank over Pearson's then? I would do chi-squared, but I don't feel very confident with it as I've never done it before. But I didn't think that Spearman's should be used for testing the association between 2 ratio variables (distance and percentage), as I was under the impression that at least one variable should be ordinal in order to use it?

DanPatterson_Retired
MVP Emeritus

Personally, I would like to see you class the data into groups and run the chi square test...in spreadsheets

=CHISQ.TEST there is an example in there.  or for spearman's...  Dr Google has many links.

http://blog.excelmasterseries.com/2014/05/spearman-correlation-coefficient-in.html
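For anyone who would rather see the arithmetic behind the chi-square test than rely on a spreadsheet function, here is a minimal sketch. The 2x2 table of counts is invented for illustration, and the `chi_square` function is a hypothetical helper, not part of any ArcGIS or Excel tool.

```python
def chi_square(observed):
    """Chi-square statistic, degrees of freedom and expected counts for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count for each cell = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical counts of output areas.
# Rows: share of children (low, high); columns: distance to green space (near, far).
observed = [[30, 10], [20, 40]]
statistic, dof, expected = chi_square(observed)
print(f"chi-square = {statistic:.2f} with {dof} degree(s) of freedom")
```

With 1 degree of freedom the 5% critical value is 3.841, so a statistic above that would suggest the two classifications are associated. In a spreadsheet, CHISQ.TEST takes the observed and expected ranges and returns the corresponding p-value directly.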

CharlotteJ
New Contributor II

Ok then, I'll give it a go if that would be more appropriate. And thank you for the link. But just for future reference, I've done quite a bit of google searching but am still a bit confused as to whether it is alright to use 2 ratio variables to carry out spearman's rank. Is it ok to do this then, even if one of the variables isn't ordinal?

DanPatterson_Retired
MVP Emeritus

No..... that is why you have to rank them before running Pearson's, since Spearman's is the correlation between the ranks and not the actual data itself... although Chi is the best since it is an association/difference test and not a correlation, which implies something else.... trust me...classify your data, get the counts in each class, the expected values are easy to determine then.... it's Chi time!!!
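The "rank first, then Pearson's" idea can be sketched in a few lines. The data values are invented, and `pearson`, `ranks` and `spearman` are hypothetical helper names written for this illustration only.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical output areas: % children and mean network distance (metres).
pct_children = [5.0, 8.0, 12.0, 15.0, 22.0]
mean_distance_m = [900.0, 700.0, 650.0, 300.0, 120.0]

r_pearson = pearson(pct_children, mean_distance_m)
r_spearman = spearman(pct_children, mean_distance_m)
```

In this made-up data, distance falls every time the percentage of children rises, so Spearman's comes out at -1 (to within rounding) while Pearson's is weaker because the relationship is not a straight line. Ratio variables are ranked in exactly the same way as ordinal ones, so neither input has to be ordinal to begin with.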

DanPatterson_Retired
MVP Emeritus

I also wonder why no one else is chiming in here? Is everyone else mathphobic? I hope you are running these ideas by your advisor...I want to make sure you are on her/his page and not just listening to me...

Also should you have some time and want some amusement and/or eye openers, you should check out

Academia on Stack Exchange

It often provides me with much merriment and often insight.
