Transformation inflated with 0's

01-03-2012 06:32 AM
BrookeHodge
New Contributor III
Hi,

I am having an issue using kriging on a dataset I have. My data is not normally distributed and it is seriously zero-inflated. I want to normalize it in order to use kriging, but because I have 0 values, my only option is a Normal Score transformation with simple kriging; however, even that doesn't seem to be helping much. Are there any suggestions anyone can give me for normalizing my data, or other interpolation options that might be better suited to my data?

Thank you in advance for your help.
Brooke
0 Kudos
10 Replies
EricKrause
Esri Regular Contributor
Do the zero and non-zero values cluster together?  In other words, does your map seem to be divided between regions of zeros and regions of non-zeros?  For example, we often see this with rainfall data: large areas of all zero (where it didn't rain) and large areas of all non-zero (where it did rain).
0 Kudos
BrookeHodge
New Contributor III
Hi,

Thank you for your response. Generally, yes. The data is of whale sightings, and there are areas where the whales are seen most often (and where the survey effort is highest), so there are generally a couple of pockets where the values are high, surrounded by a lot of 0's. The values are numbers of whales, adjusted for the level of survey effort. So there are areas where there is at least some effort but no whales sighted. These are most likely false 0's, meaning that those aren't necessarily areas with 0 whales, just areas where no whales were seen at the time of the survey. I'm also working on other models that might deal with this issue; however, my overall goal is to create an interpolated surface of whale distribution, which is why I'm trying to figure out how to make my data appropriate for a kriging method.

Thanks for your help on this,
Brooke
0 Kudos
EricKrause
Esri Regular Contributor
Would it be possible to send your data to ekrause@esri.com?  I have a couple ideas, but I need to see your data to know if they will work.

Even if you can't send your data, send me an email anyway, and I'll try to point you in the right direction.
0 Kudos
JeffreyEvans
Occasional Contributor III
Sun et al. (2003) proposed a double kriging approach in which you specify an indicator kriging model to derive a mask, then make your estimates using ordinary kriging with log-transformed data. I attached the Sun et al. (2003) paper.
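
To make the two-stage structure concrete, here is a minimal sketch, not the procedure from the paper. It assumes inverse distance weighting (`idw`) as a simple stand-in for both kriging steps, a hypothetical 0.5 cutoff for the mask, a naive `exp` back-transform with no bias correction, and made-up sample points. All function names are illustrative.

```python
import math

def idw(points, values, target, power=2.0):
    """Inverse-distance-weighted estimate (stand-in for a kriging predictor)."""
    num = den = 0.0
    for (x, y), v in zip(points, values):
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return v  # target coincides with a sample location
        w = d ** -power
        num += w * v
        den += w
    return num / den

def two_stage_estimate(points, values, target, threshold=0.5):
    """Stage 1 builds a presence mask; stage 2 interpolates logged non-zero values."""
    # Stage 1: interpolate the 0/1 indicator (1 = non-zero observation).
    indicator = [1.0 if v > 0 else 0.0 for v in values]
    if idw(points, indicator, target) < threshold:
        return 0.0  # masked out: treated as a zero region
    # Stage 2: interpolate log-transformed values from non-zero samples only.
    positive = [(p, math.log(v)) for p, v in zip(points, values) if v > 0]
    log_pred = idw([p for p, _ in positive], [lv for _, lv in positive], target)
    return math.exp(log_pred)  # naive back-transform (assumption: no bias correction)

# Hypothetical data: a cluster of zeros and a cluster of non-zero values.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
values = [0.0, 0.0, 0.0, 4.0, 2.0, 8.0]

inside_zero_region = two_stage_estimate(points, values, (0.5, 0.5))
inside_nonzero_region = two_stage_estimate(points, values, (5.5, 5.5))
print(inside_zero_region, inside_nonzero_region)
```

The point of the split is that the zeros only influence where the surface is non-zero, not the magnitude predicted inside the non-zero regions.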

Depending on the size of your problem, I would highly recommend investigating Zero-Inflated Poisson (ZIP) regression as an alternative to kriging. A good ZIP reference for spatial models is Agarwal et al., 2003 (http://www.springerlink.com/content/w355777u4xk83426/).
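
For anyone unfamiliar with ZIP, the following is a rough, non-spatial sketch of the idea: an intercept-only model with no covariates, fitted with a basic EM loop. It is not the spatial model from the reference above, and the counts are invented for illustration. The model mixes a "structural zero" with probability `pi` and a Poisson count with mean `lam` otherwise.

```python
import math

def zip_loglik(counts, pi, lam):
    """Log-likelihood of a zero-inflated Poisson; pi = 0 gives plain Poisson."""
    ll = 0.0
    for y in counts:
        if y == 0:
            # A zero is either structural or a Poisson zero.
            ll += math.log(pi + (1 - pi) * math.exp(-lam))
        else:
            ll += math.log(1 - pi) - lam + y * math.log(lam) - math.lgamma(y + 1)
    return ll

def fit_zip(counts, iterations=200):
    """Intercept-only ZIP fit by EM (fixed iteration count, no convergence check)."""
    n = len(counts)
    total = sum(counts)
    n_zeros = sum(1 for y in counts if y == 0)
    pi, lam = 0.5, max(total / n, 1e-6)
    for _ in range(iterations):
        # E-step: probability that an observed zero is a structural zero.
        p_struct = pi / (pi + (1 - pi) * math.exp(-lam))
        z_sum = p_struct * n_zeros
        # M-step: update the mixing weight and the Poisson mean.
        pi = z_sum / n
        lam = total / (n - z_sum)
    return pi, lam

# Hypothetical counts: 80 zeros and 20 positive counts.
counts = [0] * 80 + [1] * 5 + [2] * 8 + [3] * 5 + [4] * 2

pi_hat, lam_hat = fit_zip(counts)
zip_ll = zip_loglik(counts, pi_hat, lam_hat)
pois_ll = zip_loglik(counts, 0.0, sum(counts) / len(counts))  # plain Poisson fit
print(pi_hat, lam_hat, zip_ll, pois_ll)
```

Comparing the two log-likelihoods gives a quick sense of whether the extra zero component is doing any work for a given dataset.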
0 Kudos
BrookeHodge
New Contributor III
Eric, I emailed you my data. Thank you for taking a look at this!

jevans02, thanks so much for your reply. I have heard of a double kriging process, but haven't delved too deeply into it. I will look at it more closely now; thanks for attaching the paper. I have been trying to figure out the zero-inflated negative binomial mixture model sometimes used to deal with this, but I will also look at the ZIP model you referenced. And thanks again for sending a reference! I appreciate it!
0 Kudos
NicoleSkelton
New Contributor

Hello wikgrebc, I am using a very similar dataset and wondered if you can recall (4 years after you posted!) and share any insights. I have >9000 observation points, with 1092 sightings, so mostly zeros. From my literature search, I've narrowed it down to trying Empirical Bayesian Kriging in Geostatistical Analyst (because I don't have time to learn R, which is often mentioned in the literature for kriging with a Poisson distribution). I've built some kriging models, but the standardized mean and RMS report NaN. I've just tried removing all data points with a zero and building from that dataset. It helped with the NaN problem, but I'm not sure this is statistically justifiable. Any insights you might have are appreciated. These are sightings of a rare bird species, where a positive sighting is a count between 1 and 4, because the data have been rolled into centroids of a grid applied to the study area.

0 Kudos
EricKrause
Esri Regular Contributor

I don't have any good solutions for the zero-inflation problem (other than, potentially, the one posted by jevans), but I can shed some light on the NaN issue. Empirical Bayesian Kriging computes local models based on subsets of the data. However, if a subset is composed entirely of a constant value (in this case, zero), the kriging equations cannot actually be solved (the algorithm will attempt to invert something that cannot be inverted). We put in a special exception when this happens: the subset will predict a constant value everywhere in the subset, with a standard error value of zero. This alleviates the problem of a single subset preventing you from getting any output at all, but it has the side effect of making it impossible to calculate many cross-validation statistics. This is why many of them will report NaN (Not a Number) when this happens.
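
As a rough illustration of why a constant subset breaks the equations, here is a small sketch, not the EBK implementation. It assumes an exponential semivariogram whose sill is crudely set to the sample variance (EBK estimates its semivariograms differently), a hypothetical four-point subset, and a helper that returns NaN for a zero standard error to mimic a 0/0 division.

```python
import math

def fitted_semivariogram(values, range_=1.0):
    """Exponential model with sill = sample variance (a crude assumption)."""
    mean = sum(values) / len(values)
    sill = sum((v - mean) ** 2 for v in values) / len(values)
    return lambda h: sill * (1.0 - math.exp(-h / range_))

def ok_matrix(points, gamma):
    """Ordinary kriging matrix: semivariances bordered by the unbiasedness row."""
    rows = [[gamma(math.hypot(px - qx, py - qy)) for (qx, qy) in points] + [1.0]
            for (px, py) in points]
    rows.append([1.0] * len(points) + [0.0])
    return rows

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0  # no usable pivot: the matrix is singular
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return det

def standardized_error(error, std_error):
    # 0/0 is NaN in IEEE arithmetic; Python raises instead, so return NaN explicitly.
    return math.nan if std_error == 0 else error / std_error

points = [(0, 0), (1, 0), (0, 1), (1, 1)]  # hypothetical subset locations
det_constant = determinant(ok_matrix(points, fitted_semivariogram([0.0, 0.0, 0.0, 0.0])))
det_mixed = determinant(ok_matrix(points, fitted_semivariogram([0.0, 0.0, 1.0, 3.0])))
print(det_constant, det_mixed, standardized_error(0.0, 0.0))
```

With all-zero values the sill is zero, every semivariance is zero, and the first four rows of the matrix are identical, so it cannot be inverted. A standardized error divides the prediction error by the standard error, so a standard error of zero leaves that statistic undefined.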

0 Kudos
NicoleSkelton
New Contributor

Thank you EKrause-esristaff for this explanation. It helps. I'm not sure if this is too large a question to pose, but I will ask anyway. Would you say that using Empirical Bayesian Kriging on zero-inflated data creates an invalid model? I have downloaded the articles suggested above and will continue to investigate that route, but for my purposes I was optimistically hoping EBK might be justifiable.

0 Kudos
EricKrause
Esri Regular Contributor

Doing any kind of kriging (other than fancy things like double kriging, as jevans noted) is going to be very questionable with high degrees of zero-inflation.  Technically, zero-inflation does not mean that using kriging is "invalid," but the issue is pretty complicated.  In theory, kriging can be performed on just about any data distribution, even ones with zero inflation.  However, to do this correctly, you need to be able to accurately estimate the semivariogram with the typical covariance functions that are provided.  In practice, fitting an accurate semivariogram for strange data distributions is not something that I would recommend because I don't know of a reliable methodology to do it.  And the covariance models that are supported in Geostatistical Analyst are very, very unlikely to be able to estimate an accurate semivariogram for data with large zero-inflation.

0 Kudos