The buffer distances from all available records are included as covariates in the model, along with other environmental predictors.

This genius idea introduces spatial autocorrelation into the model in a very simple and natural way!
Also, Random Forest, as is, does not really "understand" spatial data, but by providing it with distance maps as covariates, we are teaching it Tobler's First Law of Geography:

"Everything is related to everything else. But near things are more related than distant things."
The type of model proposed by @tom_hengl et al. would look like this:

y ~ e1 + ... + ep + d1 + ... + dn

where e1 to ep are the p environmental predictors, and d1 to dn are the distances to each of the n pairs of coordinates in the input data.
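To make the structure of that model concrete, here is a minimal sketch in plain Python of how the predictor table could be assembled. The coordinates and environmental values are made up, and the helper name `distance_covariates` is hypothetical (it does not come from the RFsp paper or any package). Each record gets its environmental columns followed by one column per training record holding the Euclidean distance to it:

```python
import math

def distance_covariates(points, reference):
    """One row per point, one column per reference record (Euclidean distance)."""
    return [
        [math.hypot(px - rx, py - ry) for rx, ry in reference]
        for px, py in points
    ]

# Made-up training records: coordinates plus a single environmental predictor.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
env = [[12.5], [14.1], [15.0]]

# Predictor table: environmental columns first, then the distance columns.
dist = distance_covariates(coords, coords)
X = [e + d for e, d in zip(env, dist)]

# A new location is described by its distances to the same training records.
new_dist = distance_covariates([(3.0, 0.0)], coords)
```

A table like `X` could then be handed to whatever Random Forest implementation you prefer. Note that predictions at new locations need distances to the original training records, not to each other.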
They compare this type of model (RFsp) with RF models based on environmental variables alone, and with ordinary kriging and regression kriging across a broad range of different spatial datasets...
...to find that RFsp predictions are similar to those produced by the different kriging methods, but with the advantage of not requiring defining and fitting variograms, easily including spatial autocorrelation, and providing robust variable importance scores.
The only obvious drawback of RFsp comes when the number of input records is high (>1000). In such a case, generating separate distance maps for each record might be computationally intensive, depending on the resources available on the computer.
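A quick back-of-the-envelope sketch of why this gets heavy: one distance map per record means storing as many rasters as there are records. The raster size and the 4 bytes per cell below are assumptions chosen for illustration, not figures from the paper.

```python
def distance_maps_bytes(n_records, n_rows, n_cols, bytes_per_cell=4):
    """Uncompressed size of one distance raster per record."""
    return n_records * n_rows * n_cols * bytes_per_cell

# Hypothetical case: 1000 records over a 1000 x 1000 cell study area.
size = distance_maps_bytes(1000, 1000, 1000)
print(f"{size / 1e9:.1f} GB")  # 4.0 GB
```

The cost grows linearly with the number of records, so doubling the records doubles the storage (and the number of predictors the model has to consider).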
I have recently been applying this method to model the distribution of a marine species in a collaboration with my dear friend, Laura Martín García, from the Spanish Institute of Oceanography.
Without unveiling too much about this ongoing collaboration, I can say that after evaluating models based on environmental predictors, distance predictors, and environment+distances with true absences (yay!) and spatially independent folds...
...I can show you that models based on both types of predictor show higher "true" AUC values (as opposed to nonsense AUC values resulting from pseudo-absence data) than distance-based or environment-based models.
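For readers less familiar with AUC, here is a minimal sketch of how it can be computed from presences and true absences: it is the proportion of presence-absence pairs in which the presence gets the higher predicted score (ties count as half). The scores are invented for illustration and have nothing to do with the results of the collaboration.

```python
def auc(scores_presence, scores_absence):
    """Share of presence/absence pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in scores_presence:
        for a in scores_absence:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(scores_presence) * len(scores_absence))

# Invented model predictions for held-out presences and true absences.
presence_scores = [0.9, 0.8, 0.4]
absence_scores = [0.7, 0.3, 0.2]
value = auc(presence_scores, absence_scores)  # 8 of 9 pairs ranked correctly
```

With pseudo-absences the "absence" scores come from locations where the species may well be present, which is why the resulting AUC is hard to interpret.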
The take-away message from this thread is "if you have some spatial data to model, RFsp might be worth the effort", and don't sleep on its applications to species distribution modeling!

A pretty good R tutorial on this method can be found here: https://github.com/thengl/GeoMLA 
To end, I want to say thanks to @tom_hengl for this neat idea, and especially for his site http://spatial-analyst.net . There I got valuable knowledge on GIS and spatial analysis, and lots of inspiration during the earlier stages of my PhD, eons ago.

Have a great day folks!
You can follow @BlasBenito.