Random Forest is a pretty popular modeling method thanks to its ability to handle high-dimensional data without strict distributional assumptions, and its flexibility to model both discrete and continuous response variables. https://medium.com/rants-on-machine-learning/the-unreasonable-effectiveness-of-random-forests-f33c3ce28883
A couple of years ago @tom_hengl et al. published the paper "Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables" https://peerj.com/articles/5518/
The paper presents a simple idea to boost the ability of RF to predict spatial data.
The trick: the buffer distances to all available records are included as covariates in the model, along with the environmental predictors.
This genius idea introduces spatial autocorrelation into the model in a very simple and natural way!
Also, Random Forest on its own does not really "understand" spatial data, but by providing it with distance maps as covariates we are, in a way, teaching it Tobler's First Law of Geography:
"Everything is related to everything else. But near things are more related than distant things."
"Everything is related to everything else. But near things are more related than distant things."
The type of model proposed by @tom_hengl et al. would look like this:
y ~ e1 + ... + em + d1 + ... + dn
where e1 to em are the environmental predictors, and d1 to dn are the buffer distances to each of the n observation locations (pairs of coordinates) in the training data.
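For anyone who wants to see the idea in code, here is a minimal sketch of an RFsp-style design matrix in Python with scikit-learn (the paper's own implementation is in R; the arrays `coords`, `env` and `y` below are simulated placeholders, not data from the paper):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Simulated training data: 200 records with x/y coordinates,
# three environmental covariates and a continuous response.
n = 200
coords = rng.uniform(0, 100, size=(n, 2))   # observation locations
env = rng.normal(size=(n, 3))               # e1 ... e3
y = env[:, 0] + 0.01 * coords.sum(axis=1) + rng.normal(scale=0.1, size=n)

# d1 ... dn: distance from every record to every observation location.
dist = cdist(coords, coords)                # shape (n, n), one column per record

# RFsp-style design matrix: environmental predictors + distance covariates.
X = np.hstack([env, dist])
rf = RandomForestRegressor(n_estimators=500, random_state=42).fit(X, y)

# To predict at new sites, compute their distances to the SAME training
# locations and stack them with the environmental values at those sites.
new_coords = rng.uniform(0, 100, size=(10, 2))
new_env = rng.normal(size=(10, 3))
pred = rf.predict(np.hstack([new_env, cdist(new_coords, coords)]))
```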
They compare this type of model (RFsp) with RF models based on environmental variables alone, and with ordinary kriging and regression kriging across a broad range of different spatial datasets...
...to find that RFsp predictions are similar to those produced by the kriging methods, but with the advantage of not requiring variograms to be defined and fitted, of including spatial autocorrelation in a straightforward way, and of providing robust variable importance scores.
The only obvious drawback of RFsp comes when the number of input records is high (>1000). In that case, generating a separate distance map for each record can become computationally demanding, depending on the available computing resources.
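To put a rough number on it (the figures below are arbitrary, just to illustrate how the cost scales with the number of records and the grid size):

```python
# Back-of-the-envelope memory cost of the distance covariates:
# one distance map per training record, one value per prediction-grid cell.
n_records = 1000            # training points -> 1000 distance maps
n_cells = 1000 * 1000       # cells in the prediction grid
bytes_per_value = 8         # float64
print(f"~{n_records * n_cells * bytes_per_value / 1e9:.0f} GB of distance values")
# -> roughly 8 GB before you even get to the environmental layers
```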
I have recently been applying this method to model the distribution of a marine species in a collaboration with my dear friend, Laura Martín García, from the Spanish Institute of Oceanography.
Without revealing too much about this ongoing collaboration, I can say that after evaluating models based on environmental predictors, distance predictors, and environment+distances, with true absences (yay!) and spatially independent folds...
...I can show you that models based on both types of predictor reach higher "true" AUC values (as opposed to the meaningless AUC values you get from pseudo-absence data) than the distance-based or environment-based models.
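If you want to reproduce that kind of evaluation on your own data, here is a minimal, self-contained sketch of spatially blocked cross-validation with AUC (this is not the exact workflow of our collaboration; the data are simulated and the block size is arbitrary):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 400
coords = rng.uniform(0, 100, size=(n, 2))
env = rng.normal(size=(n, 3))
# Simulated presence/absence driven by the first environmental predictor.
pa = (env[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Environment + distance covariates (a stricter setup would recompute the
# distance columns using only the training locations of each fold).
X = np.hstack([env, cdist(coords, coords)])

# Coarse 25 x 25 spatial blocks used as CV groups, so test records are
# geographically separated from the records used for training.
block = (np.floor(coords[:, 0] / 25).astype(int) * 1000
         + np.floor(coords[:, 1] / 25).astype(int))

aucs = []
for tr, te in GroupKFold(n_splits=5).split(X, pa, groups=block):
    rf = RandomForestClassifier(n_estimators=500, random_state=1)
    rf.fit(X[tr], pa[tr])
    prob = rf.predict_proba(X[te])[:, 1]
    aucs.append(roc_auc_score(pa[te], prob))  # assumes both classes in each fold

print("Mean spatially blocked AUC:", round(float(np.mean(aucs)), 3))
```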
The take-away message from this thread is "if you have some spatial data to model, RFsp might be worth the effort", and don't sleep on its applications to species distribution modeling!
A pretty good R tutorial on this method can be found here: https://github.com/thengl/GeoMLA
To end, I want to say thanks to @tom_hengl for this neat idea, and especially for his site http://spatial-analyst.net. That site gave me valuable knowledge on GIS and spatial analysis, and lots of inspiration, during the early stages of my PhD, eons ago.
Have a great day folks!