I discussed some important aspects to take into account when validating land cover maps in a previous post. In that same post I insisted on the fact that machine learning pipeline building using a blind optimisation of accuracy metrics can lead to unrealistic expectations about land cover maps produced using these approaches.
I cited as an example the TiSeLaC challenge, where 2 of the participating teams achieved FScores above 99%, which is an accuracy higher than the one we can expect from the reference data used for the validation.
I assumed that this unrealistic performances where due to over-fitting and the use of a validation set too similar to the training set. I have recently asked the challenge organisers about the procedure for splitting the reference data into train and test sets and they confirmed that the split was done at the pixel level and not at the polygon level. Therefore, nearly identical pixels coming from the same polygon could be used for training and validation.
Therefore, looking at the challenge results, one could expect that all the teams would have got similar high performances. Since this was not the case, I asked for references to the methods used. Two of the methods are published. I am assuming that these are the 2 winning methods.
One of the methods uses spatial nearest neighbour classification to decide the labels, that is, the class for a pixel is decided using the labels of the nearest pixels of the training set. Here, "nearest" means the closest in the image using an Euclidean distance on the spatial coordinates of the pixel. Indeed, the pixel coordinates where provided as a separate record, but I don't think they were intended to be used as features. And, yes, the best results are obtained if only pixel coordinates are used (no reflectances, no NDVI, nothing!). And 1 single neighbour works best than 2-NN or 10-NN.
This shows that indeed, neighbouring pixels were present in the training and test sets, and the fewer the information used (just the closest pixel) the better the result obtained.
To quickly check this, I ran a simple, out-of-the-box, Random Forest classifier using the coordinates as features and got 97.90% accuracy on the test set, while using the image features gives about 90%.
The second of the 2 winning methods (which is actually the first with an FScore of 99.29 while the method above obtains 99.03), uses 3 deep neural networks, 2 of which use temporal convolutions for each pixel. The third network is a multi-layer perceptron were the input features are statistics computed on all the pixels found in a spatial neighbourhood of the pixel to be classified. Different sizes of neighbourhoods between 1 and 17 are used. This is much more complex than using only the label of the closest pixel, but actually, contains the same information. Adding the information of the 2 first networks may allow to correctly classify the few pixels that the previous method got wrong. The performance difference between the 2 methods is less than 0.3%, which may probably fall within typical confidence intervals.
What can we learn from these results?
First of all, blind metric optimisation without domain knowledge can produce misleading results. Any remote sensing scientist knows that pixel coordinates only are not good predictors for producing a map. Otherwise, one could just spatially interpolate the reference data. Even when applying krigging, other variables are usually used!
Second, when organising this kind of contest, realistic data sets have to be used. The split between training and validation has to follow strict rules in order to avoid neighbouring pixels appearing in both data sets.
Third, map validation has to have a spatial component: are the shapes of the objects preserved, is there image noise in the generated map, etc. This is a tricky question which needs either to have dense reference data in some places or having specific metrics which are able to measure distortions without reference data. Obtaining dense reference data is very costly to and can even be impossible if some of the classes can't be identified by image interpretation (we are not tagging images of cats or road signs!). Developing specific metrics for spatial quality which don't need reference data is an open problem. Some solutions have been developed for the assessment of pan-sharpening algorithms, but the problem is rather different.
Finally, I hope that this critical analysis of the TiSeLaC contest will be useful for future contests, because I think that they may be very useful to get together the remote sensing and the machine learning communities.