All that has been said above does not consider the quality of the reference data. At CESBIO, we have learned many things over the years about the different ways in which the quality of reference data has an impact, both at the classifier training and at the map validation steps. We have people here who collect data in the field every year on hundreds of agricultural plots. We also have some experience using off-the-shelf reference data. The quality of the results is much better when we use the data collected by our colleagues, and we have a rather good understanding of what happens during training and validation. Ch. Pelletier recently defended her PhD, and most of her work dealt with this issue. For instance, she analysed the impact of mislabelled reference data on classifier training and showed that Random Forests are much more robust than SVM. She also developed techniques for detecting errors in the reference data.
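The kind of experiment behind that finding can be sketched in a few lines. This is a toy version with synthetic data and default scikit-learn models, not Ch. Pelletier's actual protocol (the noise rate, data set and classifier settings here are illustrative assumptions): flip a fraction of the training labels and see how each classifier degrades on a clean test set.

```python
# Toy label-noise experiment (illustrative, not the protocol from the PhD):
# train on labels with 30% symmetric noise, evaluate on clean labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Flip 30% of the training labels to a different, random class.
noisy = y_tr.copy()
idx = rng.choice(len(noisy), size=int(0.3 * len(noisy)), replace=False)
noisy[idx] = (noisy[idx] + rng.integers(1, 5, size=len(idx))) % 5

results = {}
for name, clf in [("RF", RandomForestClassifier(random_state=0)),
                  ("SVM", SVC(kernel="rbf"))]:
    results[name] = {
        "clean": accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te)),
        "noisy": accuracy_score(y_te, clf.fit(X_tr, noisy).predict(X_te)),
    }
    print(name, results[name])
```

Comparing the "clean" and "noisy" accuracies for each classifier gives a first feel for robustness to mislabelled training data; the real study works on satellite image time series, of course, not on a synthetic blob data set.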
We also use simple ways to clean the reference data. For instance, when using Corine Land Cover polygons, which have a minimum mapping unit (MMU) of 25 hectares, we use information coming from other databases, as described from slide 34 onwards in this presentation. An illustration of the results is shown below.
There can be many reasons for label noise in the reference data, but the two main ones we face are the MMU and the changes that have occurred since the reference data was collected.
For our 2016 map, we used Corine Land Cover 2012, and therefore we may assume that more than 5% of the samples are wrong because of land cover changes. Therefore, when validating with these data, if for some classes we get accuracies higher than 95%, we must be doing something wrong. If we add the MMU issue, then for the classes for which we don't perform the cleansing procedure illustrated above, accuracies higher than 90% should trigger an alarm.
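The arithmetic behind the 95% ceiling is easy to check numerically. In this sketch (the 5% noise rate and the number of classes are just the illustrative assumptions from the paragraph above), even a map with zero real errors cannot score above 1 minus the validation noise rate:

```python
# Back-of-the-envelope check: with 5% outdated validation labels, even a
# *perfect* map cannot measure above 95% accuracy.
import numpy as np

rng = np.random.default_rng(42)
n, n_classes, noise_rate = 10_000, 8, 0.05

true_labels = rng.integers(0, n_classes, size=n)

# Build a validation set where exactly 5% of the labels are wrong
# (flipped to a different class, as after a land cover change).
validation = true_labels.copy()
flipped = rng.choice(n, size=int(noise_rate * n), replace=False)
validation[flipped] = (validation[flipped]
                       + rng.integers(1, n_classes, size=len(flipped))) % n_classes

perfect_map = true_labels                       # a classifier with zero real errors
measured = np.mean(perfect_map == validation)   # what the validation step reports
print(measured)                                 # == 1 - noise_rate = 0.95
```

So a reported accuracy of 95% is the best a truly perfect classifier could hope for against these labels, which is exactly why higher figures should raise suspicion.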
Our ML friends like to play with data sets to improve their algorithms. Making domain-specific data available is a very good idea, since ML folks get something to compete on (this is why they work for free for Kaggle!) and they provide us with state-of-the-art approaches to choose from. This is the idea of D. Ienco and R. Gaetano with the TiSeLaC contest: they used iota2 to produce gapfilled Landsat image time series and reference data like the ones we use at CESBIO to produce our maps (a mix of Corine Land Cover and the French Land Parcel Information System, RPG), and provided something the ML community can easily use: CSV files with labelled pixels for training and validation.
The test site is Reunion Island, which is more difficult to deal with than metropolitan France, mainly because of the cloud cover. Even with the impressive (ahem…) temporal gapfilling from CESBIO that they used, the task is difficult. Add to that the quality of the reference data set, which is based on CLC 2012 for a 2014 image time series, and the result is a daunting task.
Even with all these difficulties, several teams achieved F-scores higher than 94%, and two of them were above 99%. It seems that Deep Learning can generalise better than other approaches, and I guess that the winners used this kind of technique, so I will assume that these algorithms achieve perfect learning and generalisation. In that case, the map they produce is perfect. The issue is that the data used for validation is not perfect, which means that an algorithm which achieves nearly 100% accuracy not only makes the same amount of errors as the validation data, but also makes them on exactly the same samples!
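This paradox can be made concrete with a small simulation (again with illustrative numbers, assuming 5% wrong validation labels): a perfect map tops out at 95% against the noisy labels, so the only way to score near 100% is to reproduce the label errors themselves.

```python
# Against 5%-noisy validation labels, a perfect map measures 95%; a model
# that reaches ~100% must be wrong on exactly the mislabelled samples.
import numpy as np

rng = np.random.default_rng(1)
n, n_classes = 10_000, 8
true_labels = rng.integers(0, n_classes, size=n)

validation = true_labels.copy()
flipped = rng.choice(n, size=n // 20, replace=False)   # 5% wrong labels
validation[flipped] = (validation[flipped]
                       + rng.integers(1, n_classes, size=len(flipped))) % n_classes

perfect = true_labels        # a map with zero real errors
mimic = validation.copy()    # a model that has "learned" the label noise too

print(np.mean(perfect == validation))  # 0.95: the perfect map looks imperfect
print(np.mean(mimic == validation))    # 1.0: the mimic's "errors" match the noise
print(np.mean(mimic == true_labels))   # 0.95: but its true accuracy is only 95%
```

In other words, a near-100% score against imperfect labels is evidence that the model agrees with the label noise, not that the map is better than the reference data.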
I don't have the full details on how the data was generated and, from the contest web site, I can't tell how the different algorithms work, but I can speculate on how an algorithm can achieve 99% accuracy in this case. One reason is over-fitting, of course. If the validation and training sets are too similar, the validation does not measure generalisation capabilities, but rather gives the same results as the training set. Several years ago, when we were still working on small areas, we saw this kind of behaviour due to a correlation between the spatial distribution of the samples and local cloud patterns: although training and test pixels came from different polygons, for some classes they were close to each other and were cloudy on the same dates, so the classifier was learning the gapfilling artifacts rather than the class behaviour. We made this mistake because we were not looking at the maps, but only optimising the accuracy metrics. Once we looked at the classified images, we understood the issue.