Machine learning benchmarking for land cover map production

Land cover map validation is a complex task. If you read French, you can check this post by Vincent Thierion which shows how the 2016 LC map of France produced by CESBIO stands with respect to data sources independent from those used for its production. But this is only one aspect of the validation. A land cover map is a map, and therefore, there are other issues than checking if individual points belong to the correct class. By the way, being sure that the correct class is known, is not so easy neither.


In this epoch of machine learning hype 1, it is easy to fall in the trap of thinking that optimising a single metric accounts for all issues in map validation. Typical approaches used in machine learning contests are far from enough for this complex task. Let's have a look at how we proceed at CESBIO when we assess the quality of a LC map produced by classification.


Supervised classification, training, testing, etc.

The iota2 processing chain is highly configurable in terms of how the images are pre-processed, how the reference data is prepared, how the classifiers are paremeterised, etc. We also continuously add new approaches. During this development work, we need to assess whether a change to the workflow performs better than the previous approaches. In order to do this, we use standard metrics for classification derived from the confusion matrix (Overall Accuracy, κ coefficient, F-Score). The confusion matrix is of course computed using samples which are not used for the training, but we go a little further than that by splitting the train and test sets at the polygon level. Indeed, our reference data is made of polygons which correspond to agricultural plots, forests, urban settlements, etc. Since images have a strong local correlation, pixels belonging to the same polygon have a high likelihood of being very similar. Therefore, allowing a polygon to provide pixels for both the training and test sets yields optimistic performance estimations.


Most of our tests are performed over very large areas (at least 25% of metropolitan France, often more than that), which means that, using reference data from Corine Land Cover, we have many more samples than we can deal with. Even in this situation, we perform several runs of training and testing by drawing different polygons for each run, which allows us to estimate confidence intervals for all our metrics and therefore assess the significance of the differences in performance between different parameter settings.


All this is good and well, but this is not enough for assessing the quality of the results of a particular algorithm.


Beyond point-wise validation

The data we feed to the classifier are images and they are pre-processed so that application agnostic machine learning approaches can deal with that. In iota2, we perform eco-climatic stratification, which can introduce artifacts around strata boundaries. We also perform temporal gapfilling followed by a temporal resampling of all data so that all the pixels have the same number of features regardless of the number of available clear acquisitions. After that, sometimes we compute contextual features which take into account the neighbourhood of the pixels, in Convolutional Neural Networks, a patch size is defined, etc.


All these pre-processing steps have an influence on the final result, but most of the time, their effect can't be observed on the global statistics computed from the confusion matrix. For instance, contextual features may produce a smeared out image, but since most of the validation pixels are inside polygons and not on their edges, the affected pixels will not be used for the validation. In our case, the reference data polygons are eroded in order to compensate for possible misregistrations between the reference data and the images. Therefore, we have no pixels on the boundaries of the objects.


In our paper describing the iota2 methodology, we presented some analysis of the spatial artifacts caused by image tiling and stratification, but we lack a metric for that. The same happens when using contextual features or CNNs. The global point-wise metrics increase when the size of the neighbourhoods increase, but the maps produced are not acceptable from the user point of view. The 2 images below (produced by D. Derksen, a CESBIO PhD candidate) illustrate this kind of issues. The image on the right has higher values for the classical point wise metrics (OA, κ, etc), but the lack of spatial accuracy is unacceptable for most users.


Even if we had an exhaustive reference data set (labels for all the pixels), the number of pixels affected by the over-smoothing are a small percentage of the whole image and they would just weight a little in the global metrics. We are working on the development of quantitative tools to measure this effects, but we don't have a satisfactory solution yet.


How good is your reference data?

All what has been said above does not consider the quality of the reference data. At CESBIO, we have learned many things over the years about the different kinds of impacts of the quality of reference data, both in the classifier training and the map validation step. We have people here who collect data on the field every year on hundreds of agricultural plots. We have also a bit of experience using off-the-shelf reference data. The quality of the results is much better when we use the data collected by our colleagues and we have a rather good understanding on what happens during training and validation. Ch. Pelletier recently defended her PhD and most of her work dealt with this issue. For instance, she analysed the impact of mislabelled reference data on the classifier training and showed that Random Forests are much more robust than SVM. She also developed techniques for detecting errors in the reference.

We also use simple ways to clean the reference data. For instance, when using Corine Land Cover polygons which have a minimum mapping unit (MMU) of 25 hectares, we use information coming from other data bases, as described from slide 34 in this presentation. An illustration of the results is shown below.

The reasons for having label noise in the reference data can be many, but the 2 main we face are: the MMU and the changes occurred since the collection of the reference data.


For our 2016 map, we used Corine Land Cover 2012, and therefore, we may assume that more than 5% of the samples are wrong because of the changes. Therefore, when validating with this data, if for some classes we have accuracies higher than 95%, we must be doing something wrong. If we add the MMU issue to that, for the classes for which we don't perform the cleansing procedure illustrated above, accuracies higher than 90% should trigger an alarm.


Our ML friends like to play with data sets to improve their algorithms. Making available domain specific data is a very good idea, since ML folks have something to compete (this is why the work for free for Kaggle!) and they provide us with state of the art approaches for us to choose from. This is the idea of D. Ienco and R. Gaetano with the TiSeLaC contest: they used iota2 to produce gapfilled Landsat image time series and reference data as the ones we use at CESBIO to produce our maps (a mix of Corine Land Cover and the French Land Parcel Information System, RPG) and provided something for the ML community to easily use: CSV files with labelled pixels for training and validation.


The test site is the Reunion Island, which is more difficult to deal with than metropolitan France mainly due to the cloud cover. Even with the impressive (ahem …) temporal gapfilling from CESBIO that they used, the task is difficult. Add to that the quality of the reference data set which is based on CLC 2012 for a 2014 image time series, and the result is a daunting task.


Even with all these difficulties, several teams achieved FScores higher than 94% and 2 of them were above 99%. It seems that Deep Learning can generalise better than other approaches, and I guess that the winners use these kind of techniques, so I will assume that these algorithms achieve perfect learning and generalisation. In this case, the map they produce, is perfect. The issue is that the data used for validation is not perfect, which means that an algorithm which achieves nearly 100% accuracy, not only has the same amount of error than the validation data, but also that the errors are exactly on the same samples!


I don't have the full details on how the data was generated and, from the contest web site, I can't know how the different algorithms work 2, but I can speculate on how an algorithm can achieve 99% accuracy in this case. One reason is over-fitting3, of course. If the validation and training sets are too similar, the validation does not measure generalisation capabilities, but it rather gives the same results as the training set. Several years ago, when we were still working on small areas, we had this kind of behaviour due to a correlation between the spatial distribution of the samples and local cloud patterns: although training and test pixels came from different polygons, for some classes, they were close to each other and were cloudy on the same dates and the classifier was learning the gapfilling artifacts rather than the class behaviour. We made this mistake because we were not looking at the maps, but only optimising the accuracy metrics. Once we looked at the classified images, we understood the issue.



In this era of kaggleification of data analysis, we must be careful and make sure that the metrics we optimise are not too simplistic. It is not an easy task, and for some of the problems we address, we don't have the perfect reference data. In other situations, we don't even have the metrics to measure the quality.

The solutions we use to solve mapping problems need an additional validation beyond the standard machine learning metrics.



Please correct me if my assumptions are wrong!


Deep neural networks are able to fit random labels by memorising the complete data set.

Canigou 3D

Lo Canigó és una magnòlia immensa
que en un rebrot del Pirineu se bada
- Jacint Verdaguer i Santaló


The Canigó is an immense magnolia
that blooms in an offshoot of the Pyrenees


3D view of the Canigou on 19-Dec-2017 (with a fancy tiltshift effect)

Continue reading

Premières validations de la carte d'occupation du sol OSO

En 2017, le Centre d'Expertise Scientifique OSO (Occupation du SOl) par l'intermédiaire du CESBIO a produit une carte d'occupation du sol de l'année 2016 à l'échelle du territoire métropolitain français et corse. On l'appelle la carte d'occupation du sol OSO ! Cette carte est le résultat de traitements automatiques massifs de séries temporelles d'images satellites optiques Sentinel-2. Comme les images Sentinel-2, cette carte a une résolution spatiale de 10 m correspondant à une unité minimale de collecte (UMC) de 0.01 ha. L'occupation du sol est décrite grâce à 8 classes au premier niveau et 17 classes à second niveau de détail, définies en fonction des potentialités de détection de l'imagerie Sentinel-2 et des besoins exprimés par des utilisateurs finaux. Ces classes couvrent les grands thèmes d'occupation du sol (surfaces artificialisées, agricoles et semi-naturelles).

Son principal avantage en comparaison avec d'autres cartes d'occupation du sol existantes, (loin de nous l'idée de les critiquer) est son exhaustivité territoriale et surtout sa fraîcheur ! Disposer d'une carte d'occupation du sol exhaustive sur l'ensemble du territoire national au premier trimestre de l'année suivante, c'est ce qu'OSO vous propose !

Quelle richesse thématique ?

Les classes détectées par télédétection sont celles du second niveau, celles du premier niveau sont obtenues par agrégation des classes du second niveau :

  • Culture annuelle
    • Culture d'hiver
    • Culture d'été
  • Culture pérenne
    • Prairie
    • Verger
    • Vigne
  • Forêt
    • Forêt de feuillus
    • Forêt de conifères
  • Formation naturelle basse
    • Pelouse
    • Lande ligneuse
  • Urbain
    • Urbain dense
    • Urbain diffus
    • Zone industrielle et commerciale
    • Surface route / asphalte
  • Surface minérale
    • Surfaces minérales
    • Plages et dunes
  • Eau
    • Eau
  • Glaciers et neiges éternelles
    • Glaciers et neiges éternelles

Avec quelle qualité ?

Valider une carte d'occupation n'est pas une procédure simple. Il s'agit de s'interroger sur :

  • la spécification des classes
  • l'échelle de validation
  • le jeu de données de validation

Dans tous les cas, il est rarement possible d'établir une validation exhaustive sur l'ensemble d'un territoire. Classiquement, une validation statistique permet d'appréhender partiellement la précision de la cartographie obtenue, et ne permet pas d'identifier l'ensemble des confusions thématiques et des erreurs géométriques de classification.

La suite de cet article tente de qualifier la précision de la carte d'occupation du sol OSO de 2016 grâce à des jeux de données de partenaires du CES OSO. Une première validation, intrinsèque au processus de classification, a été effectuée. Les résultats statistiques sont visibles ici.

Le jeu de données d'échantillons de la couverture de surface a été produit grâce à des bases de données nationales telles que la BD Topo, le Registre Parcellaire Graphique (RPG) et Corine Land Cover. 70% de ces échantillons ont été utilisés pour l'apprentissage et 30% pour la validation a posteriori visible sur la figure ci-dessous. Cette validation, bien que pertinente, s'appuie sur des échantillons dont la génération suit la même procédure que les échantillons d'apprentissage, biaisant quelque peu l'indépendance de la validation.

Validation de la carte d'occupation du sol OSO avec 30% des échantillons extraits des 3 jeux de données utilisés lors de la classification - BD Topo, Registre Parcellaire Graphique et Corine Land Cover)

De plus, il nous était impossible de valider les deux cultures annuelles de la classification. En effet, l'indisponibilité du RPG pour l'année 2016 et 2015 (toujours indisponible le jour de l'écriture de cet article), nous a amené à développer une méthode d'apprentissage basée sur le principe de l'adaptation de domaine utilisant des échantillons du RPG 2014. Cette méthode est très bien expliquée ici. Quoiqu'il en soit, il nous était impossible de valider la classification des cultures d'été et d'hiver de 2016, seuls des échantillons issus du terrain nous le permettait, en voilà la preuve !

Continue reading

Venµs à l'honneur en Haute-Garonne et en Ariège en 2018

Le satellite Franco-Israelien Venµs, attendu depuis si longtemps, a été lancé le 2 août 2017. 110 sites dans le monde vont être observés en 2018 et 2019 à 10 m de résolution et avec 12 bandes spectrales. Alors que la plupart des sites ne correspondent qu’à l’emprise d’une scène Venµs (27 à 32 km de large (est-ouest) * 27 km nord-sud), le site ‘Toulousain’ couvre un transect de 168 km du nord de la Haute-Garonne (Grenade) jusqu’en Espagne, en passant par les Pyrénées ariégeoises (dont le Mont Vallier), prolongé par un 2ème transect de 157 km de long en Espagne jusqu’à l’embouchure de l’Ebre (carte en ligne).


L’intérêt d’avoir choisi un si grand transect est la grande diversité des conditions pédo-climatiques due au relief varié de la zone, des types de cultures et de végétation et enfin de pratiques humaines de gestion (type d’agriculture, d’élevage…), sur un nombre de kilomètres assez restreint. Ce transect Venµs permettra ainsi d’étudier de nombreux agro-écosystèmes différents.

Le transect Venµs, de Toulouse à l'espagne.

L’intérêt majeur de la mission scientifique Venµs est d’offrir une très forte revisite temporelle : chaque site sera observé tous les 2 jours. En combinant les données de Venµs avec celles de Landsat 8 et Sentinel-2, la revisite sera presque quotidienne. Au niveau scientifique, il s’agit de préparer les futures missions spatiales opérationnelles et de démontrer l’intérêt d’une fréquence temporelle très élevée. Au niveau thématique, ces 2 années 2018 et 2019 vont permettre de suivre finement les évolutions rapides des phénomènes naturels comme les variations du manteau neigeux, la croissance des cultures, les stades phénologiques des diverses végétations (forêts, prairies, cultures, autres milieux naturels), etc… Pour être pleinement valorisés, ces sujets nécessiteront des observations de terrain de qualité sur ces deux années 2018 et 2019. Nous faisons donc acte d’information, voire d’appel à volontaires, pour collecter des données de terrain pertinentes. Ci-dessous, nous listons les principaux sujets déjà prévus ou potentiels, pour chacune des 2 grandes zones géographiques du transect ; ainsi que les principaux acteurs pré-identifiés.


Continue reading

Using multi-temporal high-resolution remote sensing in surface modeling

Version française par ici.


The land surface models simulate the water and energy fluxes between soil, land cover and atmosphere. Their scopes of application spread from numerical weather prevision to soil water modeling.


Energy and water budgets of the soil-plant continuum


However, these models are initially conceived to be applied on wide areas. Thus, they use low resolution cover parameters (>1km) derived from mid-resolution satellite observations (MODIS, VEGETATION). These parameters are mainly the Leaf Area Index (LAI), the vegetation type or the surface albedo. Yet the agricultural landscapes of Western Europe are characterized by a patchwork of plots smaller than one square kilometer. These plots have very different vegetation cycles, i.e. winter and summer crops, which could only be described at high resolution. The crop management practices like crop rotation or irrigation are also generally not taken into account.


The products of the Sentinel-2 space mission, with their high spatial and temporal resolution, could bring elements to fill in this missing information.

Continue reading

Les séries temporelles d'images à haute résolution pour la modélisation des échanges de surface

English version right here.


Les modèles de surface simulent les échanges d’eau et d’énergie entre le sol, le couvert (végétal ou non) et l’atmosphère. Leurs  applications vont de  la prévision numérique du temps à la modélisation de l’état hydrique des sols.


Bilans d’eau et d’énergie du continuum sol-plante


Cependant, ces modèles, initialement conçus pour simuler de grandes étendues, utilisent des paramètres du couvert à basse résolution (au-delà du kilomètre) qui sont issus des observations satellites à moyenne  résolution (MODIS, VEGETATION). Ces paramètres sont principalement l’indice foliaire, le type de végétation, l’albédo de la surface. Or, les paysages agricoles d’Europe de l’Ouest sont caractérisés par un patchwork de cultures avec des parcelles bien inférieures au kilomètre carré et des cycles de végétation très différentes (cultures d’été, cultures d’hiver, …) qui ne peuvent être décrits qu’à haute résolution. De plus, les pratiques culturales (dites anthropiques car liées à une action de l’homme) comme la rotation des cultures ou l’irrigation, ne sont également généralement pas prises en compte.


Les produits satellite à haute résolution spatiale et temporelle issus de la mission Sentinel-2 peuvent contribuer à palier à ces lacunes puisqu’ils ont une résolution inférieure à l’échelle de la parcelle.

Continue reading

Building a global cropland mask, is not an easy task

Criticizing is easy, and doing is hard, especially when trying to create a global map of croplands. Some collegues from CESBIO have worked on that subject within the Sen2Agri project, and obtained good resuts, but only at the local or country scale. Finding a method that works everywhere must clearly be much harder.

These days, I have received a lot of emails, tweets and posts about a new cropland global product at 30 m resolution, edited by USGS. I have no doubt it was a serious work from a serious team, done with appropriate terrain data and methods, validation, and of course a tremendous data processing.



But there it is, I checked it over a lot of places that I know very well, and it seems to me that the cropland mask, at least in South West France, is clearly overestimated. Is it the same in tour region ? Here are some examples :

Continue reading

Yesterday's snow cover area in the Pyrenees

Olivier pointed to me that ESA's ground segment, PEPS and MUSCATE were all in really good shape today... And the sky was clear yesterday at the time of the Sentinel-2A acquisition!

So I could download the Level-2A product from, run our let-it-snow processor, start QGIS and here it is: the map of yesterday's snow cover area at 20 m resolution. If you know the region, you might notice that there is currently a big contrast in the snow cover extent between the French and the Spanish Pyrenees. This is due to the blocking of the moist air masses coming from the north.

Snow cover area on 22 Nov 2017. blue: snow, grey: no snow, white: cloud.

Stay tuned! Theia should start to distribute these Sentinel-2 snow products in near real time very soon.

L'Antarctique de nouveau sous l'œil de Sentinel-2

Après un hiver pleins de rebondissements en Antarctique, j'avais écris "il est temps que la luminosité revienne pour que les acquisitions Sentinel-2 redémarrent !".

Ça y est ! Le paysage qui se découvre petit-à-petit sous l’œil de Sentinel-2 est toujours aussi somptueux et pleins de surprises. Par exemple, la première image claire de la langue flottante du glacier de l'Île du Pin montre l'iceberg géant qui s'est décroché en septembre. Nous avions suivi l'évolution de la fissure à l'origine de ce détachement l'été dernier avec Sentinel-2. En zoomant, on peut apercevoir le bleu éclatant de la glace comprimée d'un iceberg renversé. Ce mini iceberg couvre quand même une surface équivalente à 20 terrains de football.

Continue reading