A Perfect Machine Learning Training Data Set?

Above background image/labels: GFDRR Labs (2020). “Open Cities AI Challenge Dataset”, Version 1.0, Radiant MLHub. April 2022, https://doi.org/10.34911/rdnt.f94cxb

Seeking 100% and Lessons Learned Along the Way

This is the third part of a multi-part series on the ramp project. For more background on the project and an introduction to the training process, be sure to check out our first two blogs.

What is good enough?

“What is good enough” is a living question that has evolved over the span of the ramp project. It starts at the end: defining our use case and engaging end users as advisors to understand how the model outputs will be utilized. Working backward in this way has allowed us to craft training data guidance and quality assurance measures that influence model accuracy in ways that align with end-user needs. This means exploring questions like: What is the minimum accuracy required for the ramp model and building data outputs to be useful for each use case? Does the end user need high-accuracy population estimates, or only to identify rural built-up areas and villages? Does over-extraction or under-extraction pose a bigger problem for our users? Different model applications require different degrees of output aggregation and varying image resolution in the initial training data.

For the ramp project, we are targeting 90-100% training data accuracy, expecting a <=10% margin of error to be acceptable. Higher accuracy in our training data translates directly into higher model performance and better, more valuable outputs. Seeking 100% is not about actually getting our model to 100% through perfect training data; it is about doing all that we can within the scope of the project to make the model perform its best. That means not cutting corners, and finding ways to streamline our process for creating, editing, and ingesting data to train the baseline and fine-tuned ramp models.

The bottom line is that a deep learning model like ramp needs tens of thousands of training data tiles (in our case 256 pixel x 256 pixel tiles) and ingests hundreds of thousands of digitized building polygons throughout the training process. All of these data must be initially created and curated by human analysts to ensure accuracy and consistency, introducing not only a large volume of manual labor, but also many questions regarding quality, specifications, and prioritization.
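To make the scale concrete, the chipping step described above can be sketched in a few lines. This is an illustrative snippet, not the ramp pipeline's actual code; the function name and the non-overlapping, edge-dropping strategy are assumptions for the example.

```python
import numpy as np

def tile_image(image: np.ndarray, tile_size: int = 256):
    """Split an (H, W, bands) image array into non-overlapping square tiles,
    dropping any partial tiles at the right and bottom edges."""
    h, w = image.shape[:2]
    tiles = []
    for row in range(0, h - tile_size + 1, tile_size):
        for col in range(0, w - tile_size + 1, tile_size):
            tiles.append(image[row:row + tile_size, col:col + tile_size])
    return tiles

# A 1024 x 1024 three-band scene yields a 4 x 4 grid of 256-pixel tiles;
# a single large satellite scene can easily produce thousands of chips.
scene = np.zeros((1024, 1024, 3), dtype=np.uint8)
chips = tile_image(scene)
print(len(chips))  # 16
```

Even at this small scene size the tile count multiplies quickly, which is why tens of thousands of tiles, each needing human review, accumulate so fast.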

So, what is good enough? That depends on the use case and available resources. The ramp project is currently focused on using rapid building detection and delineation to support emergency and disaster response, as well as microplanning for public health initiatives. “Good enough” for our purposes was defined through thoughtful interaction with the end users and stakeholders. These experts provided insight into how these data would actually be used. Since representation is critically important for these use cases, we determined that over-extraction of buildings with more false positives (higher recall, lower precision) is preferable to under-extraction with a higher percentage of accurate detections (higher precision, lower recall) to ensure no structures or settlements are overlooked. By focusing on the end uses and applications of the outputs, we have shaped our approach to training data and more clearly defined the components and potential pitfalls in our pursuit of 100%.

Now how do you get to and exceed good enough? What follows is our attempt at shedding some light onto this question in our pursuit to do all that we can to produce a high-quality training data set for the ramp model.

How can we ensure consistency and assess quality efficiently?

Label Guidance

Due to the sheer volume of data it takes to train a deep learning model, teams of labelers often work together to generate a single dataset. Inevitably, this opens the door to variance in labeling, as each labeler naturally brings their own perception and experience. Below is an example of this variance from a qualitative test in which two labelers were given the same simple guidance, “label building rooftops,” over three identical tiles. The blue polygons represent labels created by one labeler, and the red outlines are the labels captured by the other. Differences like these between sets of labels are harmful to model accuracy and need to be limited as much as possible.

Maxar Open Data Program image over Dhaka, Bangladesh with polygon labels over buildings. Provided with CC BY-NC 4.0 license. 

Ramp labeling guidance decision tree

Knowing that labels must be collected in a consistent, predictable manner for effective model training, we put together a comprehensive guidance document to be used throughout the labeling process. The guidance outlines the expected label format for building footprint extraction data and provides examples and tools, such as a label quality decision tree, to be leveraged in producing high-quality training data. It takes into consideration ramp's specific use cases, digital microplanning and disaster response, and is tailored for the geographies where the model will be used. The guidance has been compiled through an iterative and ongoing process, incorporating feedback from partners and advisors as well as our own observations from labeling and reviewing the data.

Whether you are fine-tuning the ramp model over a specific region of interest or developing a new model, this guidance has been released for the open community to use, edit, and build upon. If you would like to give our labeling guidance a closer look, head over to the ramp landing page and check it out.

Label Review & Edit Tool – Chippy Checker

As mentioned in our last blog, we have also been leveraging existing open source training data sets, compiled for other AI projects, to supplement the labels that we are creating. These data sets introduce complexity to the labeling pipeline because they have often been created according to different labeling guidance and, in the case of Open Cities, from a different imagery source altogether (drone versus satellite). There is no right or wrong guidance, as different models have different use cases, but the mismatch creates the need to review, and often edit, the data before it can be used to train the ramp model.

We have found that it is often quicker to review and edit these existing data sets than to create our own labels from scratch. This can vary on a case-by-case basis, but in general we have been able to tune around 100 edited tiles for the same level of effort it takes to create 70 new labels. This is largely thanks to a tool we coined the “Chippy Checker,” which streamlines the review and editing of individually labeled tiles.

Open Cities Data over Accra, Ghana provided with CC-BY-4.0 license 

The tool works in QGIS, an open-source geographic information system. The QGIS user interface is familiar to many across the open community, and our tool works in tandem with the software's native functionality. Chippy Checker pulls matching image .tiff and label .geojson files into the software and allows users to quickly accept or reject a labeled tile. Using QGIS's editing tools, users can also adjust labels before accepting a tile, with the edited output automatically saved to a new file directory and spreadsheet.
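The first step of that workflow, pairing each image tile with its label file, can be sketched outside of QGIS as well. This is a hypothetical stand-alone sketch, not Chippy Checker's actual code: the function name, the match-by-filename-stem convention, and the directory layout are all assumptions for illustration.

```python
from pathlib import Path

def pair_tiles(image_dir: str, label_dir: str):
    """Match image .tif/.tiff files to label .geojson files that share the
    same filename stem, returning (image_path, label_path) pairs.
    Images without a matching label are skipped."""
    labels = {p.stem: p for p in Path(label_dir).glob("*.geojson")}
    pairs = []
    for img in sorted(Path(image_dir).glob("*.tif*")):
        if img.stem in labels:
            pairs.append((img, labels[img.stem]))
    return pairs
```

A reviewer-facing tool would then step through these pairs one at a time, rendering each image with its labels overlaid and recording an accept/reject/edit decision per tile.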

We’ve used this simple tool to step through and edit thousands of labeled tiles. It has been an integral step in our training data pipeline and we’re looking at releasing it to the open community so others can benefit from it as well.

Do you have thoughts about our training data process that you would like to share with us? Please reach out! We are learning and developing our pursuit of “perfection” as we go and appreciate your feedback. Also keep an eye out for our next blog in the series to continue the conversation about the ramp project.