Machine Learning and the Data Annotation Industry

February 11, 2020

Machine learning is rapidly becoming a major part of the products we use and of how we interact with governments, from neural networks that improved Google Translate to automated decision systems that use decision trees to detect potential SNAP fraud in New York City.

For machine learning to work, large datasets are required to train models that make useful predictions, especially when a data scientist is working on a classification problem. While many companies possess large stores of images, few possess labeled images, which raises the problem of how to label the image data and what to do with it once labeled.

Labeled data is a sample, usually an image, that has been given a label. For example, a picture of a dog would be given the label “dog,” and a picture of a cat would be given the label “cat.” As the hot dog scene in Silicon Valley shows, using machine learning to correctly detect what objects are is a much harder problem than the average person would expect. Labeled images are required to train machine learning models that can then make predictions on newly collected unlabeled data.
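To make the idea concrete, here is a minimal sketch of how labeled data drives prediction: each sample is paired with a label, and a toy nearest-neighbor classifier uses those pairs to label a new, unseen sample. The feature vectors are invented stand-ins for real image features, not anything from an actual dataset.

```python
import math

# Labeled dataset: (feature_vector, label) pairs. In practice the features
# would come from images; these numbers are purely illustrative.
labeled_data = [
    ((0.9, 0.1), "dog"),
    ((0.8, 0.2), "dog"),
    ((0.1, 0.9), "cat"),
    ((0.2, 0.8), "cat"),
]

def predict(features):
    """Label a new sample with the label of its nearest labeled neighbor."""
    nearest = min(
        labeled_data,
        key=lambda pair: math.dist(features, pair[0]),
    )
    return nearest[1]

print(predict((0.85, 0.15)))  # prints "dog"
print(predict((0.15, 0.85)))  # prints "cat"
```

Without the labels, the model would have nothing to predict; the unlabeled feature vectors alone carry no notion of “dog” or “cat.”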

The driverless car industry depends heavily on these labeled datasets to function. While driverless cars may be collecting LIDAR, camera, and radar images of their surroundings, this data is unlabeled, and someone needs to go in and label it so that it is actually useful to companies like Waymo, Uber, or Tesla. As Lyft’s open dataset of Level 5 driverless car data shows, the amount of data that needs to be labeled for driverless cars to make accurate predictions is large in scale. All of this labeled data is required so that when a driverless car is on the road, it can make accurate predictions from the new live data it is collecting.

The need for labeled data for driverless cars is so large that startups like Scale AI exist to provide companies like Lyft and Uber with large amounts of labeled data, produced through a combination of machine learning and manual human labor for the more ambiguous scenes.

This type of work, known as data annotation, plays a major part in the driverless car industry and beyond. Data annotation can be done through crowdsourcing on platforms like Mechanical Turk, where groups post tasks that need to be completed, or through companies like SamaSource, whose business model is hiring workers in developing countries to label data for companies like Google and Microsoft. An example of data annotation most people have experienced is the Google Captcha: whenever you fill one out, you are actually labeling images to help find edge cases within a given set of images. These images are then used by Google’s driverless car program to help the car navigate the road more effectively.

While crowdsourced data annotation has made large-scale classification possible (ImageNet itself was labeled by Mechanical Turk workers), there are flaws in using crowdsourced or hired labor to manually label datasets, ranging from poorly labeled data to labels that end up harming different groups of people.
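One common mitigation for noisy crowdsourced labels (not described in this post itself, but standard practice on platforms like Mechanical Turk) is to collect labels from several annotators for the same image and keep the majority vote. A rough sketch:

```python
from collections import Counter

def aggregate(annotations):
    """Return the most common label and its share of the votes."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)

# Three hypothetical workers label the same image; one disagrees.
label, agreement = aggregate(["dog", "dog", "cat"])
print(label)      # prints "dog"
print(agreement)  # roughly 0.67
```

Majority voting reduces random labeling mistakes, but it does nothing for systematic bias: if most annotators share the same skewed view of a category, the aggregated label simply encodes that bias, which is part of what happened with ImageNet.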

Last September, a group of artists released a project based on the ImageNet labels. Quite a few of the images were debatable, such as those portraying moments of intimacy. The labels themselves were drawn from an older project, WordNet, which, reflecting the culture of its time, included labels that could be perceived as biased against minority groups. Many of these labels were crowdsourced through Mechanical Turk, and quite a few of them were phrases ranging from “divorced” all the way to racial slurs. Following the outcry, many of these labels, and the images attached to them, were removed from WordNet and ImageNet.

Data annotation is a key component of AI policy because one of AI’s signature innovations, driverless cars, requires these datasets to function properly. When data annotation goes wrong, the consequences can include a car failing to recognize pedestrians. On the economic side, these roles create better-paying jobs in less developed countries and allow workers abroad to take on routine work that once required high skills to perform. AI policy also needs to focus on the less glamorous side of AI, data labeling, to ensure that its foundations are functioning properly and not causing undue harm.
