Diving further into data collection: Data pipelines, Training data, Dataset annotation

This article explores additional techniques that data engineers employ to improve the quality of downstream machine learning models.

Model accuracy can be improved in two key ways:

  1. Gather more data or refine the existing data while keeping the model unchanged.

  2. Adopt cutting-edge algorithms or tune model hyperparameters while keeping the dataset unchanged.

The first strategy is known as Data-centric, and the second as Model-centric. Currently, the ML community is gravitating towards a Data-centric AI approach. Numerous researchers and practitioners have concluded that improving data quality boosts accuracy far more than optimizing algorithms. The adage "Garbage in, garbage out" is taking on renewed relevance today.

Businesses should focus not on coding, but on developing systematic, reliable, and effective methods to improve data quality. In essence, businesses should shift from a Model-centric to a Data-centric approach.

Organizations creating high-quality AI products also adhere to the Data-centric philosophy. A notable AI division leader at a prominent electric vehicle manufacturer once shared that his role predominantly involved data handling. Data-centric AI has become so popular that it has evolved into a distinct discipline focused on techniques for improving dataset quality.

Data Pipelines

Data is ubiquitous, spanning system-generated logs, banking transactions, website data, user-entered information, and customer data.

Often, this data arrives in a chaotic, unstructured, and uncleaned state. We receive information from various sources that are challenging to merge. Sometimes the data is encrypted or has missing fragments. It can come in the form of byte streams, text files, tables, images, audio, and video recordings, and can be binary or human-readable.

Initially, this data must be processed, transformed, cleaned, aggregated, and stored before data scientists and ML engineers can utilize it.

A pipeline organizes the data flow

The ETL (Extract - Transform - Load) approach is quite common in analytics and machine learning. Here's how to work with ETL (a minimal sketch follows the steps below):

  1. Identify the data you wish to collect and its sources.

  2. Merge these sources, transform the data into the chosen format, eliminate inconsistencies, and correct errors.

  3. Design a storage system for the processed and cleaned data.

  4. Automate the entire process to run without employee intervention. Data pipelines should automatically initiate at set intervals or following specific events.
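Putting the four steps together, here is a minimal Python sketch of such a pipeline. The file paths, column names, and cleaning rules are invented for illustration; a real pipeline would add logging, retries, and a scheduler such as cron or Airflow.

```python
# Minimal, hypothetical ETL sketch: extract raw CSV exports, clean and merge them,
# and load the result into a local SQLite database.
import sqlite3

import pandas as pd


def extract() -> pd.DataFrame:
    # 1. Identify sources: here, two daily CSV exports (paths are placeholders).
    orders = pd.read_csv("raw/orders.csv", parse_dates=["created_at"])
    users = pd.read_csv("raw/users.csv")
    return orders.merge(users, on="user_id", how="left")


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # 2. Merge, normalize formats, drop inconsistencies, fix obvious errors.
    df = df.drop_duplicates(subset=["order_id"])
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df = df.dropna(subset=["amount", "user_id"])
    df["country"] = df["country"].str.upper().str.strip()
    return df


def load(df: pd.DataFrame) -> None:
    # 3. Store the processed data where analysts and ML engineers can reach it.
    with sqlite3.connect("warehouse.db") as conn:
        df.to_sql("clean_orders", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    # 4. In production this script would be triggered by a scheduler or an event,
    #    not run by hand.
    load(transform(extract()))
```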

This overview briefly touches upon data pipelines, a topic that encompasses a wide array of nuances. An increasing number of companies are hiring data engineers to manage storage and pipelines, allowing data scientists and ML experts to focus on analysis and modeling.

Another critical point before we discuss training data and annotation is that properly configured pipelines enable companies to benefit from data even without advanced machine learning. Thus, companies typically start with reports, metrics, and basic analytics before transitioning to more sophisticated ML approaches.

Training Data

To train a "cat/dog" classifier, it's essential to feed the model numerous images of cats and dogs, labeling each accordingly. Without setting explicit rules or explanations, you let the model determine on its own which features to focus on when making predictions. Mathematically, this means the model adjusts its parameters during training until its outputs on the input data match the expected labels.
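To make this concrete, here is a minimal training-loop sketch using PyTorch and a recent torchvision. It assumes a hypothetical data/train folder with cat/ and dog/ subfolders, and it skips validation, augmentation, and other practical details.

```python
# Minimal sketch of supervised training for a "cat/dog" classifier.
# Assumes a hypothetical folder layout: data/train/cat/*.jpg, data/train/dog/*.jpg.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("data/train", transform=transform)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: cat, dog

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)  # compare predictions to labels
        loss.backward()                        # and adjust parameters to reduce the gap
        optimizer.step()
```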

The model's understanding of the world is built upon the training data, assuming these data accurately reflect reality. Thus, the quality of training data is paramount.

A "cat or dog" model will struggle to identify specific breeds or correctly classify other animal types if such information was absent from the training dataset. If labels contain errors, such as mislabeling cats as dogs and vice versa, the model will get confused, impairing its accuracy. Non-random errors can severely damage the model. For example, if all Chihuahuas are labeled as cats, the model will learn to recognize Chihuahuas as cats. Real-world data comes with biases. For instance, if data suggests women are paid less, a model trained to predict company salaries might learn and perpetuate this disparity. If certain classes or segments are underrepresented or missing in the training dataset, the model will fail to learn from them, resulting in inaccurate predictions.

Training data must be relevant, uniform, representative, and comprehensive. Here are some tips to enhance data quality.

Before collecting data, understand the business problem, then formulate it as a machine learning task: what to predict and from which data. Any business challenge can be framed in different ways, depending on requirements and constraints. For example, in computer vision projects, I often choose among object detection, segmentation, and classification, adjusting the number of classes accordingly.

Training data should closely resemble what the model will encounter in production

In theory, models can generalize learned knowledge to unseen data, but in reality, their generalization capabilities are limited. For instance, a computer vision model trained on interior images will underperform on outdoor objects. A sentiment analysis model trained on tweets will struggle with classical literature texts. I’ve encountered cases where computer vision models barely generalized even with minor differences, such as changes in lighting, skin tones, weather conditions, and compression methods. A popular approach to bridging the gap between training and operational data is using recent production data as the training dataset.
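A minimal sketch of that last idea, assuming production inputs are already logged with a timestamp column (the file name, column names, and sample size are hypothetical, and timestamps are assumed to be timezone-naive):

```python
# Sketch: build the next training set from recent production traffic.
import pandas as pd

log = pd.read_parquet("predictions_log.parquet")     # logged production inputs
cutoff = pd.Timestamp.now() - pd.Timedelta(days=30)
recent = log[log["created_at"] >= cutoff]             # keep only the last 30 days

# Send a sample of recent, unlabeled production data off to annotation.
to_label = recent.sample(n=min(5000, len(recent)), random_state=42)
to_label.to_csv("to_annotate.csv", index=False)
```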

A small, clean dataset is preferable to a large, dirty one

Data annotation is a significant challenge for most projects. Labeling data is an extremely complex, slow, and costly process. Only IT giants can afford massive, cleaned datasets. Others must choose between size and quality, where quality should always take precedence, especially for datasets used in model evaluation.

No one can definitively say how much data is needed

It depends on the complexity of the phenomenon being predicted, the variability of the training data, and the required accuracy of the model. The only way to find out is through trial and error.

Collect data in batches

Start with a small dataset, label it, check its accuracy, analyze errors, and plan the next iteration for data collection and labeling.

Training data is not static. Remember, you will need to train and retrain the model numerous times, both during research and after launching it in production. With each new iteration and model update, a new training dataset is required.

Data Labeling

Most models deployed in production today have been prepared through supervised learning, meaning they require labeled data for training and evaluation. Even in unsupervised learning scenarios, where models learn patterns and structures from unlabeled data, labeled data is still essential for assessing model accuracy. Without it, how can you determine if the model is ready for production deployment?

There are two types of labels: manually created and "organic."

In some machine learning tasks aimed at predicting future events—such as stock prices, customer churn, arrival times, fraudulent transactions, and recommendations—the correct label becomes apparent in the future. These labels are referred to as organic, and our job is merely to collect them as they emerge.
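As an illustration, here is a small sketch of collecting organic labels in a churn setting: logged predictions are joined with churn events observed later. All file and column names are hypothetical.

```python
# Sketch: join logged predictions with outcomes that became known later,
# producing "organic" labels for retraining and evaluation.
import pandas as pd

preds = pd.read_csv("prediction_log.csv", parse_dates=["predicted_at"])
events = pd.read_csv("churn_events.csv", parse_dates=["churned_at"])

labeled = preds.merge(
    events[["customer_id", "churned_at"]], on="customer_id", how="left"
)

# A customer counts as churned if the event happened within 30 days of the prediction.
window = pd.Timedelta(days=30)
labeled["label"] = (
    labeled["churned_at"].notna()
    & (labeled["churned_at"] >= labeled["predicted_at"])
    & (labeled["churned_at"] - labeled["predicted_at"] <= window)
).astype(int)

labeled.to_csv("training_data_with_organic_labels.csv", index=False)
```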

In fields like computer vision and natural language processing, we're not predicting the future but classifying, analyzing, and extracting information from images and texts. Hence, we can't rely on organic labels and must primarily depend on manual labeling.

Manual data labeling is an extremely complex, slow, and expensive process. Treat it not just as a task within an ML project but as a separate data annotation project, with its own scope, timelines, budget, team, tools, and KPIs.

Firstly, decide who will do the labeling

There are three options: crowdsourcing, working with vendors, or using your own team. Crowdsourcing platforms like Amazon Mechanical Turk were once hailed as game-changers, but it soon became evident that crowdsourced labeling is only suitable for very simple tasks requiring little to no training. Thus, most companies choose between outsourcing and an in-house labeling team. Startups tend to lean towards service providers because they are quicker to get started with, while large AI companies often establish their own data labeling departments to control the process and ensure high-quality annotations. A leading electric vehicle company, for instance, has over a thousand employees in its manual data labeling department.

Create and use guides to train annotators

These documents, filled with explanations and visual examples, describe how labeling should be performed. They then serve as training materials that annotators must study before tackling real labeling tasks. If you're working with vendors, ensure they have a well-established training process.

The real world is messy and ambiguous, so don't chastise annotators when they say, "I don't know how to label this sample." Collect these ambiguous examples and use them to refine your guidelines.

The choice of annotation tool matters

Annotators are usually paid by the hour, so optimizing their work for speed and accuracy can save a lot of money. Whether an annotator can handle 100 samples an hour or 300 makes a significant difference in large-scale projects. Consider the following when choosing a tool:

  • How long does it take to label a single sample? There are tools specifically designed for natural language processing tasks, while others are meant for 2D or 3D computer vision.

  • Is AI-assisted labeling supported? Such tools can pre-annotate data, for example by predicting segmentation masks or running custom models with a click, which speeds up the labeling process.

  • How well does the tool integrate with your infrastructure? An annotation tool is part of the data pipeline. As data arrives, samples are automatically extracted and sent to annotators, who label them, and the labels are automatically stored in a database. Some tools will fit your infrastructure better than others.
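For illustration, here is a rough sketch of such an integration: new samples are pushed to an annotation tool over a generic REST API and completed labels are written back to a database. The endpoints, token, and table names are invented for the example, and a labels table is assumed to exist.

```python
# Sketch: wire an annotation tool into the data pipeline.
# The REST endpoints, API token, and table names are hypothetical.
import sqlite3

import requests

ANNOTATION_API = "https://annotation.example.com/api"
HEADERS = {"Authorization": "Token <placeholder>"}


def send_new_samples(samples: list[dict]) -> None:
    # Push freshly extracted, unlabeled samples as annotation tasks.
    requests.post(f"{ANNOTATION_API}/tasks", json=samples, headers=HEADERS, timeout=30)


def fetch_completed_labels() -> list[dict]:
    # Pull tasks the annotators have finished.
    resp = requests.get(
        f"{ANNOTATION_API}/tasks", params={"status": "completed"},
        headers=HEADERS, timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def store_labels(labels: list[dict]) -> None:
    # Write labels back into the project database for training and evaluation.
    with sqlite3.connect("warehouse.db") as conn:
        conn.executemany(
            "INSERT OR REPLACE INTO labels (sample_id, label) VALUES (?, ?)",
            [(item["sample_id"], item["label"]) for item in labels],
        )
```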

Estimate costs and timelines

Data labeling can be surprisingly slow and expensive, so prepare yourself and your management for this reality.

Here's a rough formula to estimate costs and time (a worked example follows the list):

  • Labeling time (in person-hours) = time to label a sample (in hours) × dataset size (in samples) + time allocated for training and error correction (in person-hours).

  • Labeling time (in working days) = labeling time (in person-hours) / number of employees / 8 hours.

  • Expenses ($) = annotator's hourly rate ($) × time to label (in person-hours).
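To make the formula concrete, here is a small worked example with invented numbers: two minutes per sample, 20,000 samples, 40 person-hours of training overhead, four annotators, and an $8 hourly rate.

```python
# Worked example of the estimate above; all numbers are invented for illustration.
time_per_sample_h = 2 / 60        # 2 minutes per sample, in hours
dataset_size = 20_000             # samples
training_overhead_h = 40          # person-hours for training and error correction
annotators = 4
hourly_rate = 8.0                 # $ per person-hour

person_hours = time_per_sample_h * dataset_size + training_overhead_h
working_days = person_hours / annotators / 8
cost = hourly_rate * person_hours

print(f"{person_hours:.0f} person-hours, {working_days:.1f} working days, ${cost:,.0f}")
# -> 707 person-hours, 22.1 working days, $5,653
```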

Despite your best efforts, data labels will still be imperfect

People make mistakes, get distracted, or misunderstand tasks, so label quality must be checked. And of course, the algorithm or tool chosen for this task must be integrated into the data pipeline. I can’t emphasize enough: everything should be automated.

One such tool is Cleanlab, developed by MIT alumni and recently gaining popularity. Cleanlab enhances the quality of labeling for images, text, and tabular data using statistical methods and machine learning algorithms (see Cleanlab's blog for its capabilities).
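As a rough illustration based on Cleanlab's documented find_label_issues function, a label-quality check might look like the sketch below. The features, labels, and classifier are placeholders; out-of-sample predicted probabilities come from cross-validation, as Cleanlab recommends.

```python
# Sketch: flag probable label errors with cleanlab's find_label_issues.
import numpy as np
from cleanlab.filter import find_label_issues
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X = np.load("features.npy")   # hypothetical feature matrix
y = np.load("labels.npy")     # hypothetical (possibly noisy) integer labels

pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

issue_indices = find_label_issues(
    labels=y,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issue_indices)} samples look mislabeled; review them first.")
```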

Need assistance with processing your product's data or integrating algorithms and services based on big data?

Feel free to book a free call with our CTO, or leave your contact details on our website. We'll answer all your questions and, if desired, develop a custom big-data algorithm or service of any complexity and integrate it into your product.
