Refine your data to get the most out of machine learning

Previously we posted about the need to collect training data. Now, what do you do with it?

If you adhere to Clive Humby’s famous notion that “data is the new oil,” then you know what comes next: refinement.

While the goal is to train a machine learning model (and ultimately address your business case), moving directly from raw data to modeling is like putting unrefined crude oil into a jet engine. You won’t like the outcome.

Review Your Data


The refinement process begins by taking a step back and reviewing what you have gathered (and are hopefully continuing to gather throughout this process!).

At a minimum, you should collect and review metrics such as:

  • The number of examples collected
  • A breakdown of your target variable (i.e., the values you want your model to predict when it sees a similar example)
  • The number of null/empty values (if any)
  • The distribution of each of the data streams you are capturing

The purpose of doing this is to quickly identify any obvious problems in your data collection to avoid wasting resources on unusable data. The best set of metrics will depend on your particular use case.
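If your data fits in a table, a minimal sketch along these lines (assuming pandas and a hypothetical file training_data.csv with a "label" column as the target) can pull the basic review metrics together:

```python
import pandas as pd

# Hypothetical example: a tabular dataset with a "label" column as the target.
df = pd.read_csv("training_data.csv")

# The number of examples collected
print(f"Examples collected: {len(df)}")

# A breakdown of the target variable
print("Target breakdown:\n", df["label"].value_counts(normalize=True))

# The number of null/empty values per column
print("Null counts:\n", df.isna().sum())

# The distribution of each numeric data stream
print("Feature distributions:\n", df.describe())
```

Even a quick summary like this is often enough to catch a mislabeled class, a column that is mostly empty, or a sensor stuck at a constant value.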

Reviewing data can be time-consuming and tedious, especially video data. Nevertheless, skipping this step introduces significant unnecessary risk. I learned this lesson the hard way while working on a real-world video dataset that was supposed to contain people answering a series of questions. When reviewing the footage, we discovered that many of the responses were nonsense (literally gibberish) and many more were duplicates, forcing us to discard much of the gathered data at great cost to the company.

With that said, you generally do not need to review everything before starting to develop a model. Doing so is impractical when working with massive datasets and impossible when working with live data streams. In these scenarios, it is usually better to review what you have time to examine and move on unless your cursory review raises red flags. After all, once you have a model, you can use a variety of techniques to return to your dataset for further review without the up-front time investment. We will discuss these techniques in a future post.

Data Cleaning and Validation


If you are lucky (at least luckier than this author), your data will look great, and you can begin model development. More commonly, however, you will need to perform some processing to prepare your data for subsequent modeling.

Some common data cleaning tasks include replacing null and outlier values with sensible numbers and removing duplicates and other unwanted data. Cleaning can also involve digital signal processing and other machine learning techniques to filter out noise and distinguish good from bad data.

As with anything, it is generally best to start simple and add complexity as specific needs become apparent.
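In that spirit, a minimal cleaning pass over a tabular dataset (again assuming pandas and hypothetical column contents) might handle duplicates, nulls, and outliers like this:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal, illustrative cleaning pass: duplicates, nulls, outliers."""
    df = df.drop_duplicates()

    numeric_cols = df.select_dtypes(include="number").columns
    for col in numeric_cols:
        # Replace nulls with the column median (a simple, sensible default)
        df[col] = df[col].fillna(df[col].median())

        # Clip extreme outliers to the 1st/99th percentiles
        lo, hi = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(lower=lo, upper=hi)

    return df
```

The right imputation and outlier strategy depends on your data; the point is to encode whatever you choose in one reusable function rather than a pile of ad hoc notebook cells.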

As part of this stage, it can be valuable to pass your data through a data validation workflow that automatically flags and quarantines aberrant data for human review. The same validation flow can also be used upstream of your model in production, helping ensure that the data your model receives closely resembles your training data. This approach also allows you to monitor changes in your data over time, which can alert you to otherwise invisible indicators that you need to retrain your model on more up-to-date data.
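A validation step does not need to be elaborate to be useful. A minimal sketch, assuming hypothetical columns such as sensor_reading and label with known expected values, could look like:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split records into (accepted, quarantined) based on simple checks."""
    checks = (
        df["sensor_reading"].between(0, 100)   # expected physical range
        & df["label"].isin(["ok", "fault"])    # known target categories
        & df.notna().all(axis=1)               # no missing fields
    )
    accepted = df[checks]
    quarantined = df[~checks]  # route these to a human reviewer
    return accepted, quarantined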

Data Provenance


Especially at this stage of a machine learning project, it is important to be highly organized and document every action taken to review, clean, and otherwise prepare data for modeling. A practical way to obtain such comprehensive data provenance is via a combination of version control and detailed process logging. This works for multiple reasons:

First, once you have a model in production, you need to apply precisely the same data cleaning process to incoming data before it reaches your model; in other words, your data cleaning pipeline must be reproducible. Keeping your data transformation code in a centralized repository makes this straightforward.
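One way to make that reuse concrete is to keep the preparation steps in a single versioned module and log every run, so the identical code path serves both training and inference. A minimal sketch, reusing the hypothetical clean and validate functions from the earlier examples, might look like:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_pipeline")

def prepare(df):
    """Shared preparation path, used for both training data and live traffic."""
    logger.info("Raw records received: %d", len(df))
    accepted, quarantined = validate(df)  # hypothetical validate() from above
    logger.info("Quarantined for human review: %d", len(quarantined))
    cleaned = clean(accepted)             # hypothetical clean() from above
    logger.info("Records after cleaning: %d", len(cleaned))
    return cleaned
```

The log lines double as a lightweight provenance record: every training run and every production batch leaves a trace of what was dropped, quarantined, and kept.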

Second, the next stage of the data science process, model development, usually involves building not one but multiple models. The total number of models depends on the complexity of the problem, among other factors (including the project budget). Each of these experiments reveals more about the upstream data and may encourage you to rethink your data ingestion pipeline. Good code and data version control helps you make revisions without starting from scratch.

Finally, auditability is increasingly considered a core requirement of AI/ML systems. Monitoring and tracking model predictions over time can help uncover hidden biases and enable work toward improved future performance. While this has always been a requirement for AI/ML applications in fields like healthcare, it has clear benefits to organizations in any industry developing models that directly impact other people.

Conclusion


If any or all of the above sounds difficult, especially the last part about data provenance, you’re not alone. While model development and deployment drive up your cloud computing bill, often significantly, data review and cleaning are typically the most labor-intensive stages of the data science workflow. Getting this stage right sets you up for success.

Rather than developing your own frameworks, it is often more strategic to use existing tools and platforms that have already incorporated the relevant capabilities for you and your team.

The Teknoir platform makes it easy to gather and clean data, particularly from embedded devices at the edge. It automatically logs inbound data and enables users to append any information required for compliance and auditing.

These features, in combination with the ability to (a) deploy custom data pipelines and models and (b) develop low-code workflows that clean, transform, and augment data from the edge, provide a robust jumping-off point for any AI/ML initiative.

Let us know how we can support your next embedded ML project!

Written By: Michael Bell, Ph.D.