Most companies embarking on an AI/ML journey quickly realize that processes cannot be automated without gathering data first. Even when developing rule-based algorithms rather than statistical models to make predictions, you and your algorithms need to understand the breadth of possible data coming into your system to properly handle data in production. In data science, we call this “training data” because we are literally using it to teach a model how to respond to input data, and not just any data will work.
To perform well, machine learning models generally require training data that resembles the data they will see when deployed in the real world.
Take, for example, a people-tracking use case like the one depicted below.
The best way to achieve great model performance is to gather training data from a camera deployed on-site in precisely the same place you intend to have it deployed long term. You can then start to record footage and annotate (a.k.a. label) people you see in the video frames. The annotations/labels and raw image data together can be used to train a model to detect and track people automatically.
Not only would this training data match the camera position/angle of your production use case, but it would also capture lighting changes throughout the day, which are likely to affect your model's performance.
This speaks to another key factor in high-quality training data: diversity.
The best training sets contain broad coverage of as many conceivable examples as possible without duplicates. In the case of people-tracking, if you need to detect people facing the camera and away from the camera, many examples of both cases should be in your training data.
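As a minimal sketch of the "without duplicates" point, the snippet below drops byte-identical frames by hashing their contents. The frame names and the `(name, raw_bytes)` layout are hypothetical, chosen only for illustration. Note that this catches exact copies only; near-identical consecutive video frames would need a different approach, such as sampling frames at a lower rate.

```python
import hashlib


def remove_exact_duplicates(frames):
    """Keep the first occurrence of each distinct frame.

    `frames` is a list of (name, raw_bytes) pairs. This layout is an
    assumption made for the example, not a required format.
    """
    seen = set()
    unique = []
    for name, data in frames:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, data))
    return unique


# Hypothetical frames: the second is a byte-identical copy of the first.
frames = [
    ("frame_001.jpg", b"\x00\x01\x02"),
    ("frame_002.jpg", b"\x00\x01\x02"),
    ("frame_003.jpg", b"\x09\x08\x07"),
]
unique_frames = remove_exact_duplicates(frames)
print([name for name, _ in unique_frames])
```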
Good training data must also be labeled accurately and precisely. In the people-tracking example, accurate labels mean that every bounding box you draw labeled “Person” actually contains a person, and ideally, only one. Precise labeling means your bounding boxes are tight around the people you identify. If you include a lot of padding around your object of interest (people, in our case), your training labels become less meaningful and will likely degrade the performance of your model.
One way to obtain accurate labels is by asking multiple annotators to label the same data. While you could have multiple members of your team do this, you could also leverage a data annotation service to complete this tedious task.
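One common way to compare two annotators' bounding boxes is intersection over union (IoU): the area the boxes share divided by the area they cover together. The sketch below assumes boxes are stored as `(x_min, y_min, x_max, y_max)` in pixels, and the 0.8 agreement threshold is an illustrative assumption rather than a standard value.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


AGREEMENT_THRESHOLD = 0.8  # assumed cutoff; tune for your own data

# Hypothetical labels for the same person from three annotators.
annotator_1 = (10, 10, 50, 90)
annotator_2 = (12, 8, 52, 92)   # close to annotator_1
annotator_3 = (0, 0, 80, 120)   # heavily padded box

for name, box in [("annotator_2", annotator_2), ("annotator_3", annotator_3)]:
    score = iou(annotator_1, box)
    status = "agrees" if score >= AGREEMENT_THRESHOLD else "needs review"
    print(f"{name}: IoU={score:.2f} ({status})")
```

A low IoU against the other annotators is also a quick way to spot the loosely padded boxes described above.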
You know you need to collect training data, but how much do you really need? The answer is that it depends.
In the case of people-tracking, a good starting point is approximately 1000 labeled examples. If you are detecting multiple objects, try to collect 1000 examples of each. This may be more than enough or far too few depending on image conditions (or data complexity in non-image use cases).
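To keep track of progress toward that starting point, you can count labeled examples per class. The sketch below assumes your labels are available as a flat list of class names; the "Bicycle" class and the counts are made up for illustration.

```python
from collections import Counter

TARGET_PER_CLASS = 1000  # the rough starting point suggested above


def classes_below_target(labels, target=TARGET_PER_CLASS):
    """Return {class_name: examples_still_needed} for classes under target."""
    counts = Counter(labels)
    return {cls: target - n for cls, n in counts.items() if n < target}


# Hypothetical label list: one entry per annotated bounding box.
labels = ["Person"] * 1200 + ["Bicycle"] * 350
shortfall = classes_below_target(labels)
print(shortfall)
```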
Generally, more training data is better, but sometimes you simply cannot gather thousands or even hundreds of examples. In these scenarios, “data augmentation” techniques can be used to artificially expand the amount of data you have gathered.
In the example below, a single image has been augmented to create 9 images via combinations of scaling, translation, rotation, brightness adjustments, and mirroring. This technique is useful even when you have lots of training data available.
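To make the idea concrete, here is a minimal sketch of two of these augmentations, mirroring and brightness adjustment, applied to a tiny grayscale image stored as nested lists. It is a toy illustration, not a production pipeline: real projects would typically use an image library, and the helper names here are hypothetical. Note that when you mirror an image, its bounding boxes must be mirrored too.

```python
def mirror(image):
    """Flip a grayscale image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def adjust_brightness(image, factor):
    """Scale every pixel by `factor`, clamping to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def mirror_box(box, image_width):
    """Mirror an (x_min, y_min, x_max, y_max) box to match a mirrored image."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


# A tiny 4x3 "image" and one label, both made up for this example.
image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
]
box = (0, 0, 2, 3)
width = len(image[0])

# Each entry pairs an augmented image with its correspondingly adjusted label.
augmented = [
    (mirror(image), mirror_box(box, width)),
    (adjust_brightness(image, 1.5), box),
    (adjust_brightness(mirror(image), 0.5), mirror_box(box, width)),
]
print(len(augmented), "augmented examples from 1 original")
```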
You can also minimize the amount of training data you need to collect by using a technique called “transfer learning” where you leverage what models have learned from MUCH larger datasets (i.e. with 100,000 – 1,000,000+ examples). This is a very powerful technique, and most practical projects can benefit from this approach when a starting model is available. Data science platforms like Teknoir make this process easy.
It can be tempting to use one of the many pre-trained models available in place of gathering a training dataset. These models are fantastic for proof-of-concept development, but you can reduce the risk of unanticipated problems down the road by fine-tuning these models on your own real data prior to using them in production.
The Teknoir platform connects users to many available pre-trained models that serve as proofs-of-concept and expedite data annotation for custom application development. The Teknoir platform was designed to simplify deploying one or multiple devices to gather training data for predictive modeling. Likewise, the Teknoir Label Studio and pre-configured model training workflows accelerate the timeline to obtain and learn from training data, ultimately achieving objectives faster. Then, once the model is deployed, the platform monitors and tracks model performance and updates as needed to ensure ongoing high-quality performance.
As discussed above, gathering training data is the first and most important part of any successful AI/ML initiative. Let us know how we can help you begin this process and make the most out of the data you collect!
Written by: Michael Bell, PhD