Apr 5, 2023
Why you don't need big data to train ML


When someone says artificial intelligence (AI), they most often mean machine learning (ML). To create an ML algorithm, most people think you need to collect a labeled dataset, and that the dataset must be huge. This is all true if the goal is to describe the process in one sentence. However, if you understand the process a little better, big data is not as necessary as it first seems.

Why many people think nothing will work without big data

To begin with, let’s discuss what a dataset and training are. A dataset is a collection of objects that are typically labeled by a human so that the algorithm can understand what it should look for. For example, if we want to find cats in photos, we need a set of photos and, for each photo, the coordinates of the cat, if there is one.

During training, the algorithm is shown the labeled data with the expectation that it will learn how to predict labels for objects, find general dependencies and be able to solve the problem on data that it has not seen.

One of the most common challenges in training such algorithms is called overfitting. Overfitting occurs when the algorithm remembers the training dataset but doesn’t learn how to work with data it has never seen.

Let’s take the same example. If our data contains only photos of black cats, then the algorithm can learn the relationship: black with a tail = a cat. But the false dependency is not always so obvious. If there is little data, and the algorithm is strong, it can remember all the data, focusing on uninterpretable noise.

The easiest way to combat overfitting is to collect more data because this helps prevent the algorithm from creating false dependencies, such as only recognizing black cats.

The caveat here is that the dataset must be representative (e.g., using only photos from a British shorthair fan forum won’t yield good results, no matter how large the pool is). Because more data is the simplest solution, the opinion persists that a lot of data is needed.
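The black-cat effect can be reproduced with a toy sketch. Everything below is an assumption made for illustration: the three boolean features, the tiny hand-written datasets, and the deliberately simple "stump" learner that picks the single feature agreeing best with the labels. It is not a real vision model, only a way to see how biased data lets a false dependency win.

```python
# Toy learner: choose the one feature whose value best matches the label.
# Feature names and all rows are made up for this illustration.
FEATURES = ["is_black", "has_tail", "has_slit_pupils"]


def train_stump(rows):
    """Return the feature that agrees with the label on the most training rows.

    Ties go to the feature listed first, so the learner has no way to prefer
    the "right" feature when the data cannot tell two features apart.
    """
    best_feature, best_accuracy = None, -1.0
    for feature in FEATURES:
        correct = sum(1 for features, label in rows if features[feature] == label)
        accuracy = correct / len(rows)
        if accuracy > best_accuracy:
            best_feature, best_accuracy = feature, accuracy
    return best_feature


def predict(stump, features):
    """Predict 'cat' exactly when the chosen feature is True."""
    return features[stump]


def photo(is_black, has_tail, has_slit_pupils):
    return {"is_black": is_black, "has_tail": has_tail, "has_slit_pupils": has_slit_pupils}


# Biased training set: every cat is black, every non-cat is not.
biased = [
    (photo(True, True, True), True),     # black cat
    (photo(True, True, True), True),     # another black cat
    (photo(False, True, False), False),  # white dog
    (photo(False, False, False), False), # white mug
]

# More representative set: adds a white cat and a black dog.
representative = biased + [
    (photo(False, True, True), True),    # white cat
    (photo(True, True, False), False),   # black dog
]

white_cat = photo(False, True, True)

biased_stump = train_stump(biased)
fixed_stump = train_stump(representative)
print(biased_stump, predict(biased_stump, white_cat))  # is_black False
print(fixed_stump, predict(fixed_stump, white_cat))    # has_slit_pupils True
```

On the biased set, both `is_black` and `has_slit_pupils` fit the training data perfectly, so nothing in the data stops the learner from settling on color. Adding a white cat and a black dog removes that ambiguity, which is the sense in which more (representative) data prevents false dependencies.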

Ways to launch products without big data

However, let’s take a closer look. Why do we need data? So that the algorithm can find a dependency in it. Why do we need a lot of data? So that it finds the correct dependency. How can we reduce the amount of data? By prompting the algorithm with the correct dependencies.

Lightweight algorithms

One option is to use lightweight algorithms. Such algorithms cannot find complex dependencies and, accordingly, are less prone to overfitting. The difficulty with such algorithms is that they require the developer to preprocess the data and look for patterns on their own.

For example, assume you want to predict a store’s daily sales, and your data is the address of the store, the date, and a list of all purchases for that date. A feature that will make the task easier is a day-off indicator. If the date is a weekend or a holiday, customers will probably make more purchases, and revenue will increase.

Manipulating the data in this way is called feature engineering. This approach works well in problems where such features are easy to create based on common sense.
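As a minimal sketch of the day-off feature described above, assuming a hypothetical record layout (`address`, `date`, `purchases` as a list of amounts) and a made-up one-entry holiday calendar:

```python
from datetime import date

# Hypothetical holiday calendar; a real project would load one for the store's region.
HOLIDAYS = {date(2023, 7, 4)}


def is_day_off(day, holidays=HOLIDAYS):
    """Return True for Saturdays, Sundays and listed holidays."""
    return day.weekday() >= 5 or day in holidays


def build_example(record):
    """Turn one raw daily record into (features, target) for a simple model."""
    day = record["date"]
    features = {
        "day_of_week": day.weekday(),        # 0 = Monday, 6 = Sunday
        "is_day_off": int(is_day_off(day)),  # the engineered feature
    }
    target = sum(record["purchases"])        # daily sales to predict
    return features, target


# Made-up records: store address, date, and purchase amounts for that date.
records = [
    {"address": "1 Main St", "date": date(2023, 4, 5), "purchases": [12.5, 7.0]},
    {"address": "1 Main St", "date": date(2023, 4, 8), "purchases": [20.0, 15.0, 9.5]},
    {"address": "1 Main St", "date": date(2023, 7, 4), "purchases": [30.0, 22.0]},
]

examples = [build_example(r) for r in records]
for features, target in examples:
    print(features, target)
```

A lightweight model such as a linear regression could then be fit on `is_day_off` and `day_of_week` directly, instead of being left to discover the weekly and holiday pattern from raw dates.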

However, in some tasks, such as working with images, everything is more difficult. This is where deep learning neural networks come in. Because they are high-capacity algorithms, they can find non-trivial dependencies where a person simply could not make sense of the data. Almost all recent advances in computer vision are credited to neural networks. Such algorithms do typically require a lot of data, but they can also be prompted.

Searching the public domain

The first way to do this is by fine-tuning pre-trained models. There are many already-trained neural networks in the public domain. While there may not be one trained for your specific task, there is likely one from a similar area.

These networks have already learned some basic understanding of the world; they just need to be nudged in the right direction. Thus, only a small amount of data is needed. Here we can draw an analogy with people: A person who can skateboard will be able to pick up longboarding with much less guidance than someone who has never even stood on a skateboard before.

In some cases, the problem is not with the number of objects, but the number of labeled ones. Sometimes, collecting data is easy, but labeling is very difficult. For example, when the markup is science-intensive, such as when classifying body cells, the few people who are qualified to label this data are expensive to hire.

Even if there is no similar task available in the open-source world, it is still possible to come up with a task for pre-training that does not require labeling. One such example is training an autoencoder, which is a neural network that compresses objects (similar to a .zip archiver) and then decompresses them.

For effective compression, it only needs to find some general patterns in the data, which means we can use this pre-trained network for fine-tuning.

Active learning

Another approach to improving models when unlabeled data is available is called active learning. The essence of this concept is that the neural network itself suggests which examples need to be labeled and which examples are labeled incorrectly. Along with its answer, the algorithm often reports its confidence in the result. Accordingly, we can run an intermediate version of the model on unlabeled data, look for the examples where the output is uncertain, give them to people for labeling, and, after labeling, train again.
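One simple way to pick the uncertain examples is uncertainty sampling. The sketch below assumes a binary cat/not-cat model that outputs a probability per image. The image names, the probabilities and the labeling budget are all invented for illustration, and the scoring rule is only one of several reasonable choices.

```python
def uncertainty(prob_cat):
    """Score from 0.0 (model is certain) to 1.0 (model is undecided at 0.5)."""
    return 1.0 - abs(prob_cat - 0.5) * 2.0


def select_for_labeling(predictions, budget):
    """Return the names of the `budget` examples the model is least sure about."""
    ranked = sorted(predictions, key=lambda item: uncertainty(item[1]), reverse=True)
    return [name for name, _ in ranked[:budget]]


# Hypothetical output of an intermediate model on unlabeled photos:
# (image name, predicted probability that the photo contains a cat).
predictions = [
    ("img_001", 0.98),
    ("img_002", 0.52),
    ("img_003", 0.10),
    ("img_004", 0.45),
    ("img_005", 0.71),
]

to_label = select_for_labeling(predictions, budget=2)
print(to_label)  # ['img_002', 'img_004']
```

The selected images would go to human annotators, be added to the training set, and the model would be retrained before the next round of selection.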

It is important to note that this is not an exhaustive list of possible options; these are just a few of the simplest approaches. And remember that none of these approaches is a panacea. For some tasks, one approach works better; for others, another will yield the best results. The more you try, the better results you will find.

Anton Lebedev is a chief data scientist at Neatsy, Inc.

