Snorkel AI looks beyond data labeling for generative AI




Data labeling has long been a critical part of how data scientists prepare data for machine learning (ML) and artificial intelligence (AI). In the era of generative AI, the role of data labeling is changing.

Today Snorkel AI is announcing new capabilities that extend beyond data labeling to help organizations curate and prepare data for generative AI. Snorkel AI has been developing a platform that helps organizations with the data side of AI. Back in November 2022, the company’s Snorkel Flow technology was updated with features that accelerate the often labor-intensive process of data labeling by using large language models (LLMs) to jump-start the process.

Now Snorkel is going a step further with its new GenFlow service for building generative AI applications and the Snorkel Foundry that helps organizations build customized LLMs.

“How you curate, sample, filter and clean data ends up having a tremendous impact on the resulting foundation model that you get out,” Alex Ratner, CEO and cofounder at Snorkel AI, told VentureBeat in an exclusive interview. “In other words, you can’t just dump in a random mix of garbage data, and expect these models to turn out well.”


Getting Gen AI to work without good data is a hallucination

A common risk with general-purpose generative AI tools is hallucination, where a model produces responses that are not accurate.

“Hallucinations are just another kind of error that is a result of not training the model to do a specific task in the first place,” Ratner said. “These models are trained out of the box to say statistically plausible-sounding things given an input prompt.”

Ratner added that, fundamentally, hallucinations occur because a model has not been trained for a specific task or, more importantly, does not have all the right information it needs to be accurate. One approach that multiple vendors are pursuing is retrieval-augmented generation (RAG), where sources for the generated results are retrieved and cited. But what happens when there are no sources? That’s a data problem, and it’s an issue that Snorkel is looking to solve with Snorkel Foundry.
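To make the "no sources" problem concrete, here is a minimal sketch of the retrieval step in a RAG pipeline. It is not Snorkel's implementation or any vendor's API: the document store, the document IDs, the word-overlap scoring and the function names are all assumptions chosen for illustration, and a production system would typically use embedding-based search instead.

```python
import re

# Hypothetical in-memory repository; IDs and text are made up for illustration.
DOCUMENTS = {
    "policy-001": "Refunds are issued within 30 days of purchase.",
    "policy-002": "Support is available by email on weekdays.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, min_overlap=2):
    """Return (doc_id, text) pairs sharing at least min_overlap words with the question."""
    question_words = tokenize(question)
    scored = []
    for doc_id, text in documents.items():
        overlap = len(question_words & tokenize(text))
        if overlap >= min_overlap:
            scored.append((overlap, doc_id, text))
    # Highest overlap first; ties broken by document ID for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, text) for _, doc_id, text in scored]


def build_prompt(question, documents):
    """Build a prompt that carries citable sources, or None if nothing was retrieved."""
    hits = retrieve(question, documents)
    if not hits:
        # No supporting source exists: the data gap described in the article.
        return None
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using only the sources below and cite their IDs.\n"
        f"{context}\nQuestion: {question}"
    )
```

In this sketch, a question about refunds yields a prompt that includes the `[policy-001]` source, while a question about warranties returns `None`, because nothing in the repository covers it. Retrieval cannot fix that second case; only better data can.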

What Snorkel Foundry does is data curation. Organizations can point the service at a data repository as part of a pre-training phase to help data scientists get the right mix of data to meet business objectives and to reduce bias and the risk of hallucination.

While some of an organization’s data will be structured, such as data held in a database, Ratner expects that the majority will likely be unstructured. Snorkel Foundry enables users to make use of all the unstructured data and also helps them pick the right mix of data to get the best results for an LLM.

Ratner explained that Snorkel Foundry has a data sampling function that enables users to identify data relevance, either heuristically or through a model-driven approach, to help compose the right balance of content to put into an ML training routine.
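The heuristic side of that idea can be sketched in a few lines. This is an illustration only, not Snorkel Foundry's sampling function: the keyword list, the relevance threshold, the target share and the function names are all assumptions. A model-driven variant would swap the `relevance` function for a classifier score.

```python
import random

# Hypothetical keywords standing in for a business-specific relevance heuristic.
KEYWORDS = {"invoice", "claim", "policy"}


def relevance(text, keywords=KEYWORDS):
    """Fraction of the keywords that appear in the text (0.0 to 1.0)."""
    words = set(text.lower().split())
    return len(words & keywords) / len(keywords)


def sample_mix(docs, budget, threshold=0.3, relevant_share=0.8, seed=0):
    """Pick up to `budget` docs, aiming for `relevant_share` of them to be relevant."""
    rng = random.Random(seed)
    relevant = [d for d in docs if relevance(d) >= threshold]
    other = [d for d in docs if relevance(d) < threshold]
    # Cap each group by what is actually available.
    n_relevant = min(len(relevant), round(budget * relevant_share))
    n_other = min(len(other), budget - n_relevant)
    return rng.sample(relevant, n_relevant) + rng.sample(other, n_other)
```

Calling `sample_mix(docs, budget=4, relevant_share=0.75)` on a small corpus would return three documents that pass the heuristic and one that does not, assuming enough of each exist. Keeping a share of lower-scoring documents is a design choice in this sketch, since a mix drawn only from keyword matches can end up too narrow.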

“Most enterprises don’t have perfectly curated data,” Ratner said. “So we’re helping them do that programmatically, so they can, you know, organize, and curate and optimize the mixture of data.”

Beyond Data Labeling with GenFlow

After pre-training an LLM, a common next step is additional instruction tuning, with common approaches including reinforcement learning from human feedback (RLHF).

“Once you pre-train the model on a big unlabeled corpus of data you get to teach it or fine-tune it to make better summaries or answer questions and have better dialogue,” Ratner said.

With Snorkel Flow for non-generative AI use cases, Ratner said his company helps to classify data with tags so that it is properly labeled. For generative AI outputs, that type of labeling is not what’s needed, which is where the new GenFlow service fits in.

GenFlow provides tooling and management capabilities for collecting feedback and filtering out poor-quality data points, with the goal of helping generative AI produce better output.

Why Data Labeling Isn’t Dead

For all the hype around generative AI in recent months, Ratner argued that in the long run he expects most of the enterprise value from AI to come from more traditional predictive AI.

Ratner emphasized that data labeling remains important for predictive AI tasks, such as classifying fraud. Fundamentally, data labeling is a type of feedback that is given to help improve a model.

With generative AI there is still a need for feedback, but it takes a different form than it does for predictive AI. Rather than labeling something as one type or another, the feedback is more that an individual prefers one summary or response to another.
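The difference between the two kinds of feedback shows up in the shape of the data. The sketch below is illustrative only: the record types, field names and the `win_rates` helper are assumptions rather than GenFlow's data model, and the tally is a simple summary statistic, not an RLHF reward model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LabeledExample:
    """Predictive AI feedback: one class label per input."""
    text: str
    label: str


@dataclass
class PreferencePair:
    """Generative AI feedback: which of two responses a reviewer preferred."""
    prompt: str
    chosen: str
    rejected: str


def win_rates(pairs):
    """Share of comparisons each response won, over all pairs it appeared in."""
    wins, seen = Counter(), Counter()
    for pair in pairs:
        wins[pair.chosen] += 1
        seen[pair.chosen] += 1
        seen[pair.rejected] += 1
    return {response: wins[response] / seen[response] for response in seen}
```

A fraud classifier would be trained on records like `LabeledExample("wire transfer at 3 a.m.", "fraud")`, while a summarization model would be tuned on records like `PreferencePair(prompt, chosen=summary_a, rejected=summary_b)`. Tallying win rates across many pairs is one simple way to spot responses that reviewers consistently reject.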

“As you go through that process of assembling, curating and developing over time, this feedback, whether it’s labels or long-form answers ratings, we’re trying to make that more programmatic, accelerated and better managed,” he said.

