Join top executives in San Francisco on July 11-12 to hear how leaders are integrating and optimizing AI investments for success.
Cloudera got its start in the Big Data era and is now moving quickly into the era of Big AI with large language models (LLMs).
Today, Cloudera announced its strategy and tools for helping enterprises integrate the power of LLMs and generative AI into the company’s Cloudera Data Platform (CDP). Cloudera’s platform provides an open data lakehouse model that enables organizations to run data analytics operations on top of data lake storage.
With LLM integration, Cloudera is making it easier for enterprises to build AI applications directly on open-source LLMs from Hugging Face and open-source vector databases. Alongside the LLM integration, Cloudera also announced the general availability of its observability platform, which helps organizations monitor data workloads running on CDP.
“You can now take advantage of this new way of processing data and getting real-time insights at a scale that has never been possible before,” Ram Venkatesh, CTO of Cloudera, told VentureBeat. “All the hype aside, I have been a SQL guy for a long time, but I can tell you that the ability to analyze all your data, especially the unstructured and semi-structured pieces, we have never had it better than what is promised … with LLMs today.”
How Cloudera is bringing LLMs to its data lakehouse
Cloudera is not building its own LLMs; rather, it is making it easier for enterprises to use LLMs to gain insights from data that organizations already have in a data lakehouse.
Cloudera already has a catalog of reference architectures that it provides to its users; existing use cases have included AI models for customer churn and fraud analytics. Now the company is expanding with architectures for conversational AI and LLMs. Venkatesh explained that CDP users can select the new LLM reference architecture from the catalog and have it installed in their environment in a few minutes.
The training approach that Cloudera is embracing is what is known as a zero-shot learning model, where an existing LLM can quickly benefit from an existing data source. The initial set of LLMs that Cloudera is integrating with are open-source models that can run entirely inside the Cloudera platform. Venkatesh noted that by running the LLM in the same platform as the data, organizations can ensure that no data ever leaves the enterprise’s purview and no external API calls are being made. He emphasized that keeping data under tight control is critical for some enterprises.
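Cloudera has not published implementation code for this flow. Purely as an illustrative sketch of the zero-shot pattern Venkatesh describes (all function names here are hypothetical), an application can ground an off-the-shelf model in lakehouse documents at prompt time, with no fine-tuning and no external API calls:

```python
# Hypothetical sketch: ground an off-the-shelf LLM in enterprise documents
# at prompt time (zero-shot), so no fine-tuning is needed and no data
# leaves the platform through external API calls.

def build_zero_shot_prompt(context_docs: list[str], question: str) -> str:
    """Assemble a prompt that restricts the model to lakehouse context."""
    context = "\n---\n".join(context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def run_local_llm(prompt: str) -> str:
    # Stand-in for in-platform inference with an open-source model
    # (e.g. a locally hosted Hugging Face checkpoint); stubbed here.
    return "(model output)"

docs = ["Q2 churn rose 3% in EMEA.", "Support ticket volume doubled in May."]
prompt = build_zero_shot_prompt(docs, "What happened to churn in Q2?")
answer = run_local_llm(prompt)
```

The key point is that the model is used as-is: the enterprise data enters only through the prompt, inside the platform boundary.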
The intersection between vector databases and Cloudera’s data lakehouse platform
Part of the Cloudera LLM reference architecture is the integration of open-source vector databases into the stack.
Venkatesh said that Cloudera is enabling its users to choose which open-source vector database to use. Among the options are Milvus, Weaviate and Qdrant.
Data lakehouse technology relies on object storage, which Venkatesh said is often a great fit for unstructured and semi-structured data. To put that data to work with AI, however, it needs to be organized in a vector database.
“You really need a database engine that can take a semantic search query, run it in vector space, and return the most relevant results back to you,” he said.
Venkatesh emphasized that creating a vector database for an LLM deployment with Cloudera does not mean enterprises are duplicating data, with one set in the lakehouse and another in the vector database. Rather, the vector database provides a functional index of that data as vectors.
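That “functional index” idea can be sketched minimally like this (a toy bag-of-letters embedding and hypothetical paths stand in for a real embedding model and a vector database such as Milvus, Weaviate or Qdrant): the index holds only vectors plus pointers back to objects in the lakehouse, and a semantic search returns references, not copies of the data.

```python
import math

# Toy "vector index": each entry is (object_path, vector). The raw data
# stays in lakehouse object storage; the index only points back to it.

def embed(text: str) -> list[float]:
    # Crude bag-of-letters embedding, purely for illustration.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

index = [
    ("s3://lakehouse/docs/churn_report.txt", embed("customer churn analysis")),
    ("s3://lakehouse/docs/fraud_notes.txt", embed("fraud detection notes")),
]

def semantic_search(query: str, k: int = 1) -> list[str]:
    """Return paths of the k most similar objects; the data is not copied."""
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, e[1]), reverse=True)
    return [path for path, _ in ranked[:k]]
```

A real deployment would swap in model-generated embeddings and the database's own index, but the shape of the answer is the same: a ranked list of pointers into the lakehouse.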
How LLMs are the logical path forward from Big Data
When Cloudera got started in 2008, Big Data, in the form of the open-source Hadoop project, was the company’s foundation.
The Big Data market has shifted over the years into the data lakehouse space, where organizations use query engines, typically SQL-based, for data analytics on data stored in cloud object storage repositories. Venkatesh now sees LLMs as the next logical step on the path forward from Big Data.
“A bunch of us came to work in Big Data, not because we were all excited about SQL, but to look at fundamentally different ways to analyze data,” Venkatesh said.
He explained that Big Data created a pyramid-like approach to data analytics, where the Big Data resides at the bottom and only a small amount of it can be analyzed at the top. With LLMs, that pyramid has flattened out: significantly more data is available for analysis, through easier methods.
“What I see with LLMs and the new wave of AI is an era where you can now analyze all the data at the topmost layer and instead of querying with just SQL or Spark, it’s English or natural language queries,” Venkatesh said. “You only need to ingest the data once and you can get the benefits of that ingestion from a vectorized embedding multiple times, so all of your queries can take advantage of the semantic store.”
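The “ingest once, query many times” flow Venkatesh describes can be sketched roughly as follows (toy bag-of-letters embeddings and hypothetical names; a real system would use an embedding model plus a vector database): documents are embedded a single time at ingestion, and every later natural-language query reuses those stored vectors.

```python
import math

# Toy sketch: documents are embedded once at ingestion and every later
# query reuses the stored vectors. embed() stands in for a real
# embedding model and counts how often it actually runs.

EMBED_CALLS = {"count": 0}

def embed(text: str) -> list[float]:
    EMBED_CALLS["count"] += 1
    vec = [0.0] * 26  # crude bag-of-letters vector, for illustration only
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticStore:
    def __init__(self) -> None:
        self.entries: list[tuple[str, list[float]]] = []

    def ingest(self, docs: list[str]) -> None:
        # One embedding pass per document, paid once at ingestion time.
        self.entries = [(doc, embed(doc)) for doc in docs]

    def query(self, question: str) -> str:
        # Only the question is embedded; document vectors are reused.
        q = embed(question)
        return max(self.entries, key=lambda e: cosine(q, e[1]))[0]

store = SemanticStore()
store.ingest(["quarterly churn report", "fraud investigation summary"])
hit1 = store.query("churn")   # reuses the stored document embeddings
hit2 = store.query("fraud")
# 2 ingestion embeddings + 2 query embeddings = 4 embed() calls total.
```

The embedding cost is paid once per document, not once per query, which is what lets every subsequent query take advantage of the same semantic store.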