Databricks releases Dolly 2.0, the first open, instruction-following LLM for commercial use


Apr 12, 2023

Today Databricks released Dolly 2.0, the next version of the large language model (LLM) with ChatGPT-like human interactivity (aka instruction-following) that the company released just two weeks ago.  

The company says Dolly 2.0 is the first open source, instruction-following LLM fine-tuned on a transparent, freely available dataset that is itself open-sourced for commercial use. That means Dolly 2.0 can be used in commercial applications without paying for API access or sharing data with third parties.

According to Databricks CEO Ali Ghodsi, while there are other LLMs out there that can be used for commercial purposes, “They won’t talk to you like Dolly 2.0.” And, he explained, users can modify and improve the training data because it is made freely available under an open source license. “So you can make your own version of Dolly,” he said. 

Databricks released the dataset Dolly 2.0 was trained on

In addition, Databricks said that as part of its ongoing commitment to open source, it is also releasing the dataset on which Dolly 2.0 was trained, called databricks-dolly-15k. This is a corpus of more than 15,000 records generated by thousands of Databricks employees, and Databricks says it is the “first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT.”


There has been a wave of instruction-following, ChatGPT-like LLM releases over the past two months that are considered open source by many definitions (or offer some level of openness or gated access), including Meta’s LLaMA, which in turn inspired others like Alpaca, Koala, Vicuna and Databricks’ Dolly 1.0.

Many of these “open” models, however, were under “industrial capture,” said Ghodsi, because they were trained on datasets whose terms purport to limit commercial use, such as a 52,000-question-and-answer dataset from the Stanford Alpaca project that was generated from the output of OpenAI’s ChatGPT. But OpenAI’s terms of use, he explained, include a rule that you can’t use output from its services to compete with OpenAI.

Databricks, however, figured out how to get around this issue: Dolly 2.0 is a 12B-parameter language model based on the open source EleutherAI Pythia model family and fine-tuned exclusively on a small, open source corpus of instruction records (databricks-dolly-15k) generated by Databricks employees. This dataset’s licensing terms allow it to be used, modified and extended for any purpose, including academic or commercial applications.
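
To make the idea of an "instruction record" concrete, here is a minimal Python sketch of how one such record might be turned into a single training string for fine-tuning. The field names (instruction, context, response) follow the published databricks-dolly-15k dataset as best understood here; the InstructionRecord class, the format_record helper and the "###" template are hypothetical illustrations, not the format Databricks actually used.

```python
from dataclasses import dataclass


@dataclass
class InstructionRecord:
    """One instruction/response pair (hypothetical container for illustration)."""
    instruction: str
    response: str
    context: str = ""  # optional reference passage; assumed empty when absent


def format_record(record: InstructionRecord) -> str:
    """Render a record as one training string using an assumed template."""
    parts = ["### Instruction:", record.instruction.strip()]
    # Only include the context section when the record actually has one.
    if record.context.strip():
        parts += ["", "### Context:", record.context.strip()]
    parts += ["", "### Response:", record.response.strip()]
    return "\n".join(parts)


if __name__ == "__main__":
    example = InstructionRecord(
        instruction="Name the dataset Dolly 2.0 was fine-tuned on.",
        response="databricks-dolly-15k",
    )
    print(format_record(example))
```

Because the records are plain text fields under a permissive license, anyone could swap in a different template or add records of their own before fine-tuning, which is the kind of modification Ghodsi describes.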

Models trained on ChatGPT output have, up until now, been in a legal gray area. “The whole community has been tiptoeing around this and everybody’s releasing these models, but none of them could be used commercially,” said Ghodsi. “So that’s why we’re super excited.”

Dolly 2.0 is small but mighty

A Databricks blog post emphasized that, like the original Dolly, the 2.0 version is not state of the art, but “exhibits a surprisingly capable level of instruction-following behavior given the size of the training corpus,” adding that the level of effort and expense necessary to build powerful AI technologies is “orders of magnitude less than previously imagined.”

“Everyone else wants to go bigger, but we’re actually interested in smaller,” Ghodsi said of Dolly’s diminutive size. “Second, it’s high quality. We looked over all the answers.”

Ghodsi added that he believes Dolly 2.0 will start a “snowball” effect, in which others in the AI community can join in and come up with other alternatives. The limit on commercial use, he explained, was a big obstacle to overcome: “We’re excited now that we finally found a way around it. I promise you’re going to see people applying the 15,000 questions to every model that exists out there, and they’re going to see how many of these models suddenly become kind of magical, where you can interact with them.”
