[ISSUE #6] I want to use Large Language Models! But how?
Did you know that you can fine-tune your own large language model for as little as $8?
In the past week, I gathered around six interesting articles and papers but scrapped EVERYTHING to answer the most frequently asked questions about LLMs. While scouring the web and gathering those articles, I realized there were some missing links in my head, and I couldn’t confidently answer some of the questions I got from my colleagues, friends, and you - the readers!
So I spent the majority of yesterday clarifying the following questions -
What does the Large Language Model (LLM) market look like?
Why doesn’t everyone use the OpenAI API? Or train their own “GPT-4”?
Are open-source language models possible and currently out there?
What is the tradeoff between “using more data” and “increasing the model size”?
What are some of the open-source LLMs available?
How expensive is it to train/fine-tune an open-source LLM on my private data?
How can I deploy the available open-source models OR my fine-tuned models?
So this article is a bit different from the usual newsletter. But I promise we’ll tune back in next week! I hope you find it useful - let me know in the comments if you have any further questions, or shoot me a DM 😄
What does the Large Language Model (LLM) market look like?
Large companies like OpenAI spend millions of dollars training large language models with billions of parameters using thousands of GPUs (ChatGPT was reportedly trained on around 10,000 NVIDIA GPUs). That is why they are increasingly unwilling to share these models, the underlying datasets, and the training details.
Additionally, the competition to dominate the generative AI market incentivizes companies to keep their technology secret. For a while, black-box APIs (like the OpenAI API) became popular mostly because there was no alternative.
Why doesn’t everyone use the OpenAI API? Or train their own “GPT-4”?
Organizations in most verticals, such as finance, health, and insurance, have very strict privacy policies that prevent them from using OpenAI’s black-box API.
Moreover, for competitive reasons, OpenAI has released very few details about the actual training, model size, and underlying costs in their technical report.
Considering GPT-3 had 175 billion parameters, we can assume GPT-4 has even more, which makes training such a model from scratch a logistical nightmare, especially for small and mid-sized companies.
Are open-source language models possible and currently out there?
YES!
In the past few months, there has been a wave of open-source LLMs! The main catalyst for this boost was a paper by researchers at DeepMind, who showed that the performance of a language model can be improved by training a smaller model on more high-quality data instead of simply increasing the model size.
So you can run your OWN ChatGPT-like chatbot without handing your data over to OpenAI.
What is the tradeoff between “using more data” and “increasing the model size”?
The smaller the size of the model, the faster the inference.
For reference, GPT-3 had 175 billion parameters and was trained on 300 billion tokens (very roughly, text is broken into word-like pieces, called tokens, before being fed to the model for training). That works out to roughly 1.7 tokens per parameter.
On the other hand, Chinchilla, the model introduced by DeepMind, had a size of 70 billion parameters and the model was trained on 1.4 trillion tokens. That amounts to 20 tokens per parameter.
This allows Chinchilla to be fine-tuned for downstream tasks and run in a cost-efficient manner. Additionally, it gives better performance than GPT-3 while being 2.5x smaller in size.
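To make the tradeoff concrete, here is the back-of-the-envelope arithmetic behind those tokens-per-parameter numbers, using the published parameter and token counts quoted above:

```python
# Back-of-the-envelope: tokens seen per parameter for GPT-3 vs. Chinchilla.
models = {
    "GPT-3":      {"params": 175e9, "tokens": 300e9},   # 175B parameters, 300B tokens
    "Chinchilla": {"params": 70e9,  "tokens": 1.4e12},  # 70B parameters, 1.4T tokens
}

for name, m in models.items():
    ratio = m["tokens"] / m["params"]
    print(f"{name}: {ratio:.1f} tokens per parameter")

# Prints:
# GPT-3: 1.7 tokens per parameter
# Chinchilla: 20.0 tokens per parameter
```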
What are some of the open-source LLMs available?
Hugging Face maintains a leaderboard of open-source large language models here, including the four popular benchmarks against which the models are evaluated.
There are two types of models - pre-trained and fine-tuned.
pre-trained - These are the foundational models that have been trained from scratch on a particular dataset (e.g. open-source language-modelling datasets like the Pile)
fine-tuned - These models are built on top of the foundational models, i.e. the foundational models are further fine-tuned on
either the output from popular, accurate models like ChatGPT (ShareGPT) or
domain-specific datasets
I have compiled a list of the ones I found referenced most frequently and that have genuinely made a dent in how people use LLMs (a quick sketch of loading one of them follows the list) -
Falcon by Technology Innovation Institute (pre-trained)
StableLM by Stability AI (pre-trained)
MosaicPretrainedTransformer (MPT) by MosaicML (pre-trained)
RedPajama by Together Computer (pre-trained)
Dolly by Databricks (fine-tuned)
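If you just want to poke at one of these models, a minimal text-generation sketch with the Hugging Face transformers library looks something like the example below. The tiiuae/falcon-7b-instruct checkpoint, the dtype, and the prompt are my own example choices, not anything prescribed by the model authors, and depending on your transformers version some checkpoints may also need trust_remote_code=True:

```python
# Minimal text-generation sketch with Hugging Face transformers.
# Assumes `pip install transformers accelerate torch` and a GPU with enough memory
# for the checkpoint you pick (tiiuae/falcon-7b-instruct is just an example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"  # swap in any checkpoint from the list above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision so it fits on a single modern GPU
    device_map="auto",           # let accelerate place the layers for you
)

prompt = "Explain what a large language model is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```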
How expensive is it to train/fine-tune an open-source LLM on my private data?
Enterprises that have proprietary data can now efficiently fine-tune one of these foundational models at a very low cost, even on consumer-grade GPUs!
According to another open-source model listicle, most models can be fine-tuned for less than $100. In fact, Cabrita, a Portuguese LLM, was fine-tuned from LLaMA for just $8 using ONE A100 GPU on Google Colab!!
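Runs this cheap typically rely on parameter-efficient fine-tuning such as LoRA, where only small adapter matrices are trained on top of the frozen base model instead of updating all of its weights. Here is a rough sketch of that style of fine-tuning with the Hugging Face peft library; the base checkpoint, hyperparameters, and the tiny in-line dataset are stand-ins for your own choices and private data:

```python
# Parameter-efficient fine-tuning (LoRA) sketch with Hugging Face peft + transformers.
# Assumes `pip install transformers peft datasets accelerate` and a GPU with enough
# memory for the base checkpoint; the in-line dataset is a stand-in for your private data.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "tiiuae/falcon-7b"  # or any foundational model from the list above
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Falcon's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

# Wrap the frozen base model with small trainable LoRA adapters; only the adapters are
# updated, which is what makes single-GPU, sub-$100 fine-tuning runs feasible.
lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon's fused attention projection (model-specific)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # typically well under 1% of the base model's weights

# Stand-in for your proprietary data: a handful of instruction/response strings.
texts = [
    "### Instruction: Summarise our refund policy.\n### Response: Refunds are issued within 14 days.",
    "### Instruction: Who do I contact for IT support?\n### Response: Email the internal helpdesk.",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-finetune", per_device_train_batch_size=2,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # pads and sets labels
)
trainer.train()
model.save_pretrained("lora-finetune/adapter")  # only the small adapter weights are saved
```

Because only the adapters are trained, the saved artifact is tiny compared to the base model and can simply be loaded back on top of it at inference time.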
How can I deploy the available open-source models OR my fine-tuned models?
The Large Model Systems Organization (LMSYS Org) has built FastChat, a platform for training, serving, and evaluating large language models.
In their article, they demo how a fine-tuned model - Vicuna-13B (also fine-tuned by LMSYS Org from the original LLaMA model) - can be used for two applications (a minimal client sketch follows the list) -
Question-answering on your private documents
Explaining code in the wild
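Beyond those demos, FastChat can also expose a deployed model through an OpenAI-compatible REST API, so your own application only needs to make an HTTP call. Below is a minimal client sketch, assuming you already have the FastChat API server running on localhost:8000 with a Vicuna checkpoint loaded (see the FastChat docs for the exact launch commands; the model name below is whatever name your worker registered):

```python
# Minimal client sketch for a model served behind FastChat's OpenAI-compatible API.
# Assumes the FastChat API server is already running on localhost:8000 and that a
# Vicuna checkpoint has been registered under the model name used below.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "vicuna-13b",  # use whatever model name your FastChat worker registered
        "messages": [
            {"role": "user", "content": "Summarise the key obligations in this contract: ..."},
        ],
        "temperature": 0.7,
        "max_tokens": 256,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```

Because the endpoint mirrors the OpenAI API shape, you can also point an existing OpenAI client library at this base URL instead of calling it with requests directly.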
I scroll through endless articles, Reddit threads, and Twitter posts (so you don’t have to) and deliver the cream of the crop to you! If you would like a copy of this issue in your mailbox the next time I write, consider subscribing 🤗 Thanks for taking the time to read!