[ISSUE #7] LLM Data Sources, LlamaIndex, GPT4All, Google's Imagen Editor
Towards using language models privately🤫
This week I got my hands dirty with the ongoing LLM hype. I realized that I need another outlet to showcase my experiments and go a little deeper into the technicalities, so I’m prepping my blog to house my deep dives. Let me know what topic/subject you would like me to take a plunge into. And as always, this newsletter will be home to the latest happenings and the whats and whys of AI.
In this issue -
📜 LLM Data Sources: How the hell do language models know what they know (and don’t know)?
🦙 LlamaIndex: A way to give language models context to your own private documents!
✍🏽 GPT4All: Software that runs LLMs right on your PC/MacBook (without giving away your private data)!
🖌️ Imagen Editor: Google’s work on allowing accurate in-painting.
LLM Data Sources📜
Were you ever curious about what they mean by “trained on the whole internet”? How does ChatGPT have an answer to the most obscure question you throw at it? This tweet thread lists the major sources of data used by large language models.
Common Crawl
Common Crawl (CC) is a non-profit organization that enables open access to information by providing free, high-quality crawl data. The Common Crawl corpus contains petabytes of data collected since 2008. It contains -
raw web page data
extracted metadata
text extractions
So 90% of it is boilerplate (HTML, CSS) and other gibberish. But you can find the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl prepared by AllenAI, hosted on HuggingFace. You can download the original CC data over here. They have also provided some examples of how people have used the data so far. The data can be accessed via AWS S3 buckets or via HTTP.
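If you just want to peek at C4 without downloading terabytes, you can stream it. A minimal sketch using HuggingFace's datasets library (assumes the package is installed; "allenai/c4" is the dataset name AllenAI publishes):

```python
# A minimal sketch: stream a few C4 documents without downloading the
# full corpus. Requires the `datasets` library.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, doc in enumerate(c4):
    print(doc["url"])
    print(doc["text"][:200], "...")
    if i == 2:  # just peek at the first three documents
        break
```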
WebText
OpenAI needed around 40GB of high-quality text to train GPT-2, so they went beyond CC in search of higher-quality curation for a more modern language model. They collected WebText using the following recipe -
Scraped URLs from Reddit submissions with a score of 3 or higher (up to Dec 2017)
Deduplicated scraped content based on URL
Excluded Wikipedia since they already had a separate Wikipedia dataset
Removed non-English web pages and further deduplicated the content with an undisclosed “heuristic-based cleaning”.
Unfortunately, they did not open-source the resulting corpus or the code that generated it. This use of Reddit data allegedly contributed to Reddit starting to charge for its APIs, since the company wants to get paid for helping teach big AI systems.
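Since the generation code was never released, here is a rough, hypothetical sketch of what that URL filtering might look like (the submission record format and field names are my assumptions, not OpenAI's actual code):

```python
# A hypothetical reconstruction of WebText's URL filtering, following
# the recipe above. `submissions` is assumed to be an iterable of dicts
# with "url" and "score" keys (e.g. parsed from a Reddit data dump).
def collect_webtext_urls(submissions, min_score=3):
    seen = set()
    for sub in submissions:
        url = sub["url"]
        if sub["score"] < min_score:
            continue  # keep only submissions with score >= 3
        if "wikipedia.org" in url:
            continue  # Wikipedia was excluded (covered by a separate dataset)
        if url not in seen:  # deduplicate by URL
            seen.add(url)
            yield url
```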
Books
OpenAI has mentioned Books2, and it remains a mystery how that data was collected (maybe all of libgen?). The open-source counterparts are BookCorpusOpen (Books1) and Books3 (part of The Pile), both by Shawn Presser. Books3 consists of around 197,000 books processed in the exact same way as BookCorpusOpen and amounts to 100GB in size.
Finally, many open-source language models also make use of Wikipedia, arXiv, and Stack Exchange datasets.
LlamaIndex🦙
No! It’s not related to OpenLLaMA, the open reproduction of Meta's (Facebook's) LLaMA language model that was open-sourced last week.
LlamaIndex is a data framework that lets you connect your own data sources to a language model for downstream use cases such as question answering or summarization. It supports not only PDFs and plain text but a rapidly growing number of data sources, including Notion, SQL databases, Confluence, Asana, and many more at Llama Hub.
LlamaIndex allows you to -
structure your data (graphs, indices)
connect to data sources (Llama Hub)
run advanced retrieval/queries over your custom data
integrate with other apps (ChatGPT, LangChain, etc.)
For example, using LlamaIndex, I could connect OpenAI’s LLM to a personal source of information, i.e. a PDF of my resume, so that it could answer questions about me.
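Here is a minimal sketch of such a script (not my exact code), assuming the llama-index package is installed and an OpenAI API key is set in the environment; the file name and question are placeholders:

```python
# A minimal sketch: index a local PDF with LlamaIndex and query it
# (OpenAI's LLM is used behind the scenes by default).
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# "resume.pdf" is a hypothetical local file
documents = SimpleDirectoryReader(input_files=["resume.pdf"]).load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What programming languages do I know?"))
```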
I can now talk to and query my custom data source, and it only required 21 lines of code! However, in this process my data is read by OpenAI’s servers, which most people would want to avoid. The next item provides a very convenient and fast solution to this problem.
GPT4All (Nomic AI)✍🏽
GPT4All is an open-source software ecosystem that allows anyone to train and deploy powerful, customized large language models on everyday hardware.
They have released a chat client which runs on macOS/Windows/Linux using one of the local large language models. This means you can literally chat with AI agents as powerful as OpenAI’s ChatGPT right on your MacBook/PC, without an internet connection and with total privacy!
This is achieved by quantizing open-source language models, i.e. compressing a model’s 16-bit weights down to 4-bit precision, making it small enough to run on consumer-grade laptops. The GPT4All software is optimized to run inference on 7-13 billion parameter language models on the CPUs of everyday computing devices!
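To build intuition for what 4-bit quantization does, here is a toy illustration (my own sketch, not GPT4All's actual scheme) of blockwise quantization, where each block of weights is mapped to 16 integer levels plus one float scale:

```python
# A toy illustration of 4-bit blockwise quantization (not GPT4All's
# real code): each block of weights keeps one float scale, and every
# weight is rounded to a signed 4-bit integer in [-8, 7].
import numpy as np

def quantize_4bit(weights, block_size=32):
    """Quantize a 1-D float array to 4-bit codes, one scale per block."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_4bit(codes, scales):
    """Recover approximate float weights from codes and scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(64).astype(np.float32)
codes, scales = quantize_4bit(w)
w_hat = dequantize_4bit(codes, scales)
print("max abs error:", np.abs(w - w_hat).max())
```

The payoff: 4 bits per weight instead of 16, at the cost of a small reconstruction error, which is why a 7B-parameter model suddenly fits in laptop RAM.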
The ecosystem currently supports various versions of the following three model architectures -
GPT-J (EleutherAI)
LLaMA (Facebook)
MPT (MosaicML)
Additionally, you can side-load any custom model (based on one of the above architectures) into the chat client. Furthermore, the LocalDocs chat plugin now allows you to use your own local documents to answer your questions! You can essentially chat with your own data without the private data ever leaving your computer. It can even cite the sources within your documents!
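Beyond the chat client, there are Python bindings. A minimal sketch, assuming the gpt4all package is installed (the model name below is one example from the model list; any supported model works):

```python
# A minimal sketch using the gpt4all Python bindings. The model file
# is downloaded on first use; inference then runs fully on the local
# CPU, so no data leaves your machine.
from gpt4all import GPT4All

model = GPT4All("ggml-gpt4all-j-v1.3-groovy")  # example model name
response = model.generate("Summarize what quantization is.", max_tokens=100)
print(response)
```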
Imagen Editor and EditBench (Google)🖌️
Are you looking to tweak objects in your vacation photos quickly and without effort? You can do that easily with text-guided image editing (TGIE).
The surge of research in text-to-image generation has proven to be a catalyst for TGIE. Instead of completely redoing an image, TGIE provides a quick, automated, and controllable way of editing both generated and photographed visuals.
When a user provides text instructions along with a mask (indicating the area to modify), the model fills the masked area with relevant, in-context content to complete the image.
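Imagen Editor itself isn't publicly available, but the same TGIE interface (image + mask + text prompt) can be tried with an open-source inpainting model. A sketch using HuggingFace diffusers with Stable Diffusion as a stand-in (the file names are placeholders; a GPU is assumed):

```python
# Text-guided inpainting with an open-source stand-in for Imagen
# Editor: the pipeline repaints only the white region of the mask,
# guided by the text prompt.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("vacation.png").convert("RGB")  # hypothetical photo
mask = Image.open("mask.png").convert("RGB")       # white = area to edit

result = pipe(prompt="a red beach umbrella", image=image, mask_image=mask)
result.images[0].save("edited.png")
```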
Innovative technique
Imagen Editor is a diffusion-based model fine-tuned from Google’s Imagen text-to-image model.
Unlike previous in-painting techniques, Imagen Editor employs an object-detector masking policy, i.e. the training masks are produced by an object-detection module.
This improves alignment between the edit text prompts and the masked regions, and addresses the problem of text prompts being disregarded when masks are small or only partially cover an object.
Evaluation
The authors also provide a detailed evaluation benchmark called EditBench, which measures performance on three different types of text prompts -
Mask Simple: single attribute description
Mask Rich: multi-attribute description of masked object
Full Image: entire image description
I scroll through endless articles and Reddit and Twitter posts (so you don’t have to) and deliver the cream of the crop to you! If you would like a copy of this issue in your mailbox the next time I write, consider subscribing 🤗 Thanks for taking the time to read!