Generative AI, RAG and Data Infrastructure
A practical introduction to Generative AI, RAG and their data infrastructure
Part of my job is to promote AI to people in IT. A year or two ago, I often started with the basics: what AI and machine learning are and why they matter. This has changed recently. At several recent events I noticed that terms like Generative AI and RAG came up so often that I probably no longer need to start with those basics.
Generative AI predicts and generates the next token, pixel, or line of code within a context. Large language models (LLMs) such as ChatGPT and Llama are among the most popular generative AI models for text generation. Multimodal generative AI, which supports text, images, and video as both input and output, is getting better and better.
Many generative AI models perform surprisingly well (approaching or even exceeding human level in some cases). However, they sometimes generate content that is untrue (hallucinations) or out of date. This is where Retrieval Augmented Generation (RAG) can be very useful. By retrieving relevant information from a database of documents and using it as context, RAG enhances the generation to produce more informed and accurate outputs.
In this and a few following posts, I will look at generative AI and RAG from a data infrastructure point of view.
Generative AI and data infrastructure
Generative AI models are huge. LLMs with tens to hundreds of billions of parameters are common. Each parameter in a model is a floating-point number, and a full-precision float takes 32 bits (4 bytes) of memory.
Therefore, 4GB of memory is needed just to load a model with 1B parameters, and extra memory is required to train it. A common estimate is that every 1B parameters needs about 24GB of GPU memory for training, so a bare minimum of 1680GB of GPU memory is needed to train the 70B Llama 2 model. This means a minimum of 3 NVIDIA DGX H100 GPU servers (8 GPUs and 640GB of GPU memory per server) are required. However, training on that minimum would take years. In practice, many more GPUs are used to train a large model in a reasonable time. The GPT-3 model was trained on 10,000 GPUs, and most AI customers I have worked with have tens to hundreds of GPU servers.
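The arithmetic above can be captured in a few lines. This is only a back-of-the-envelope sketch using the rules of thumb from this post (4 bytes per full-precision parameter to load, roughly 24GB of GPU memory per 1B parameters to train); real memory needs vary with precision, optimizer, and parallelism strategy.

```python
import math

def load_memory_gb(params_billions: float) -> float:
    """Memory to hold full-precision (fp32) weights only: 4 bytes per parameter."""
    return params_billions * 1e9 * 4 / 1e9

def train_memory_gb(params_billions: float, gb_per_billion: float = 24.0) -> float:
    """Rule of thumb covering weights, gradients, optimizer states, activations."""
    return params_billions * gb_per_billion

def dgx_h100_servers_needed(total_gb: float, gb_per_server: float = 640.0) -> int:
    """Minimum DGX H100 servers, at 8 GPUs x 80GB = 640GB per server."""
    return math.ceil(total_gb / gb_per_server)

print(load_memory_gb(1))                  # 4.0 GB to load a 1B-parameter model
print(train_memory_gb(70))                # 1680.0 GB to train 70B Llama 2
print(dgx_h100_servers_needed(1680.0))    # 3 servers at a bare minimum
```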
Data infrastructure for training generative AI
GPUs are powerful and expensive. From a data infrastructure point of view, the question is how to keep the GPUs busy by feeding them data fast enough. First, we need to understand the data I/O patterns in a GPU-powered AI system. Both the reads during training and the writes during checkpointing are highly parallel and throughput-bound. The I/O requirements also depend on the model architecture, hyperparameters such as batch size, and GPU speed. For large DGX H100 based systems, we can refer to the storage performance guidelines from the NVIDIA SuperPOD reference architecture:
Some customers I have worked with simply use the Better column as a reference, while others prefer proof of concept (POC) tests (check my blog on benchmarking storage for AI).
It is beyond the scope of this blog, but I would like to highlight that besides I/O performance, enterprise storage features such as non-disruptive upgrades, data protection, and multi-protocol support, especially NFS and S3, are also important when choosing data storage for generative AI systems.
So far we have focused on training. How about inference, and RAG in particular? How do we build an optimal data infrastructure for RAG-powered generative AI applications?
RAG and data infrastructure
RAG combines external information sources with LLMs. It is an inference-phase integration in the generative AI project lifecycle. This means we do not have to go through the expensive training phase, which can run for hours to weeks, to bring validated and up-to-date information into LLM generation. Instead of sending the user query directly to an LLM, RAG first retrieves validated information from external data sources such as databases and documents, then sends this information, together with the user query, as context to the LLM. This leads to better generation.
For example, I asked a 7B Llama 2 model, “What is Pure Storage Evergreen One?” I got the following response, which is far from correct.
Llama 2 does not have proper knowledge of the question because the answer was not included in the datasets used to train the model. This is the case for many enterprises, whose domain knowledge is not included in LLMs trained on public datasets.
To improve the answer, I used RAG with the Evergreen One HTML page as the external data source.
With RAG, I got a much better response.
Data infrastructure for RAG
RAG encodes external data (the HTML page in the example above) so that the relevant parts can easily be retrieved at query time. The best option for storing and retrieving external data for RAG is a vector database, as vector databases support the similarity search that enables RAG to quickly retrieve data relevant to the user query. There are multiple vector database options, both open source and commercial. A few examples are ChromaDB, Pinecone, Weaviate, Elasticsearch with the vector scoring plugin, and pgvector, the PostgreSQL vector extension.
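To make the similarity search concrete, here is a minimal sketch of what a vector database does under the hood: a brute-force cosine-similarity scan over an index of embeddings. The document ids and hand-made three-dimensional vectors are purely illustrative; real systems use learned embeddings with hundreds or thousands of dimensions and approximate nearest-neighbour indexes rather than a full scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Tiny "index": document id -> embedding vector (hand-made for illustration).
index = {
    "evergreen_one.html": [0.9, 0.1, 0.0],
    "llama2_paper.pdf":   [0.1, 0.8, 0.3],
}

def query(q_vec: list[float], top_k: int = 1) -> list[str]:
    """Return the ids of the top_k documents most similar to the query vector."""
    ranked = sorted(index, key=lambda doc_id: cosine(q_vec, index[doc_id]),
                    reverse=True)
    return ranked[:top_k]

print(query([1.0, 0.0, 0.0]))  # ['evergreen_one.html'] -- the closest vector
```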
The I/O patterns for these vector databases are different from those of generative AI training: we need high-throughput writes for ingestion and low-latency reads for retrieval. While NFS is the protocol of choice for training, here we may use block, NFS, or S3 depending on the vector database software. We may also need something like Portworx or a CSI driver to provide persistent storage when running the vector database on Kubernetes, which is quite common.
Like any database, these also require careful planning of the data “schema”. What are the considerations when storing different types of data, including HTML, PDF, PowerPoint, and Word, in a vector database? Should we include metadata such as modification time and document structure? What about tables in the documents? These are all important practical design decisions when building data infrastructure for RAG.
Since the data is encoded as high-dimensional vectors, it can grow to several times its raw size when stored in a vector database for optimal RAG performance. This is also important for storage sizing.
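A back-of-the-envelope sizing sketch illustrates the expansion. All the parameters here are assumptions for illustration, not figures from any specific product: a 2000-byte chunk size, 1536-dimensional fp32 embeddings, and 500 bytes of metadata per chunk.

```python
def vector_store_gb(raw_gb: float, chunk_bytes: int = 2000,
                    dims: int = 1536, bytes_per_dim: int = 4,
                    metadata_bytes: int = 500) -> float:
    """Estimate vector database size for a raw text corpus.
    Each chunk stores its text, an embedding (dims * bytes_per_dim),
    and some metadata. All defaults are illustrative assumptions."""
    n_chunks = raw_gb * 1e9 / chunk_bytes
    per_chunk = chunk_bytes + dims * bytes_per_dim + metadata_bytes
    return n_chunks * per_chunk / 1e9

# 10 GB of raw text with 1536-dimensional fp32 embeddings:
print(round(vector_store_gb(10), 1))  # 43.2 GB -- over 4x the raw size
```

With these assumptions the embedding alone (1536 x 4 = 6144 bytes) is about three times the size of the text chunk it describes, which is where most of the expansion comes from.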
Conclusion
In this blog, I briefly introduced generative AI, RAG, and the unique impacts and requirements they impose on data infrastructure. It is rare for an individual (myself included) or even a team to master data infrastructure, generative AI and RAG applications, and everything in between. I have tried to connect the dots within my knowledge, and I hope this helps.
In future posts, I will share more examples and dig into the technical details of these generative AI applications, RAG, and their data infrastructure.