Vector Database and Storage

Is it true that generative AI and RAG increase data storage by up to 10x?

Yifeng Jiang
5 min read · May 30, 2024

At this year’s GTC in San Jose, one slide from an NVIDIA session caught my eye, and when I mentioned it to my colleagues, they all got excited. In the session, the speaker claimed that data, once embedded and then stored uncompressed for optimal RAG, can increase data storage usage by up to 10x.

A 10x increase is a lot: 100TB becomes 1PB, and 1PB becomes 10PB. No wonder people at storage companies got excited. But is it true that generative AI and RAG really expand data storage usage that much? And why is that? Let me try to test and confirm this.

In my previous blog, I briefly wrote about how RAG works and its data infrastructure. RAG encodes external data so that the relevant parts can be easily retrieved at query time. The best option for storing and retrieving external data for RAG is a vector database, because vector databases support similarity search, which lets RAG quickly retrieve the data relevant to a user query. To understand their storage usage, we need to dig a little deeper.

How Does a Vector Database Work?

A vector database is designed to efficiently store, index, and query data in the form of vectors, which are arrays of numbers representing data points in a high-dimensional space.

Here’s how a vector database typically indexes data (at a high level):

1. Vector Representation: First, the data, whether it’s images, text, or any other form of multimedia, is converted into vectors. Each vector represents the features of the data item in a numerical format. This is often done using neural network models or feature extraction algorithms.

2. Indexing: The vectors are then indexed to facilitate efficient retrieval. Vector databases use specialized indexing algorithms to manage and search through these high-dimensional spaces effectively.

3. Querying: When querying a vector database, the query itself is also converted into a vector using the same method as the indexed data. The database then uses the indexed structure to quickly retrieve vectors that are similar to the query vector, typically by calculating distances or similarities.

In real applications, vectors with hundreds or even thousands of dimensions are common. To handle large datasets and ensure quick retrieval, vector databases often implement additional optimizations, including GPU-accelerated parallel computation, distributing the database across multiple machines, and efficient caching mechanisms.
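To make the querying step concrete, here is a minimal brute-force sketch of similarity search using NumPy on random vectors. A real vector database replaces this linear scan with index structures such as HNSW, but the distance math is the same:

import numpy as np

# Toy "database": 1,000 vectors of 768 dimensions each
dim = 768
db_vectors = np.random.rand(1000, dim).astype(np.float32)

def search(query_vector, k=3):
    # L2 (Euclidean) distance between the query and every stored vector
    distances = np.linalg.norm(db_vectors - query_vector, axis=1)
    # Return the ids of the k nearest vectors, closest first
    return np.argsort(distances)[:k]

query_vector = np.random.rand(dim).astype(np.float32)
print(search(query_vector))  # e.g. [412  87 903]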

Vector Database Solutions

There are multiple choices of vector database, including open-source and commercial software as well as managed services. Some popular ones include Pinecone, Weaviate, Chroma, Qdrant, Milvus, and Vespa. Because the vector database is a critical building block in RAG, some classic databases, such as Redis, MongoDB, and Elasticsearch, have also started to add vector capabilities.

I chose to use Milvus to verify RAG data usage expansion. Given our focus on AI at scale, I am particularly interested in the following Milvus features:

  • GPU support
  • Object storage support
  • Distributed deployment on Kubernetes

These are supposed to contribute to Milvus’ performance and scalability.

Vector Database and Storage

To test vector database storage usage, I download a batch of papers (152 PDF files) from arXiv and extract the text from them. I then embed the text into 768-dimensional vectors using a sentence-transformers model, and store the vectors in Milvus. Finally, I compare the sizes of the original PDFs, the extracted text, and the Milvus database. Some code snippets are shown below.

Extract text from PDF:

import pdfplumber

def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() may return None for pages with no extractable text
            text += page.extract_text() or ""
    return text
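For example, to run this over the downloaded papers (assuming they sit in a local papers/ directory, which is a placeholder path):

from pathlib import Path

# Extract text from every downloaded PDF
texts = {p.name: extract_text_from_pdf(p) for p in Path("papers").glob("*.pdf")}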

Split the text into chunks, and create an embedding vector for each chunk:

# Imports assume LangChain's package layout (adjust to your installed version)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
splits = text_splitter.split_text(text)

# Create embeddings on GPU with a sentence-transformers model
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

hf = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
embeddings = hf.embed_documents(splits)
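A quick sanity check that each chunk maps to one 768-dimensional vector:

print(len(embeddings), len(embeddings[0]))  # number of chunks, 768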

Insert into Milvus:

from pymilvus import Collection, CollectionSchema, FieldSchema, DataType

# Define Milvus schema and create collection
# (field list inferred from the data inserted below and the search code that follows)
fields = [
    FieldSchema("id", DataType.INT64),
    FieldSchema("pub_month", DataType.INT64),
    FieldSchema("file_seq", DataType.INT64),
    FieldSchema("file_size", DataType.INT64),
    FieldSchema("chunk_num", DataType.INT64),
    FieldSchema("chunk_embedding", DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="arXiv paper", auto_id=True, primary_field='id')
collection = Collection(name="arxiv", schema=schema, shards_num=1)

# Insert to Milvus (column-based insert: one list per field, excluding the auto id)
data = [pub_months, file_seqs, file_sizes, chunk_nums, embeddings]
collection.insert(data)
collection.flush()

Create Milvus index:

# Create an HNSW index on the vector field
# M: max graph connections per node; efConstruction: candidate list size at build time
index_params = {"index_type": "HNSW", "params": {"M": 8, "efConstruction": 200}, "metric_type": "L2"}
collection.create_index(field_name='chunk_embedding', index_params=index_params)

To check that everything works, I conduct a hybrid search in Milvus, combining vector similarity with a scalar filter on the publication month:

# Load the Milvus collection to perform the search
collection.load()
query = 'Generative AI and RAG application.'
query_embedding = hf.embed_query(query)

# Conduct a hybrid vector search
search_param = {
    "data": [query_embedding],
    "anns_field": "chunk_embedding",
    "param": {"metric_type": "L2"},
    "limit": 3,
    "expr": "pub_month == 2401",
}
res = collection.search(**search_param)

# Check the result
hits = res[0]
print(f"- Total hits: {len(hits)}, hits ids: {hits.ids} ")
print(f"- Top1 hit id: {hits[0].id}, distance: {hits[0].distance}, score: {hits[0].score} ")

Here is the search result:

- Total hits: 3, hits ids: [449571692599744330, 449571692599740033, 449571692599740035] 
- Top1 hit id: 449571692599744330, distance: 0.902103066444397, score: 0.902103066444397

Storage usage comparison

The PDF and extracted text files are stored on the local filesystem, while the vectors in Milvus are stored on FlashBlade S3. So I use this command to get the S3 usage of the database:

aws s3 --endpoint-url=http://10.226.224.193 --profile fbs \
ls --summarize --human-readable --recursive s3://milvus

Now let’s compare the storage usage of the PDFs, the extracted text, and the vector database.

  • PDF: 520MB
  • Text: 11MB
  • Vector: 120MB

Observations from the above:

  • The PDF format is the largest because images and charts are not converted or stored in the other two formats in this test.
  • Compared to the extracted text, the vector format uses about 10x the storage to hold the same content (the text extracted from the PDFs).

Why does vector database use 10x more storage than text?

Most common English text, including all lowercase and uppercase letters, digits, and basic punctuation, falls within the ASCII range, which UTF-8 encodes using a single byte per character. That works out to about 1KB of storage for every 1,000 characters.

In this test, we split the text into 1,000-character chunks (chunk_size=1000), then encode each chunk into a 768-dimensional (dim=768) vector, which is represented as 768 floating-point numbers. A 32-bit floating-point number takes 4 bytes, so each 1KB chunk of text (1,000 characters) becomes 768 x 4 = 3,072 bytes of vector data. That is about 3x amplification, with the exact factor depending on chunk size and vector dimension.
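The same arithmetic in a few lines of Python, using the chunk size and dimension from this test:

chunk_size = 1000      # characters per chunk, ~1 byte each for ASCII text in UTF-8
dim = 768              # embedding dimensions
bytes_per_float = 4    # 32-bit float

text_bytes = chunk_size
vector_bytes = dim * bytes_per_float
print(vector_bytes / text_bytes)  # 3.072, i.e. ~3x before indexing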

Another big overhead comes from the indices. Depending on the index parameters, the index data can be as large as, or even larger than, the embedding vectors. For example, in my environment, the index files take up almost half of the total size of the Milvus bucket, adding roughly another 3x of amplification.
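As a rough back-of-the-envelope estimate (an assumption based on how common HNSW implementations lay out data, not Milvus’ exact on-disk format), an HNSW index keeps a copy of the raw vectors plus the graph links, which explains why the index can be nearly as large as the vectors themselves:

# Rough HNSW index size estimate: raw vector copy plus ~M*2 graph links
# (4 bytes each) per vector. Assumed layout, not Milvus' exact format.
num_vectors = 100_000
dim, M = 768, 8
index_bytes = num_vectors * (dim * 4 + M * 2 * 4)
print(index_bytes / 1024**2)  # ~299 MB, almost all of it the vector copy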

Between the high-dimensional vectors, the index files, and the scalar fields and metadata the database stores alongside them, I am not surprised to find a roughly 10x data expansion after ingesting text data into a vector database.

Conclusion

In this simple test, I confirmed that generative AI and RAG can indeed increase data storage usage by up to 10x. This is another reason why fast enterprise storage with built-in data compression is crucial for AI.
