The basic workflow of a RAG system is straightforward yet powerful. When a query is received, the system first converts it into a vector representation. This query vector is then used to search the vector database for similar content. The most relevant information is retrieved and fed into the generator along with the original query. For example, if asked about recent climate change policies, a RAG system might retrieve the latest environmental reports and use them to generate an up-to-date, factual response.
The result is a RAG ecosystem that’s as diverse as it is powerful. Some RAG techniques focus on improving retrieval accuracy, others on reducing computational costs, and still others on handling multi-modal data like images or audio.
This is the simplest form of RAG, and it’s the methodology that gained popularity shortly after ChatGPT was revealed to the world. Vanilla RAG follows a typical process that includes indexing, retrieval, and generation.
Indexing: Embedding models and language models generally have a context limitation. In simple terms, they can only understand a certain amount of text at a time; if the text is too long, these models may lose accuracy or fail to process it at all. To fit within these context limits, the text is divided into smaller, manageable segments called chunks. These chunks are then transformed into vector representations using an embedding model and stored in a vector database for efficient similarity search in the subsequent retrieval stage.
Retrieval: When a user query is received, it is converted into a vector representation using the same embedding model used to encode the chunks. Next, the system calculates similarity scores between this query vector and the vectors of the indexed chunks, then identifies and retrieves the top-K chunks with the highest similarity to the query, which are incorporated as expanded context in the prompt.
Generation: This process involves synthesizing the original query and the retrieved documents into a cohesive prompt for the large language model to process. The model’s response strategy may vary based on task-specific requirements, allowing it to either leverage its inherent knowledge or confine its answers to the information provided in the retrieved documents. Additionally, one can adjust the tone and format of the LLM’s response depending on the query, the context, and the specific use case being served.
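To make the three stages concrete, here’s a minimal end-to-end sketch, assuming the sentence-transformers package, the all-MiniLM-L6-v2 embedding model, and an in-memory list standing in for a real vector database; the final prompt would be sent to whichever LLM you use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Indexing: embed each chunk once and keep the vectors in memory.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "The 2024 policy update introduced stricter emission targets for heavy industry.",
    "Tomatoes need six to eight hours of sun and well-drained soil.",
    "BM25 ranks documents using term frequency and inverse document frequency.",
]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Retrieval: embed the query and return the top-K most similar chunks."""
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vector            # cosine similarity (vectors are normalized)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str, context: list[str]) -> str:
    """Generation: stitch the retrieved chunks and the query into one prompt for the LLM."""
    context_block = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"
    )

query = "What were the recent climate policy changes?"
prompt = build_prompt(query, retrieve(query))
print(prompt)   # this prompt is what gets sent to the language model
```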
Typically, most of the popular encoder models used for creating embeddings can handle about 512 tokens at a time.
RecursiveCharacterTextSplitter: This is often recommended as a starting point. It splits text based on a list of user-defined characters, trying to keep related pieces of text together.
HTMLHeaderTextSplitter and MarkdownHeaderTextSplitter: These split text based on HTML or markdown-specific characters, and include information about where each chunk came from.
SemanticChunker: This first splits on sentences, then combines ones next to each other if they’re semantically similar enough.
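As a rough illustration of the first option, here’s how chunking with LangChain’s RecursiveCharacterTextSplitter typically looks. In recent LangChain versions it lives in the langchain-text-splitters package (older releases expose it under langchain.text_splitter), and the size, overlap, and file name below are placeholders you’d tune for your embedding model’s token limit.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # max characters per chunk; keep this well under the embedder's limit
    chunk_overlap=50,    # small overlap so a sentence cut at a boundary keeps some context
    separators=["\n\n", "\n", ". ", " ", ""],  # try paragraphs, then lines, sentences, words
)

with open("report.txt") as f:   # any plain-text document
    text = f.read()

chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; first chunk:\n{chunks[0]}")
```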
The main point of using a vector database is to create an index that allows for approximate search, so you don’t have to compute too many cosine similarities. For many use cases, datasets with only a few thousand vectors don’t even require index creation. If you can live with up to 100 ms of latency, skipping index creation can simplify your workflow while still guaranteeing 100% recall.
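For a sense of how cheap exact search is at that scale, here’s a brute-force version using nothing but NumPy; the 5,000 random vectors stand in for your pre-computed, normalized chunk embeddings.

```python
import numpy as np

# Stand-in for a few thousand pre-computed, L2-normalized chunk embeddings.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(5_000, 768)).astype(np.float32)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def exact_top_k(query_vector: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact (100% recall) nearest-neighbour search: a single matrix-vector product."""
    scores = doc_vectors @ query_vector        # cosine similarity, since vectors are normalized
    return np.argsort(scores)[::-1][:k]        # indices of the k most similar documents

query = rng.normal(size=768).astype(np.float32)
query /= np.linalg.norm(query)
print(exact_top_k(query))
```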
Metadata filtering
Use an LLM call to extract entities like Q2 2024 and legal division from the user query.
Use a zero-shot entity detection model like GliNER to extract the same entities directly from the query.
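Here’s a small sketch of the second option, assuming the gliner package and the urchade/gliner_medium-v2.1 checkpoint; the query, labels, and threshold are illustrative placeholders, and the detected entities are then turned into a metadata filter for the retrieval step.

```python
from gliner import GLiNER

# Hypothetical user query used for illustration.
query = "What did the legal division spend on outside counsel in Q2 2024?"

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
labels = ["time period", "department"]
entities = model.predict_entities(query, labels, threshold=0.4)

# Turn whatever was detected into a metadata filter for the vector store.
metadata_filter = {e["label"]: e["text"] for e in entities}
print(metadata_filter)   # ideally something like {"time period": "Q2 2024", "department": "legal division"}
```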
BM25: The old king still reigns
In the world of RAG, we often get excited about fancy embedding techniques. But let’s not forget about the good old full-text search – it’s still the king in many scenarios. And when we talk about full-text search, BM25 is the algorithm that stands out.
Why should we care about BM25 when we have cool embedding techniques? Well, here’s the thing: while embeddings are indeed cool, they have their limitations. It’s very hard to encode all of a text’s information in a mere 768-dimensional vector (768 is just an approximate figure here; embedding sizes vary across models). It’s a lossy compression.
And at the heart of many full-text search systems is BM25, an algorithm built on tf-idf (term frequency-inverse document frequency), extended with term-frequency saturation and document-length normalization.
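If you want to try it, the rank_bm25 package gives you a working BM25 implementation in a few lines; the toy corpus and whitespace tokenization below are just for illustration.

```python
from rank_bm25 import BM25Okapi

corpus = [
    "How to grow tomatoes in raised beds",
    "Tomato soup recipes for winter",
    "Quarterly revenue report for the legal division",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "growing tomatoes"
scores = bm25.get_scores(query.lower().split())
print(scores)                                            # one relevance score per document
print(bm25.get_top_n(query.lower().split(), corpus, n=2))  # the two best-matching documents
```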
To capture all the strengths and mitigate the pitfalls of embedding search, it’s a good idea to include keyword search in your pipeline. The current best practice is to use both keyword and vector-based search, retrieve documents using both methods, and then combine the results to get the final context. This is also referred to as hybrid retrieval or hybrid search in many places.
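The combination step itself can be as simple as Reciprocal Rank Fusion, one common way to merge the two ranked lists; the sketch below assumes each retriever returns an ordered list of document IDs.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists using Reciprocal Rank Fusion.

    Each document scores 1 / (k + rank) in every list it appears in;
    the constant k dampens the influence of any single ranker.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Ranked IDs coming back from the two retrievers (illustrative values).
bm25_results   = ["doc3", "doc1", "doc7", "doc2"]
vector_results = ["doc1", "doc4", "doc3", "doc9"]

fused = reciprocal_rank_fusion([bm25_results, vector_results])
print(fused[:5])   # documents favoured by both retrievers float to the top
```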
HyDE: Hypothetical document embeddings
Now, let’s dive into a particularly clever query transformation technique called HyDE. This method addresses a common issue in RAG: the semantic gap between queries and document embeddings.
Here’s the problem: When a user searches for “How to grow tomatoes?”, our retrieval system might struggle. Why? Because documents often contain broader information than just tomato growing methods – they might include varieties, nutrition, and recipes. As a result, the system might return less relevant results about other vegetables, general gardening, or tomato pests. It could also retrieve documents that only briefly mention growing tomatoes or use similar words in different contexts, like “growing” a tomato business. These issues arise because the document embeddings aren’t perfectly aligned with the specific search query, potentially leading to less accurate and helpful search results.
HyDE tackles this issue with a smart workaround. Here’s how it works:
We use an LLM to generate a fake document based on the search query. We basically ask the LLM to “write a passage containing information about the search query”.
We then use an embedding model to encode this fake document into embeddings.
Next, we use vector similarity search to find the document chunks in our knowledge base that are most similar to this hypothetical document embedding. The key here is that we’re not searching with the query’s embedding, but with the fake HyDE document’s embedding.
Finally, we use the retrieved document chunks to generate the final response.
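Here’s a minimal sketch of that loop, reusing the sentence-transformers embedder from earlier; generate_hypothetical_document is a stand-in for the real LLM call and returns a canned passage so the example runs without an API key.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def generate_hypothetical_document(query: str) -> str:
    # Placeholder for an LLM call along the lines of:
    #   "Write a passage containing information about: {query}"
    # Hard-coded here so the sketch runs without an API key.
    return (
        "Tomatoes grow best in full sun with consistent watering. "
        "Start seeds indoors, transplant after the last frost, and stake the vines as they grow."
    )

def hyde_retrieve(query: str, chunks: list[str], chunk_vectors: np.ndarray, k: int = 3) -> list[str]:
    fake_doc = generate_hypothetical_document(query)                         # 1. fake answer document
    fake_vector = embedder.encode([fake_doc], normalize_embeddings=True)[0]  # 2. embed the fake doc
    scores = chunk_vectors @ fake_vector                                     # 3. search with its embedding
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]                 # 4. chunks for final generation
```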
Reranking
If you’re looking for a way to instantly boost your RAG pipeline’s performance, look no further than reranking. In fact, I’d go as far as to say that reranking should be a default component in any RAG pipeline. It’s that powerful.
Enter the cross-encoder. A reranker typically uses a cross-encoder model, which evaluates a query-document pair and generates a similarity score. This method is more powerful because it allows both the query and the document to provide context for each other. It’s like introducing two people and letting them have a conversation, rather than trying to match them based on separate descriptions.
The reranking solution
Here’s where reranking comes in. The common practice is to:
First, retrieve the top k results (say, top 50) using a bi-encoder. This gives us approximate results quickly.
Then, rerank those candidates with a cross-encoder and keep the top 10.
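With the sentence-transformers library, that second step is only a few lines; the candidate list and the ms-marco cross-encoder checkpoint below are placeholders.

```python
from sentence_transformers import CrossEncoder

# Candidates returned by the fast bi-encoder stage (top ~50 in practice).
candidates = [
    "A guide to growing tomatoes from seed.",
    "How to grow your small business in 2024.",
    "Common pests that attack tomato plants.",
]
query = "How to grow tomatoes?"

# The cross-encoder scores every (query, document) pair jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Keep only the highest-scoring documents for the final prompt.
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[:2])
```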
LLMLingua looks at each word’s “perplexity” – think of this as a measure of how surprising or informative a word (or token, in LLM terms) is. In terms of information entropy, tokens with lower perplexity contribute less to the overall entropy gain of the language model. In other words, removing low-perplexity tokens has a relatively minor impact on the LLM’s comprehension of the context. Words with low perplexity are like filler words – they don’t add much meaning, so LLMLingua kicks them out.
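A rough sketch with the llmlingua package looks like the following; note that PromptCompressor downloads a small language model to score token perplexity, and the exact argument names may differ between library versions.

```python
from llmlingua import PromptCompressor

compressor = PromptCompressor()   # loads a small causal LM used to measure token perplexity

retrieved_chunks = [
    "Tomatoes require roughly six to eight hours of direct sunlight per day to fruit well.",
    "It is, generally speaking, widely known and often said that plants do of course need water.",
]

result = compressor.compress_prompt(
    retrieved_chunks,                            # the retrieved context to shrink
    question="How much sunlight do tomatoes need?",
    target_token=100,                            # rough token budget for the compressed context
)
print(result["compressed_prompt"])               # low-perplexity filler has been dropped
```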
LLM-based relevance assessment
Another very straightforward approach is to let the LLM be its own critic. We can add another layer of LLM calls that evaluates the relevance of each retrieved document before the final response is generated. This self-critique process helps filter out irrelevant content, ensuring the LLM focuses on what’s really important for answering the query.
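As a sketch, using the OpenAI Python client (any chat-completion API works, and the model name here is just a placeholder), the self-critique step can be a simple YES/NO judgment per retrieved document.

```python
from openai import OpenAI

client = OpenAI()  # any chat-completion API works; OpenAI's client is just one option

def is_relevant(query: str, document: str, model: str = "gpt-4o-mini") -> bool:
    """Ask the LLM to judge whether a retrieved document actually helps answer the query."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Answer only YES or NO. Is the following document useful for "
                f"answering the question?\n\nQuestion: {query}\n\nDocument: {document}"
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

def filter_relevant(query: str, documents: list[str]) -> list[str]:
    """Drop retrieved chunks the model flags as irrelevant before final generation."""
    return [doc for doc in documents if is_relevant(query, doc)]
```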
By implementing these context selection techniques, we’re essentially giving our LLM a pair of noise-cancelling headphones and a highlighter. We’re helping it focus on what’s truly important, leading to more accurate, relevant, and insightful responses. In the world of RAG, sometimes less really is more!