This is a project I'm working on right now: I'm compiling a list of questions and answers for Generative AI interviews.
I'm using this reference as the base (credit to them for compiling it); however, I'm taking a lot of liberties in editing the questions, and the answers are completely my own.
Note: I'm trying to keep the answers I write myself to a minimum, since I'm by no means an authoritative source on this topic; I will provide references to the best of my ability. I've refrained from adding visual aids to keep the document readable and maintenance simple. The resources and references I cite contain a wealth of information, most of it with visuals.
I plan to expand this to Generative AI in general, not just language, covering everything from diffusion models to vision-language models. Once I get the basic structure down and I'm happy with the preliminary results, I will work on establishing an efficient methodology for contributing to this repository, and then I will open it up for contributions; but for now, I want to keep it simple and focused.
Important:
I think it might be necessary to clarify that the answers I provide, whether they are my own write-ups or citations of a source, are in no way definitive. What I'm trying to do is get you started on the right path and give you a general idea of what to expect; you should definitely read any and all resources I provide, and then some. If you want this to be your last stop, this is the wrong place for you. This is where it starts.
Also, if you're just getting started, my one and only piece of advice is:
Get comfortable reading papers, because they never end.
Paper on How to Read a Paper: How to Read a Paper
2. Retrieval Augmented Generation (RAG)
4. Embedding Models for Retrieval
5. Vector Retrieval, Databases and Indexes
7. Language Model Internal Workings
Let's consider a dataset where each data point represents a cat. Let's pass it through each type of model and see how they differ:
- Predictive/Discriminative Model: This type of model tries to discriminate between data points, either by learning a decision boundary (a mapping function from inputs to labels) or by predicting the probability of a class label given the input, based on the patterns it has learned from the data.
- Generative Model: This type of model tries to generate new data points by learning the underlying distribution of the data from all the data points in the dataset.
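To make the distinction concrete, here is a toy sketch (my own illustration, with made-up "cat"/"dog" features): the discriminative model learns p(label | features) via a decision boundary, while the generative model learns the feature distribution of cats and can sample new "cats" from it.

```python
# Toy sketch: "cats" vs "dogs" described by two made-up features
# (weight_kg, tail_length_cm). Discriminative = p(label | features);
# generative = p(features) for cats, which we can sample from.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
cats = rng.normal(loc=[4.0, 25.0], scale=[0.8, 3.0], size=(100, 2))
dogs = rng.normal(loc=[20.0, 35.0], scale=[5.0, 6.0], size=(100, 2))

# Discriminative: predict the class label given the features.
X = np.vstack([cats, dogs])
y = np.array([0] * 100 + [1] * 100)  # 0 = cat, 1 = dog
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[5.0, 26.0]]))  # p(cat), p(dog) for a new animal

# Generative: model the distribution of cat features, then sample new "cats".
mu, cov = cats.mean(axis=0), np.cov(cats, rowvar=False)
new_cats = rng.multivariate_normal(mu, cov, size=3)
print(new_cats)
```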
Let's build the definition of a Large Language Model (LLM) from the ground up:
- Language Models (LMs): These are probabilistic models that learn from natural language; they can range from simple Bayesian models to more complex neural networks.
- Large: Consider all the data available (a literal dump of the internet), enough compute to process and learn from it, and a model with enough capacity to handle the complexity of that data.
- Large Language Model: The amalgamation of the two, a model that can learn from a large amount of data, and generate text that is coherent and human-like.
Further reading: Common Crawl
Large Language Models are often trained in multiple stages; these stages are commonly called pretraining, fine-tuning, and alignment.
The purpose of this stage is to expose the model to all of language in an unsupervised manner; it is often the most expensive part of training and requires a lot of compute. Pretraining is often done on something like the Common Crawl dataset; processed versions of it, such as FineWeb and RedPajama, are commonly used. To facilitate this broad type of learning, there exist multiple training objectives we can use, such as Masked Language Modeling (MLM), Next Sentence Prediction (NSP), and more.
Masked Language Modeling is based on the Cloze test, where we mask out a word in a sentence and ask the model to predict it, similar to a fill-in-the-blank test. It differs from asking the model to predict the next word in a sentence, as it requires the model to understand the context of the entire sentence, not just the preceding sequence of words.
Next Sentence Prediction is a task where the model is given two sentences, and it has to predict if the second sentence follows the first one. As simple as it sounds, it requires the model to understand the context of the first sentence, and the relationship between the two sentences.
An excellent resource to learn more about these tasks is the BERT paper.
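As a quick, hands-on illustration of the MLM objective, here is a minimal sketch using the Hugging Face fill-mask pipeline; the BERT checkpoint is my arbitrary choice, not one prescribed by the sources above.

```python
# Minimal illustration of Masked Language Modeling (the Cloze-style task):
# the model predicts the token hidden behind [MASK] from bidirectional context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The cat sat on the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```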
This stage is much simpler than pretraining, as the model has already learned a lot about language, and now we just need to teach it about a specific task. All we need for this stage is the input data (prompts) and the labels (responses).
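To make the "prompts and responses" point concrete, here is a hedged sketch of one common way fine-tuning data is prepared for a causal language model: the prompt tokens are masked out of the loss (using the -100 ignore-label convention from Hugging Face transformers), so the model is only trained to produce the response. The tokenizer choice and the helper function are illustrative.

```python
# Sketch of supervised fine-tuning data preparation for a causal LM.
# Assumption (not from the text): loss is computed only on response tokens,
# using the -100 "ignore" label convention from Hugging Face transformers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM tokenizer

def build_example(prompt: str, response: str) -> dict:
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    return {
        "input_ids": prompt_ids + response_ids,
        # -100 masks the prompt positions so only the response contributes to the loss.
        "labels": [-100] * len(prompt_ids) + response_ids,
    }

print(build_example("Q: What is the capital of France?\nA:", " Paris."))
```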
This stage is often the most crucial and complex stage; it can require separate reward models, different learning paradigms such as Reinforcement Learning, and more.
This stage mainly aims to align the model's predictions with human preferences, and it often interweaves with the fine-tuning stage. Essential reading for this stage is the InstructGPT paper, which introduced the concept of Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO).
Other methods of aligning the model's predictions with human preferences include (a small sketch of the DPO objective follows this list):
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- SimPO: Simple Preference Optimization with a Reference-Free Reward
- ORPO: Monolithic Preference Optimization without Reference Model
- KTO: Model Alignment as Prospect Theoretic Optimization
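As a concrete (and deliberately simplified) example of preference optimization, here is a sketch of the DPO objective from the paper linked above: it only needs log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model, with no separate reward model. The tensor values are toy stand-ins.

```python
# Sketch of the Direct Preference Optimization (DPO) loss. Inputs are summed
# log-probabilities of the chosen (preferred) and rejected responses under the
# trained policy and a frozen reference model; beta controls the strength of
# the implicit KL penalty.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()

# Toy tensors standing in for per-response summed log-probs.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```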
Tokens are the smallest units of text that the model can understand; they can be words, subwords, or characters.
Tokenizers are responsible for converting text into tokens; they can be as simple as splitting the text on whitespace, or as complex as subword tokenization. The choice of tokenizer can have a significant impact on the model's performance, as it affects how the model represents and understands text.
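Here is a small sketch contrasting a naive whitespace split with a trained subword tokenizer; GPT-2's byte-level BPE is used purely as an example.

```python
# Contrast a naive whitespace tokenizer with a trained subword tokenizer
# (GPT-2's byte-level BPE, chosen here purely for illustration).
from transformers import AutoTokenizer

text = "Tokenization handles unbelievably rare words."
print(text.split())  # whitespace "tokenizer"

bpe = AutoTokenizer.from_pretrained("gpt2")
print(bpe.tokenize(text))      # subword pieces
print(bpe(text)["input_ids"])  # the integer IDs the model actually sees
```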
Some common tokenizers include:
Recommended reading (and watching):
How do you estimate the cost of running an API-based / closed-source LLM vs. an open-source, self-hosted LLM?
This is a very loaded question, but here are some resources to explore this topic further:
- Using LLMs for Enterprise Use Cases: How Much Does It Really Cost?
- The Challenges of Self-Hosting Large Language Models
- The Case for Self-Hosting Large Language Models
- Exploring the Differences: Self-hosted vs. API-based AI Solutions
Parameters include:
- Temperature
- Top P
- Max Length
- Stop Sequences
- Frequency Penalty
- Presence Penalty
Each of these parameters can be tuned to improve the performance of the model, and the quality of the generated text.
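As a hedged illustration of where these parameters show up in practice, here is an OpenAI-style chat completion call; the client, model name, and values are illustrative, and other providers expose close equivalents under similar names.

```python
# Sketch of sampling parameters in an OpenAI-style chat completion request.
# Model name and values are illustrative; check your provider's docs for
# exact parameter names and supported ranges.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a one-line haiku about RAG."}],
    temperature=0.7,        # randomness of sampling
    top_p=0.9,              # nucleus sampling cutoff
    max_tokens=64,          # maximum length of the completion
    stop=["\n\n"],          # stop sequences
    frequency_penalty=0.0,  # discourage repeated tokens
    presence_penalty=0.0,   # encourage new topics
)
print(response.choices[0].message.content)
```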
Recommended reading:
- How to tune LLM Parameters for optimal performance
- OpenAI Documentation
- LLM Settings by Prompt Engineering Guide
Decoding strategies are used to pick the next token in the sequence; they can range from simple greedy decoding to more complex sampling strategies.
Some common decoding strategies include:
- Greedy Decoding
- Beam Search
Newer decoding strategies include Speculative Decoding (also called assisted decoding), which is a wild concept: it involves using candidate tokens from a smaller (and thus faster) draft model, which are then verified by the bigger model, to generate a response much more quickly.
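The sketch below shows these strategies side by side using Hugging Face's generate API; the model names (and the use of a smaller GPT-2 as the draft model for assisted decoding) are my own illustrative choices.

```python
# Sketch of common decoding strategies with Hugging Face `generate`.
# Model names are illustrative; any causal LM (plus a smaller draft model
# sharing the same tokenizer for assisted decoding) will do.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large")
draft = AutoModelForCausalLM.from_pretrained("gpt2")  # smaller, faster draft model

inputs = tokenizer("The key idea of retrieval augmented generation is", return_tensors="pt")

greedy = model.generate(**inputs, max_new_tokens=30)                      # greedy decoding
beam = model.generate(**inputs, max_new_tokens=30, num_beams=4)           # beam search
sampled = model.generate(**inputs, max_new_tokens=30, do_sample=True,
                         temperature=0.8, top_p=0.9)                      # sampling
assisted = model.generate(**inputs, max_new_tokens=30,
                          assistant_model=draft)                          # speculative/assisted decoding

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```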
Recommended reading:
- Text generation strategies by HuggingFace
- Speculative Decoding Paper
- A Hitchhiker’s Guide to Speculative Decoding by Team PyTorch at IBM
In the decoding process, LLMs autoregressively generate text one token at a time. There are several stopping criteria that can be used to determine when to stop generating text (a toy generation loop illustrating them follows this list). Some common stopping criteria include:
- Maximum Length: Stop generating text when the generated text reaches a certain length.
- End of Sequence (EOS) Token: Stop generating text when the model generates the end-of-sequence token.
- Stop Sequences: Stop generating text when the model generates a predefined stop sequence.
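Below is a toy generation loop, with a stubbed "model", showing where each of these criteria kicks in; it is purely illustrative.

```python
# A toy generation loop illustrating the three stopping criteria above.
# `next_token` is a stand-in for a real model's sampling step.
def generate_text(next_token, prompt_tokens, eos_token="<eos>",
                  max_length=50, stop_sequences=("\n\n",)):
    tokens = list(prompt_tokens)
    while True:
        token = next_token(tokens)
        tokens.append(token)
        text = "".join(tokens)
        if token == eos_token:                              # end-of-sequence token
            break
        if len(tokens) >= max_length:                       # maximum length
            break
        if any(text.endswith(s) for s in stop_sequences):   # stop sequences
            break
    return text

# Example with a dummy "model" that always emits the EOS token.
print(generate_text(lambda toks: "<eos>", ["Hello, "]))
```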
A prompt contains any of the following elements:
- Instruction - a specific task or instruction you want the model to perform
- Context - external information or additional context that can steer the model to better responses
- Input Data - the input or question that we are interested to find a response for
- Output Indicator - the type or format of the output.
Reference: Prompt Engineering Guide
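To make these elements concrete, here is a toy prompt assembling all four; the wording is my own illustration, not taken from the guide.

```python
# A toy prompt combining instruction, context, input data, and an output
# indicator. The wording is purely illustrative.
instruction = "Classify the sentiment of the review as Positive, Negative, or Neutral."
context = "The reviews come from a movie-review website."
input_data = 'Review: "The plot was thin, but the acting saved it."'
output_indicator = "Sentiment:"

prompt = "\n".join([instruction, context, input_data, output_indicator])
print(prompt)
```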
Recommended reading:
- Zero-shot Prompting
- Few-shot Prompting
- Chain-of-Thought Prompting
- Self-Consistency
- Generate Knowledge Prompting
- Prompt Chaining
- Tree of Thoughts
- Retrieval Augmented Generation
- ReAct
Reference: Prompt Engineering Guide
Recommended reading:
In-context learning is a very intuitive and easy-to-understand learning paradigm in Natural Language Processing. It encompasses concepts such as few-shot learning. It can be as simple as providing a few examples of the task you want the model to perform; the model will learn from those examples and generate responses accordingly.
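A minimal illustration of few-shot in-context learning as a prompt-construction exercise (the examples and labels are made up): the "training data" lives entirely in the prompt.

```python
# Few-shot in-context learning: the task demonstrations are placed directly
# in the prompt, and the model is expected to continue the pattern.
examples = [
    ("I loved every minute of it.", "Positive"),
    ("The battery died after a week.", "Negative"),
]
query = "The soundtrack was forgettable."

prompt = "Classify the sentiment of each sentence.\n\n"
prompt += "\n".join(f"Text: {text}\nLabel: {label}" for text, label in examples)
prompt += f"\nText: {query}\nLabel:"
print(prompt)
```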
Recommended Reading:
- Towards Understanding In-context Learning from COS 597G (Fall 2022): Understanding Large Language Models at Princeton University
- Understanding In-Context Learning
It has been shown that In-context Learning can only emerge when the models are scaled to a certain size, and when the models are trained on a diverse set of tasks. In-context learning can fail when the model is not able to perform complex reasoning tasks.
Recommended Reading:
This is a very broad question, but the following will help you form a basic understanding of how to design prompts for a specific task:
- Best Practices for Prompt Engineering from OpenAI
- General Tips for Designing Prompts
- Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4
- Prompting Fundamentals and How to Apply them Effectively by Eugene Yan (Amazon)
Alternatively, newer research directions investigate optimizing prompts algorithmically; this has been explored extensively in the DSPy package, which provides the means to do so, and their work is also published in this paper.
There is no single answer to this question; I included it as an excuse to link this reference:
Suppose you want an LLM to output JSON or YAML or any other structure, how would you ensure the output is parsable?
There are multiple methods to get LLMs to generate structured outputs that are parsable every time; common methods depend on the concept of function calling in LLMs.
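One common pattern (among several) is to describe the schema in the prompt and validate/retry on the application side; the sketch below shows that approach with a hypothetical `call_llm` stand-in. Real deployments often rely on function calling, JSON modes, or constrained decoding instead.

```python
# Prompt-and-validate pattern for parsable output. `call_llm` is a
# hypothetical stand-in for whatever client you use.
import json

SCHEMA_HINT = 'Respond ONLY with JSON: {"name": string, "year": integer}'

def get_structured(call_llm, question: str, retries: int = 3) -> dict:
    prompt = f"{question}\n{SCHEMA_HINT}"
    for _ in range(retries):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            if isinstance(data, dict) and {"name", "year"} <= data.keys():
                return data
        except json.JSONDecodeError:
            pass  # fall through and retry
    raise ValueError("Model never produced parsable JSON")

# Dummy "LLM" so the sketch runs end to end.
print(get_structured(lambda p: '{"name": "GPT", "year": 2018}', "Name an LLM."))
```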
Recommended Reading and Viewing:
The term hallucination describes cases where LLMs produce text that is incorrect, nonsensical, or unrelated to reality.
Reference: LLM Hallucination—Types, Causes, and Solution by Nexla
Recommended Reading:
- Hallucination (Artificial Intelligence) - Wikipedia
- Hallucination is Inevitable: An Innate Limitation of Large Language Models
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The concept of Chain-of-Thought Prompting is known to enhance reasoning capabilities in LLMs. This technique involves prompting the model to break a complex task down into a series of intermediate reasoning steps, which guide it towards the final output.
Recommended Reading:
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Chain-of-Thought Prompting by Prompt Engineering Guide
Retrieval Augmented Generation (RAG) is a common design pattern for grounding LLM answers in facts. This technique involves retrieving relevant information from a knowledge base and using it to guide the generation of text by the LLM.
Recommended Reading:
- Patterns for Building LLM-based Systems & Products by Eugene Yan (Amazon)
- LLM From the Trenches: 10 Lessons Learned Operationalizing Models at GoDaddy
- What We’ve Learned From A Year of Building with LLMs
Retrieval Augmented Generation (RAG) is composed of two main components:
- A retriever: This component is responsible for retrieving relevant information from a knowledge base given a query.
- A generator: This component is responsible for generating text based on the retrieved information.
The intuition behind RAG is that by combining the strengths of retrieval-based and generation-based models, we can create a system that is capable of generating text that is grounded in facts, thus limiting hallucination.
RAG is often the go-to technique for answering complex questions based on a knowledge base, as it allows the model to leverage external information to provide more accurate and informative answers. It is not always feasible to fine-tune a model on proprietary data, and RAG provides a way to incorporate external knowledge without the need for fine-tuning.
Provide a high-level overview of the steps involved in implementing a full solution that utilizes RAG to answer a complex question based on a knowledge base
A full solution that utilizes RAG to answer a complex question based on a knowledge base would involve the following steps (a minimal end-to-end sketch follows this list):
- Data Ingestion: documents or data streams that comprise the knowledge base are ingested into a data pipeline and processed in a way that is suitable for retrieval.
- Indexing: the processed data is indexed in a way that allows for efficient retrieval.
- Retrieval: given a query, the retriever component retrieves relevant information from the knowledge base.
- Generation: the generator component generates text based on the retrieved information.
- Post-processing: the generated text is post-processed to ensure factuality and integrity.
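Here is a minimal end-to-end sketch of these steps; the embedding model is an arbitrary choice, `call_llm` is a hypothetical stand-in for your generator, and post-processing is left as a comment.

```python
# A minimal RAG sketch covering ingestion, indexing, retrieval, and generation.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "The Great Wall of China is over 21,000 km long.",
    "Mount Everest is the highest mountain above sea level.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)  # the "index"

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)
    scores = doc_vectors @ q.T  # cosine similarity (vectors are normalized)
    top = np.argsort(-scores.ravel())[:k]
    return [documents[i] for i in top]

def answer(query: str, call_llm) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only the context below.\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)  # post-process / fact-check the output here

print(answer("When was the Eiffel Tower finished?", call_llm=lambda p: p))  # echo for demo
```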
This is a very loaded question, but here are some resources to explore this topic further:
- Fine-Tuning vs. Retrieval Augmented Generation (RAG): Tailoring Large Language Models to Your Needs
- Enhancing LLMs with Retrieval Augmented Generation
- Post by Justin Zhao, Founding Engineer @ Predibase
- Musings on building a Generative AI product - Linkedin Engineering
- RAG vs. Fine-tuning by Armand Ruiz, VP of Product, AI Platform @ IBM
- Patterns for Building LLM-based Systems & Products by Eugene Yan (Amazon)
- What We’ve Learned From A Year of Building with LLMs
- LLM From the Trenches: 10 Lessons Learned Operationalizing Models at GoDaddy
Chunking text is the process of breaking down a large piece of text into smaller, more manageable chunks. In the context of RAG systems, chunking is important because it allows the retriever component to efficiently retrieve relevant information from the knowledge base. By breaking the knowledge base down into smaller chunks, the retriever can focus on retrieving only the pieces that are relevant to the query, which improves both the accuracy and the efficiency of the retrieval process.
During the training of embedding models, which are often used as retrievers, positive and negative pairs of text are used to indicate which pieces of text correspond to each other; examples include the titles, headers, and subheaders of a Wikipedia page paired with their corresponding paragraphs, Reddit posts paired with their top-voted comments, etc.
A user query is embedded, and an index is queried; if the index contained entire documents to be queried for top-k hits, the retriever would not be able to return the most relevant information, as the documents being queried would be far too large and too coarse to match against.
To summarize, we chunk text because:
- We need to break down large pieces of text into smaller, more manageable chunks, where we ideally wish to have each chunk contain defined pieces of information we can query.
- Embedding models often have fixed context lengths; we cannot embed an entire book.
- Intuitively, when we search for information, we know the book we want to use as reference (corresponding to an index here), we'd use chapters and subchapters (our chunks) to find the information we need.
- Embedding models compress semantic information into a lower dimensional space, as the size of the text increases, the amount of information that is lost increases, and the model's ability to retrieve relevant information decreases.
Suppose we have a book containing 24 chapters and a total of 240 pages. This means each chapter contains 10 pages, and each page contains 3 paragraphs. Let's suppose that each paragraph contains 5 sentences, and each sentence contains 10 words. In total, we have: 10 words * 5 sentences * 3 paragraphs = 150 words per page, 150 * 10 pages = 1,500 words per chapter, and 1,500 * 24 = 36,000 words in the entire book. For simplicity, our tokenizer is a whitespace tokenizer, and each word is a token.
Suppose our embedding model can embed at most 8,192 tokens:
- If we were to embed the entire book, we would have to split it into 5 chunks of roughly 7,200 tokens each (just under 5 chapters per chunk). This is a tremendous amount of information to compress into a single embedding, and the model would not be able to retrieve relevant information efficiently.
- We can embed each chapter individually; each chapter would yield 1,500 tokens, which is well within the model's capacity to embed. But we know that chapters contain multiple topics, and we would not be able to retrieve the most relevant information.
- We can embed each page, resulting in 150 tokens per page. This is a good balance between the two extremes, as pages often contain a single topic, and we would be able to retrieve the most relevant information; however, what if the information we need is spread across multiple pages?
- We can embed each paragraph individually; each paragraph would yield 50 tokens, which is well within the model's capacity to embed. Paragraphs often contain a single topic, and we would be able to retrieve the most relevant information; however, what if the flow of information is not linear, and the information we need is spread across multiple paragraphs that we need to aggregate?
- We can embed each sentence individually, but here we risk losing the context of the paragraph, and the model would not be able to retrieve the most relevant information.
All of this is to illustrate that there is no fixed way to chunk text, and the best way to chunk text is to experiment and see what works best for your use case.
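For illustration, here is a minimal fixed-size chunker with overlap; it is a sketch, not a recommendation of any particular strategy.

```python
# A minimal fixed-size chunker with overlap, in tokens (whitespace "tokens"
# here to keep the sketch dependency-free). Real systems usually chunk by the
# embedding model's tokenizer and/or by document structure (sections, paragraphs).
def chunk(text: str, chunk_size: int = 150, overlap: int = 30) -> list[str]:
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        start += chunk_size - overlap  # slide the window, keeping some overlap
    return chunks

book = "word " * 1000
print(len(chunk(book)), "chunks")  # 9 overlapping chunks of up to 150 tokens
```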
What are the different chunking strategies used in RAG systems and how do you evaluate your chunking strategy?
An authoritative source on this topic is the excellent notebook and accompanying video by Greg Kamradt, in which they explain the different levels of text splitting.
The notebook also goes over ways to evaluate and visualize the different levels of text splitting, and how to use them in a retrieval system.
Recommended Viewing:
- ChunkViz: A Visual Exploration of Text Chunking
- RAGAS: An Evaluation Framework for Retrieval Augmented Generation
Vector embeddings are the mapping of textual semantics into an N-dimensional space where vectors represent text; within the vector space, similar text is represented by similar vectors.
Recommended Reading:
Embedding models are Language Models trained for the purpose of vectorizing text; they are often BERT derivatives trained on a large corpus of text to learn its semantics. Recent trends, however, show that it is also possible to use much larger language models for this purpose, such as Mistral or Llama.
Recommended Reading and Viewing:
- Quickstart for SentenceTransformers
- llm2vec
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
Embedding models are often used as retrievers; to utilize their retrieval capabilities, semantic textual similarity is used, wherein the vectors produced by the models are compared using similarity metrics such as dot product, cosine similarity, etc.
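A minimal sketch of this, using the sentence-transformers library with an arbitrary model choice: the query and documents are embedded separately and compared with cosine similarity.

```python
# Sketch of retrieval by semantic textual similarity with a bi-encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("How tall is Mount Everest?", convert_to_tensor=True)
doc_vecs = model.encode(
    ["Mount Everest rises 8,849 metres above sea level.",
     "The Nile is the longest river in Africa."],
    convert_to_tensor=True,
)
print(util.cos_sim(query_vec, doc_vecs))  # higher score = more similar
```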
- Dense Multi-vectors (e.g. ColBERT)
- Dense Single-vectors (e.g. BERT with Pooling)
- Sparse Single-vectors (e.g. BM25 or SPLADE)
Recommended Reading:
- Okapi BM25
- Text Embeddings by Weakly-Supervised Contrastive Pre-training
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking
- BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Embedding models are trained with contrastive losses, ranging from simple contrastive loss up to more complex loss functions such as InfoNCE and Multiple Negatives Ranking Loss. A process known as hard negative mining is also utilized during training.
Recommended Reading:
- Contrastive Representation Learning by Lilian Weng (OpenAI)
- Representation Learning with Contrastive Predictive Coding
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- SentenceTransformer Losses Documentation
- Hard Negative Mining Used by BGE Text Embedding Models
Contrastive learning is a technique used to train embedding models; it involves learning to differentiate between positive and negative pairs of text. The model is trained to maximize the similarity between positive pairs and minimize the similarity between negative pairs.
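Here is a sketch of an in-batch-negatives contrastive loss (the idea behind InfoNCE and MultipleNegativesRankingLoss); the embeddings are random stand-ins for encoder outputs, and the temperature value is illustrative.

```python
# In-batch-negatives contrastive loss: each query should score highest against
# its own positive passage, with the other passages in the batch as negatives.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, passage_emb, temperature=0.05):
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    scores = query_emb @ passage_emb.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(scores.size(0))             # positives sit on the diagonal
    return F.cross_entropy(scores, labels)

queries = torch.randn(8, 384)   # stand-ins for encoder outputs
passages = torch.randn(8, 384)
print(in_batch_contrastive_loss(queries, passages))
```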
Recommended Reading:
- SentenceTransformers Losses
- Contrastive Representation Learning by Lilian Weng (OpenAI)
- Representation Learning with Contrastive Predictive Coding
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- SentenceTransformer Losses Documentation
- Hard Negative Mining Used by BGE Text Embedding Models
Cross-encoders and bi-encoders are two types of models used for text retrieval tasks. The main difference between the two is how they encode the query and the document.
Rerankers are usually cross-encoders: they encode the query and the document together and calculate the similarity between the two. This allows them to capture the interaction between the query and the document, and to produce better results than bi-encoders, at the cost of much higher computational complexity.
Text embedding models are usually bi-encoders: they encode the query and the document separately and calculate the similarity between the two embeddings. This allows them to be more computationally efficient than cross-encoders, but they are not able to capture the explicit interaction between the query and the document.
Single-vector dense representations are often the norm in text embedding models; they're usually produced by pooling the contextualized embeddings after a forward pass through the model, with pooling techniques including mean pooling, max pooling, and CLS token pooling. The intuition behind single-vector dense representations is that they are simple to implement, usable for a wide range of tasks, and easy to index and retrieve. Dense representations are also able to capture the semantics of the text, and are often used in second-stage ranking.
Multi-vector dense representations have been shown to produce superior results to single-vector dense representations; they are produced by skipping the pooling step and keeping the contextualized embeddings in the form of a matrix. The query and document embeddings are then compared using an operator such as MaxSim to calculate their similarity; models such as ColBERT have shown superior results with this approach. The intuition behind multi-vector dense representations is that they capture more information about the text; models such as ColBERT also offer the ability to precompute document embeddings, allowing for very efficient retrieval.
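A small sketch of the MaxSim operator used in late-interaction scoring; the shapes and values are toy stand-ins, not ColBERT's actual implementation.

```python
# ColBERT-style late interaction: for each query token vector, take its maximum
# similarity over all document token vectors, then sum over the query tokens.
import torch

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    # query_tokens: (num_query_tokens, dim), doc_tokens: (num_doc_tokens, dim)
    sim = query_tokens @ doc_tokens.T    # token-to-token similarities
    return sim.max(dim=1).values.sum()   # MaxSim per query token, then sum

query = torch.nn.functional.normalize(torch.randn(5, 128), dim=-1)
doc = torch.nn.functional.normalize(torch.randn(40, 128), dim=-1)
print(maxsim_score(query, doc))
```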
Recommended Reading:
Sparse text representations are the oldest form of vector space models in information retrieval; they are usually based on TF-IDF derivatives and algorithms such as BM25, and remain a baseline for text retrieval systems. Their sparsity stems from the fact that the dimension of the embeddings often corresponds to the size of the vocabulary. The intuition behind sparse representations is that they are explainable, computationally efficient, easy to implement, and extremely efficient for indexing and retrieval. Sparse representations also focus on lexical similarity, and are often used in first-stage ranking.
Recommended Reading:
Sparse text embeddings allow for the use of inverted indices during retrieval.
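A tiny sketch of an inverted index: each term maps to the documents that contain it, so a query only needs to touch documents sharing at least one term with it.

```python
# A tiny inverted index: each term maps to the set of document IDs containing it.
from collections import defaultdict

docs = {0: "the cat sat on the mat", 1: "dogs chase cats", 2: "vector databases store embeddings"}

inverted_index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

query_terms = "cat mat".split()
candidates = set().union(*(inverted_index.get(t, set()) for t in query_terms))
print(candidates)  # {0}
```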
Recommended Reading:
Metrics for benchmarking the performance of an embedding model include (a minimal implementation of a couple of these follows this list):
- Recall
- Precision
- Hit Rate
- Mean Reciprocal Rank (MRR)
- Normalized Discounted Cumulative Gain (NDCG)
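Here is a minimal, illustrative implementation of two of these metrics for a single query; MRR is simply the mean of the reciprocal rank over a set of queries.

```python
# Minimal retrieval metrics: recall@k and reciprocal rank for one query,
# given the ranked document IDs and the set of relevant IDs.
def recall_at_k(ranked_ids, relevant_ids, k):
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d9"}
print(recall_at_k(ranked, relevant, k=3))   # 0.5  (one of two relevant docs in the top 3)
print(reciprocal_rank(ranked, relevant))    # ~0.33 (first relevant doc at rank 3)
```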
Recommended Reading and Viewing:
- Evaluation measures (information retrieval)
- Introduction to Information Retrieval, Chapter 8
- Evaluation Metrics for Search and Recommendation Systems
- ir-measures
Picking an embedding model could be a pivotal factor in the performance of your retrieval system, and careful consideration should be taken when choosing one. It is a broad process that involves experimentation, and the following resources will help you make an informed decision:
Recommended Viewing:
- Massive Text Embedding Benchmark (MTEB) Leaderboard
- Evaluation measures (information retrieval)
- Introduction to Information Retrieval, Chapter 8
- Evaluation Metrics for Search and Recommendation Systems
A vector database is a database that is optimized for storing and querying vector data. It allows for efficient storage and retrieval of vector embeddings, and is often used in applications that require semantic similarity search. Vector databases are a new paradigm that has emerged as part of the tech stack needed to keep up with the demands of GenAI applications.
Recommended Viewing:
Traditional databases are optimized for storing and querying structured data, such as text, numbers, and dates. They are not designed to handle vector data efficiently. Vector databases, on the other hand, are specifically designed to store and query vector data. They use specialized indexing techniques and algorithms to enable fast and accurate similarity search such as quantization and clustering of vectors.
A vector database usually contains indexes of vectors; these indexes store matrices of vector embeddings, often alongside a graph data structure, ordered in such a way that they can be queried efficiently. When a query is made, either text or a vector embedding is provided as input; in the case of text, it is embedded first, and the vector database queries the appropriate index to retrieve the most similar vectors based on distance metrics. Usually, vectors are compared using metrics such as cosine similarity, dot product, or Euclidean distance. Each vector also relates to a dictionary of metadata that could contain information such as the document ID, the document title, the corresponding text, and more.
Search strategies in vector databases include (a small sketch contrasting the two follows this list):
- Exhaustive Search: This strategy involves comparing the query vector with every vector in the database to find the most similar vectors.
- Approximate Search: This strategy involves using approximate algorithms such as Hierarchical Navigable Small Worlds (HNSW) to find the most similar vectors. At indexing time, a graph is built over the vectors; at query time, the graph is traversed to find the most similar vectors.
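A small sketch contrasting the two strategies with faiss on random data; the parameters (e.g. the HNSW connectivity) are illustrative.

```python
# Exhaustive (flat) search vs. approximate HNSW search with faiss.
import faiss
import numpy as np

d, n = 64, 10_000
vectors = np.random.rand(n, d).astype("float32")
query = np.random.rand(1, d).astype("float32")

flat = faiss.IndexFlatL2(d)        # exhaustive: compares against every vector
flat.add(vectors)

hnsw = faiss.IndexHNSWFlat(d, 32)  # approximate: graph-based HNSW, 32 links per node
hnsw.add(vectors)

for name, index in [("flat", flat), ("hnsw", hnsw)]:
    distances, ids = index.search(query, 5)
    print(name, ids[0])
```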
Recommended Reading:
- Foundations of Vector Retrieval by Sebastian Bruch, Part II Retrieval Algorithms
- Approximate Near Neighbor Search
- Faiss Index Wiki
- Hierarchical Navigable Small Worlds
Explain the concept of clustering in the context of vector retrieval, and its relation to the search space
Once the vectors are indexed, they are often clustered to reduce the search space; this limits the number of vectors that need to be compared during the search process. Clustering is done by grouping similar vectors together and then indexing the clusters. When a query is made, the search is first performed at the cluster level, and then at the vector level within the cluster. Algorithms such as K-means are often used for clustering.
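A sketch of the idea on random data: cluster with k-means, then scan only the cluster closest to the query (this is essentially what IVF-style indexes do); the sizes and parameters are arbitrary.

```python
# Clustering-based (IVF-style) search-space reduction: cluster the vectors with
# k-means, then only scan the cluster closest to the query.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5_000, 64)).astype("float32")
query = rng.normal(size=(64,)).astype("float32")

kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(vectors)
nearest_cluster = kmeans.predict(query[None, :])[0]

# Search only within the nearest cluster instead of all 5,000 vectors.
member_ids = np.where(kmeans.labels_ == nearest_cluster)[0]
distances = np.linalg.norm(vectors[member_ids] - query, axis=1)
print(member_ids[np.argsort(distances)[:5]])  # approximate top-5 neighbours
```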
Recommended Reading:
- UNum USearch Documentation
- K-means Clustering
- Faiss Index Wiki
- Foundations of Vector Retrieval by Sebastian Bruch, Part II Retrieval Algorithms, Chapter 7 Clustering
- Hierarchical Navigable Small Worlds
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
Compare between various vector databases and indexes based on a set of criteria, and decide which one is best for a given use case
This is obviously a very loaded question, but here are some resources to explore this topic further:
Vector quantization, also called "block quantization" or "pattern matching quantization" is often used in lossy data compression. It works by encoding values from a multidimensional vector space into a finite set of values from a discrete subspace of lower dimension.
Reference: Vector Quantization
One general approach to LSH is to “hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are.
Reference: Mining of Massive Datasets, 3rd Edition, Chapter 3, Section 3.4.1
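A minimal sketch of the random-hyperplane flavor of LSH for cosine similarity (my own toy illustration, not the book's implementation): similar vectors tend to land in the same bucket.

```python
# Random-hyperplane LSH: each vector is hashed to a short bit signature;
# similar vectors are likely to share a bucket.
import numpy as np

rng = np.random.default_rng(42)
dim, n_bits = 128, 16
hyperplanes = rng.normal(size=(n_bits, dim))

def lsh_bucket(vector: np.ndarray) -> tuple:
    # One bit per hyperplane: which side of the hyperplane the vector falls on.
    return tuple((hyperplanes @ vector > 0).astype(int))

v = rng.normal(size=dim)
noisy_v = v + 0.05 * rng.normal(size=dim)    # a near-duplicate
print(lsh_bucket(v) == lsh_bucket(noisy_v))  # usually True for similar vectors
```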
Recommended Reading:
In short, PQ is the process of:
- Taking a big, high-dimensional vector,
- Splitting it into equally sized chunks — our subvectors,
- Assigning each of these subvectors to its nearest centroid (also called reproduction/reconstruction values),
- Replacing these centroid values with unique IDs — each ID represents a centroid
Reference: Product Quantization
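A toy sketch of those steps using scikit-learn's k-means per sub-space; the sub-vector and centroid counts are illustrative.

```python
# Product quantization sketch: split each vector into sub-vectors, run k-means
# per sub-space, and store only the centroid IDs (the "codes").
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1_000, 64))
n_subvectors, n_centroids = 8, 256  # 64-dim vector -> 8 sub-vectors of 8 dims
sub_dim = vectors.shape[1] // n_subvectors

codebooks, codes = [], []
for i in range(n_subvectors):
    sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
    km = KMeans(n_clusters=n_centroids, n_init=1, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)
    codes.append(km.labels_)             # one small integer ID per sub-vector

codes = np.stack(codes, axis=1)          # (1000, 8): 8 IDs instead of 64 floats
reconstructed = np.hstack([codebooks[i][codes[:, i]] for i in range(n_subvectors)])
print(codes.shape, np.mean((vectors - reconstructed) ** 2))  # compression error
```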
Recommended Reading:
The Inverted File (IVF) index reduces the search scope through clustering.
Reference: Inverted File Index
Recommended Reading:
Explain the concept of Hierarchical Navigable Small Worlds (HNSW) within the context of vector retrieval
Hierarchical Navigable Small Worlds (HNSW) is often considered the state of the art in vector retrieval; it is a graph-based algorithm that builds a graph of the vectors and uses it to perform approximate nearest neighbor search.
Recommended Reading:
- Hierarchical Navigable Small Worlds - Wikipedia
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
- Hierarchical Navigable Small Worlds - Faiss: The Missing Manual
Distance and similarity metrics used in vector retrieval include:
- Euclidean Distance
- Dot Product
- Cosine Similarity
- Pearson Correlation Coefficient
- Jaccard Similarity
- Dice-Sørensen Coefficient
- Hamming Distance
- Haversine Distance
Recommended Viewing:
Explain some of the architectures and patterns in retrieval systems and semantic search in the context of LLM systems
This is a very active research topic, and no authoritative source exists, but here are some resources to explore this topic further:
- Retrieval-Augmented Generation for Large Language Models: A Survey
- Seven Failure Points When Engineering a Retrieval Augmented Generation System
- Levels of Complexity: RAG Applications
- Beyond the Basics of Retrieval for Augmenting Generation (w/ Ben Clavié)
It is also worth noting that search, retrieval and reranking systems are built on established patterns and architectures in the fields of information retrieval, recommendation systems, and search engines.
Some system architectures you might want to explore include:
Achieving good search in large-scale systems involves a combination of efficient indexing, retrieval, and ranking techniques. Some strategies to achieve good search in large-scale systems include:
- Using a lightweight retriever to generate a set of candidate documents; algorithms such as BM25 and keyword search are often used for this purpose.
- Using a more complex retriever to generate a set of candidate documents; models such as ColBERT, ColPali, and BGE-M3 are often used for this purpose, and the current state of the art leans towards multi-vector (late interaction) mechanisms for dense retrieval.
- Using a reranker to re-rank the candidate documents generated by the retriever; models such as mixedbread-ai/mxbai-rerank-large-v1 and BAAI/bge-reranker-large are often used for this purpose.
- Combining the results of multiple retrieval methodologies, such as dense and sparse, using Reciprocal Rank Fusion (a tiny sketch of this appears below).
- Implementing query understanding and decomposition to break down complex queries into simpler sub-queries, and then recombining the results.
- Using metadata to filter results; no system works on semantic and vector similarity alone.
- Training a Learning-to-Rank model to optimize the ranking of the search results for your use case; this is often done using a combination of features such as BM25 scores, vector similarity scores, and metadata.
You might notice that the entire process is done in phases of increasing complexity; this is known as phased ranking or multistage retrieval.
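Since Reciprocal Rank Fusion comes up above, here is a tiny sketch of it; k=60 is the constant commonly used following the original RRF paper, and the rankings are made up.

```python
# Reciprocal Rank Fusion (RRF): combine rankings from different retrievers
# (e.g. sparse BM25 and dense vector search) by summing 1 / (k + rank).
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d4", "d2"]
dense_ranking = ["d2", "d1", "d5"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```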
Recommended Reading:
- Phased Ranking by Vespa
- Semantic Models for the First-stage Retrieval: A Comprehensive Review
- Information Retrieval Meets Large Language Models: A Strategic Report from Chinese IR Community
But the most important aspect of achieving good search in large-scale systems is to experiment and iterate on your retrieval and ranking strategies, and to continuously monitor and evaluate the performance of your system.
Recommended Reading:
- Evaluation measures (information retrieval)
- Introduction to Information Retrieval, Chapter 8
- Evaluation Metrics for Search and Recommendation Systems
- ir-measures
- Good Authority Figure on IR, one Tweet as a starting point
Recommended talks about improving search, retrieval and RAG systems:
Achieving fast search involves optimizing the indexing and retrieval process, which takes non-trivial engineering effort; the following are some examples of the current landscape in the field of search and retrieval optimization:
- Binary Quantization - Vector Search, 40x Faster
- Matryoshka 🤝 Binary vectors: Slash vector search costs with Vespa
- Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval
The current state of the art in vector retrieval indicates that multi-vector (late interaction) embeddings perform better than single-vector embeddings; however, optimizing their retrieval poses a significant engineering challenge. The following resources discuss multi-vector embeddings and their retrieval in depth:
- Scaling ColPali to billions of PDFs with Vespa
- PLAID: An Efficient Engine for Late Interaction Retrieval
- Efficient Multi-Vector Dense Retrieval with Bit Vectors
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document.
Reference: BM25
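A compact, illustrative BM25 scorer over a toy corpus, using the common default parameters k1=1.5 and b=0.75; production systems should use an established implementation instead.

```python
# Toy BM25: score = sum over query terms of IDF(t) * tf*(k1+1) / (tf + k1*(1 - b + b*|D|/avgdl)).
import math
from collections import Counter

docs = ["the cat sat on the mat", "dogs chase cats", "the mat is red"]
tokenized = [d.split() for d in docs]
avgdl = sum(len(d) for d in tokenized) / len(tokenized)
N = len(tokenized)

def idf(term):
    n_t = sum(term in d for d in tokenized)
    return math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)

def bm25(query, doc_tokens, k1=1.5, b=0.75):
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query.split():
        freq = tf[term]
        denom = freq + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf(term) * freq * (k1 + 1) / denom
    return score

for doc, tokens in zip(docs, tokenized):
    print(round(bm25("cat mat", tokens), 3), doc)
```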
Reranking models are sequence classification models trained to take a query-document pair and output a raw similarity score.
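A minimal sketch of reranking with a cross-encoder via sentence-transformers; the model name is one commonly used checkpoint, chosen here for illustration only.

```python
# Cross-encoder reranking: the query and each candidate document are scored jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "how tall is mount everest"
candidates = [
    "Mount Everest is 8,849 metres tall.",
    "Everest Base Camp is a popular trekking destination.",
]
scores = reranker.predict([(query, doc) for doc in candidates])
print(sorted(zip(scores, candidates), reverse=True))  # highest score first
```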
Recommended Reading, Viewing and Watching:
- Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline
- SentenceTransformers Documentation on Cross-Encoders
- Reranker Library
- Rerankers and Two-Stage Retrieval
- Training a Language Model for Reranking (RankZephyr)
Evaluating RAG systems requires experimenting with and evaluating the individual components of the system, such as the retriever, generator, and reranker.
Recommended Reading:
- RAGAS: An Evaluation Framework for Retrieval Augmented Generation
- Evaluating Retrieval Augmented Generation using RAGAS
- Evaluating Retrieval Augmented Generation - a framework for assessment
- Evaluation measures (information retrieval)
- Introduction to Information Retrieval, Chapter 8
- Evaluation Metrics for Search and Recommendation Systems
- ir-measures
- Good Authority Figure on IR, one Tweet as a starting point
Note: from here on, I'll refrain from answering as much as I can, and just link papers and references, this part is arguably one of the more complex parts, so it requires a lot of reading and understanding.
To understand attention, you'll need to be familiar with the Transformer architecture, and their predecessor architectures. Here are some resources to get you started:
- Neural Machine Translation by Jointly Learning to Align and Translate
- An Introductory Survey on Attention Mechanisms in NLP Problems
- Attention is All You Need
- Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch
- The Illustrated Transformer
- The Annotated Transformer
- Build a Large Language Model (From Scratch)
- Transformer Taxonomy
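Even though this part mostly defers to the papers above, a minimal sketch of the core operation, scaled dot-product attention, may help anchor the reading; this is my own illustration (single head, no masking).

```python
# Scaled dot-product attention, the core operation of the Transformer:
# softmax(Q K^T / sqrt(d_k)) V.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (seq_q, seq_k) attention logits
    weights = F.softmax(scores, dim=-1)            # each query attends over all keys
    return weights @ v

q = torch.randn(4, 64)   # 4 query positions, dimension 64
k = torch.randn(6, 64)   # 6 key/value positions
v = torch.randn(6, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([4, 64])
```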
The main bottleneck of self-attention is its quadratic complexity with respect to the sequence length. To understand the disadvantages of self-attention, you'll need to familiarize yourself with attention alternatives; the following will help you get started:
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- A Visual Guide to Mamba and State Space Models
- RWKV: Reinventing RNNs for the Transformer Era
- Hyena Hierarchy: Towards Larger Convolutional Language Models (Paper)
- Hyena Hierarchy: Towards Larger Convolutional Language Models
- Hungry Hungry Hippos: Towards Language Modeling with State Space Models
There are multiple ways to encode positional information in LLMs. The most common classical approach is sinusoidal positional encodings, a form of absolute positional encoding. Other methods include relative positional encodings, and newer methods such as Rotary Positional Embeddings (RoPE). Here are some resources to get you started:
- Transformer Architecture: The Positional Encoding
- Understanding positional encoding in Transformers
- RoFormer: Enhanced Transformer with Rotary Position Embedding
To understand KV Cache, you'll need to be familiar with the Transformer architecture and its limitations.
Recommended Reading:
- Transformers Optimization: Part 1 - KV Cache
- Transformer Inference Arithmetic
- Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Mixture of Experts (MoE) is a type of architecture used in LLMs; to understand how it works, you should go through the following resources, which cover the most prominent MoE models:
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Mixtral of Experts
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
- Introducing DBRX: A New State-of-the-Art Open LLM
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- MegaBlocks
- Grok-1
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints