Topics
Improving the quality, relevance, style, and consistency of Large Language Model (LLM) outputs often involves techniques like prompt engineering, Retrieval Augmented Generation (RAG), and LLM fine-tuning.
Prompt Engineering
This involves carefully crafting the input prompt given to the LLM to guide its response generation process effectively.
- Priming: Setting the context or persona (e.g., “You are a helpful assistant specializing in topic X”)
- Style/Tone: Explicitly instructing the desired output style (e.g., “Use layman terms,” “Respond formally”)
- Error Handling: Defining how the LLM should behave with edge cases or irrelevant inputs (e.g., “If the question is off-topic, politely decline”)
- Dynamic Content: Incorporating user input or variables into the prompt structure
- Output Formatting: Specifying the desired output structure (e.g., “Respond in valid JSON format: {"response": "..."}”)
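As a rough illustration, the techniques above can be combined into a single prompt template. The persona, tone rules, and question below are purely illustrative; a minimal sketch in Python string-templating terms might look like this:

```python
# A minimal prompt template combining priming, style/tone, error handling,
# dynamic content, and output formatting (all wording here is illustrative).
PROMPT_TEMPLATE = """You are a helpful assistant specializing in personal finance.
Use layman terms and keep the answer under 150 words.
If the question is off-topic, politely decline to answer.

Question: {user_question}

Respond in valid JSON format: {{"response": "..."}}"""

# Dynamic content: the user's input is injected into the template at request time.
prompt = PROMPT_TEMPLATE.format(user_question="How does compound interest work?")
```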
Retrieval Augmented Generation (RAG)
RAG enhances LLM responses by providing relevant, up-to-date external knowledge directly within the prompt context. This combats hallucination and grounds the response in specific data.
- Preparation:
- Collect relevant documents (corpus)
- Split documents into manageable, meaningful chunks
- Generate vector embeddings for each chunk using an embedding model
- Store chunks and their embeddings in a vector database for efficient searching
- Retrieval Process:
- Embed the user’s query using the same embedding model
- Search the vector database for the top N chunks most similar (semantically relevant) to the query embedding
- Construct an augmented prompt including the original query and the retrieved knowledge chunks (e.g., “Answer the query '{inquiry}' using this information if relevant: '{knowledge}'”)
- Feed the augmented prompt to the LLM to generate the final response (see the pipeline sketch after this list)
- Advanced RAG: vanilla RAG has some shortcomings, so here are a few tips for better RAG performance (see the second sketch below):
- Query Pre-processing: Use an LLM to refine or simplify the user query before embedding
- Filtering: After initial retrieval, use an LLM to assess which retrieved chunks are most applicable to the specific query
- Self-Reflection: After generation, ask the LLM (can be same or different) to evaluate its own answer for accuracy and helpfulness, potentially rewriting it if needed
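A minimal sketch of the preparation and retrieval steps above. The `embed` and `generate` helpers are placeholders standing in for a real embedding model and LLM call, and the in-memory similarity search stands in for a vector database; everything here is an assumption to keep the sketch self-contained.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(384)

def generate(prompt: str) -> str:
    # Placeholder completion: a real system would call an LLM here.
    return f"[LLM response to a {len(prompt)}-character prompt]"

# --- Preparation: split documents into chunks and embed each one ---
def chunk(document: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real pipelines often split on sentences or sections.
    return [document[i:i + size] for i in range(0, len(document), size)]

corpus = ["...document one...", "...document two..."]
chunks = [c for doc in corpus for c in chunk(doc)]
chunk_vectors = np.array([embed(c) for c in chunks])  # a vector DB would store these

# --- Retrieval: embed the query and find the top-N most similar chunks ---
def retrieve(query: str, n: int = 3) -> list[str]:
    q = embed(query)
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:n]  # cosine similarity, highest first
    return [chunks[i] for i in top]

# --- Augmentation: combine the query with the retrieved knowledge ---
inquiry = "What is the refund policy for digital goods?"
knowledge = "\n".join(retrieve(inquiry))
augmented_prompt = f"Answer the query '{inquiry}' using this information if relevant: '{knowledge}'"
answer = generate(augmented_prompt)
```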
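The advanced tips can be layered onto that same pipeline: one LLM pass to refine the query before embedding, one to filter the retrieved chunks, and one to critique the draft answer. This reuses the `retrieve` and `generate` helpers from the sketch above, and the prompt wording is only illustrative.

```python
def refine_query(raw_query: str) -> str:
    # Query pre-processing: ask an LLM to rewrite the query before embedding it.
    return generate(f"Rewrite this question as a short, self-contained search query: {raw_query}")

def filter_chunks(query: str, candidates: list[str]) -> list[str]:
    # Filtering: keep only the chunks an LLM judges applicable to this query.
    kept = []
    for passage in candidates:
        verdict = generate(f"Does this passage help answer '{query}'? Reply YES or NO.\n\n{passage}")
        if "YES" in verdict.upper():
            kept.append(passage)
    return kept

def answer_with_reflection(raw_query: str) -> str:
    query = refine_query(raw_query)
    knowledge = "\n".join(filter_chunks(query, retrieve(query)))
    draft = generate(f"Answer the query '{query}' using this information if relevant: '{knowledge}'")
    # Self-reflection: ask an LLM to evaluate the draft and rewrite it if needed.
    return generate(
        f"Question: {query}\nDraft answer: {draft}\n"
        "Is the draft accurate and helpful? If not, rewrite it; otherwise repeat it unchanged."
    )
```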
Fine-Tuning LLMs
Fine-tuning involves further training a pre-trained foundational LLM on a dataset of specific prompt-completion examples relevant to a particular task or domain.
- Use Cases:
- Teaching nuanced tasks or intuition difficult to capture in prompt instructions alone
- Consistently enforcing a specific style, tone, or format (baking it into the model)
- Reducing the length and complexity of prompts needed during inference
- Training smaller, more specialized models to perform well on specific tasks, optimizing for speed/cost
- Constraining the model’s output to a narrower, desired range
- Strategies:
- Quality Focus: Fine-tune a larger, more capable base model on high-quality examples
- Speed/Cost Focus: Fine-tune a smaller base model on a larger dataset of examples
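To make the prompt-completion examples above concrete, here is a rough sketch of a training file. The chat-style JSONL layout below is a common convention among hosted fine-tuning APIs, but the exact schema (field names, file format) depends on the provider, and the support-agent content is invented for illustration.

```python
import json

# Each training example pairs an input (system + user messages) with the completion
# we want the model to learn, baking the style and format into the weights.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a support agent for Acme Widgets."},
            {"role": "user", "content": "My widget won't turn on."},
            {"role": "assistant", "content": "Sorry to hear that! First, hold the power button for ten seconds..."},
        ]
    },
    # ...hundreds or thousands more examples covering the task and its edge cases...
]

# Write one JSON object per line (JSONL), the format many fine-tuning services expect.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```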
Note
A popular technique called few-shot prompting involves providing worked examples directly within the prompt’s context window. With fine-tuning, those examples move into a training dataset instead, which scales better to many scenarios and avoids inflating prompt length and cost at inference time.
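To make the contrast concrete, a few-shot prompt carries its worked examples inside every request, as in this small sketch (the reviews are made up); fine-tuning would instead move equivalent examples into a training file like the one sketched above.

```python
# Few-shot prompting: worked examples live inside the prompt itself, so they
# consume context window (and cost) on every single request.
FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

Review: "Arrived quickly and works great." -> positive
Review: "Broke after two days." -> negative
Review: "{new_review}" ->"""

prompt = FEW_SHOT_PROMPT.format(new_review="Exactly what I needed, five stars.")
```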
Combining Retrieval and Tuning
Using RAG with a fine-tuned model often yields the best results. The fine-tuning helps the model understand the task, style, and format implicitly, while RAG provides the necessary external knowledge dynamically at inference time.
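As a final sketch of how the two combine, reusing the `retrieve` and `generate` placeholders from the RAG example above: retrieval supplies fresh knowledge at inference time, while a fine-tuned model already knows the task, tone, and output format, so the remaining prompt can stay short.

```python
# Retrieval still runs at inference time, but the prompt no longer needs lengthy
# style/format instructions or few-shot examples, since those were fine-tuned in.
inquiry = "Can I get a refund on a digital download?"
knowledge = "\n".join(retrieve(inquiry))
short_prompt = f"Context:\n{knowledge}\n\nQuestion: {inquiry}"
answer = generate(short_prompt)  # in practice, this call targets the fine-tuned model
```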