
RAG Concepts & Architecture

Retrieval-Augmented Generation (RAG)

The Problem: Knowledge Cutoff

LLMs are trained on data up to a certain date:

  • GPT-4's training data ends around April 2023 (the exact cutoff varies by model version)
  • Anything after that date, the model simply doesn't know
User: "What happened on May 1, 2024?"
GPT-4: "I don't know, my training ended before that."

The Solution: RAG

Instead of retraining the model:

  1. Store your documents in a database
  2. When user asks a question:
    • Search documents for relevant info
    • Give relevant info to LLM
    • LLM answers based on context
User: "What are the new policies?"

System:
  1. Search knowledge base for "policies"
  2. Find: "New policy document from May 2024"
  3. Pass to LLM: "Based on this document: [text]... Answer: what are the new policies?"
  4. LLM: "The new policies are..."

Why RAG is Powerful

Approach     | Cost  | Speed     | Freshness | Accuracy
-------------|-------|-----------|-----------|----------
Fine-tuning  | $$$$$ | Slow      | Days      | Good
RAG          | $$    | Fast      | Real-time | Excellent
No knowledge | $     | Very Fast | N/A       | Poor

RAG Architecture

Documents → Embedding → Vector Database
                            ↑
                            | (similarity search)
User Question → Embedding → Search → Top K Results → LLM → Answer

Step 1: Vectorize Documents

Document: "Python is a programming language"
Embedding: [0.2, -0.5, 0.8, 0.1, ...] (1024D vector)

Embedding captures semantic meaning!
Similar documents → similar embeddings
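
Real systems get these vectors from an embedding model (such as OpenAI's embeddings API or sentence-transformers). As a stand-in, here is a toy bag-of-words "embedding" in pure Python; it is lexical rather than semantic, but it still shows the key property that related texts score higher under cosine similarity:

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    # Toy lexical "embedding": word counts. A real system would call an
    # embedding model here and get back a dense vector (e.g. 1024 floats).
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc = embed("Python is a programming language")
related = embed("How do I learn the Python language")
unrelated = embed("The weather is sunny today")

print(cosine(doc, related))    # higher: shares "python" and "language"
print(cosine(doc, unrelated))  # lower: almost no word overlap
```

The `embed` and `cosine` helpers are illustrative only; swapping `embed` for a real model call is what turns this lexical match into a semantic one.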

Step 2: Search for Relevant Documents

User question: "How do I learn Python?"
Question embedding: [0.25, -0.48, 0.75, 0.12, ...] (similar to "Python document"!)

Search: Find K documents with highest similarity
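
Given precomputed vectors, retrieval is just a ranked similarity search. A minimal sketch, with made-up 4-dimensional vectors standing in for real embedding-model output:

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Pretend these came from an embedding model (the numbers are made up).
docs = {
    "Python is a programming language": [0.20, -0.50, 0.80, 0.10],
    "Bread recipes for beginners":      [-0.70, 0.30, -0.10, 0.60],
    "Learn to code in Python":          [0.25, -0.45, 0.70, 0.15],
}
query_vec = [0.25, -0.48, 0.75, 0.12]  # embedding of "How do I learn Python?"

# Top-K search: rank all documents by similarity to the query, keep the best K.
top_k = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)[:2]
print(top_k)  # the two Python-related documents rank above the bread recipes
```

A vector database does exactly this ranking, but with index structures (e.g. approximate nearest-neighbor search) so it scales past a brute-force loop.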

Step 3: Pass to LLM with Context

System message: "Use these documents to answer:"
Documents: [retrieved documents]
User question: "How do I learn Python?"

LLM: [answers based on context]
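
Packing the retrieved documents into the request is plain string work. The sketch below uses the common chat-completions message format; the actual LLM call is left out, since the client library and model name depend on your provider:

```python
def build_rag_messages(question, retrieved_docs):
    """Pack retrieved context into system/user messages for a chat LLM."""
    context = "\n\n".join(
        f"[Doc {i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1)
    )
    return [
        {"role": "system",
         "content": "Use these documents to answer:\n" + context},
        {"role": "user", "content": question},
    ]

messages = build_rag_messages(
    "How do I learn Python?",
    ["Python is a programming language", "Learn to code in Python"],
)
# `messages` is now ready to send to any chat-completions style API.
print(messages[0]["content"])
```

Numbering the documents (`[Doc 1]`, `[Doc 2]`) is a small convention that also lets the model cite which document an answer came from.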

Practical Example: Customer Support Bot

Company stores:
- Product manuals
- FAQs
- Support tickets
- Policies

Customer: "How do I return an item?"
RAG system:
  1. Search knowledge base → Find return policy
  2. Pass to LLM with policy
  3. LLM: "According to our policy: [details] Steps: [steps]"
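
The retrieval-and-prompt half of such a bot fits in a few lines. The knowledge base, the word-overlap scoring, and the prompt wording below are all illustrative stand-ins; a production bot would use an embedding model and a vector database instead:

```python
import re

KNOWLEDGE_BASE = [
    "Return policy: items may be returned within 30 days with a receipt.",
    "Shipping: standard delivery takes 3 to 5 business days.",
    "Warranty: electronics carry a one-year limited warranty.",
]

def retrieve(question, k=1):
    # Score each document by word overlap with the question -- a crude
    # stand-in for vector similarity search.
    q_words = set(re.findall(r"[a-z]+", question.lower()))
    def score(doc):
        return len(q_words & set(re.findall(r"[a-z]+", doc.lower())))
    return sorted(KNOWLEDGE_BASE, key=score, reverse=True)[:k]

def build_prompt(question, docs):
    context = "\n".join(docs)
    return f"Use these documents to answer:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("How do I return an item?",
                      retrieve("How do I return an item?"))
# `prompt` would now go to the LLM, which answers from the policy text.
print(prompt)
```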

Vector Databases (Tools)

Tool     | Best For
---------|----------------------
Pinecone | Managed, easy
Weaviate | Open source, flexible
Milvus   | Scalable, enterprise
Chroma   | Local/small projects
Qdrant   | Performance

RAG vs Fine-tuning

Use RAG when:
- Knowledge changes frequently
- Need up-to-date info
- Multiple knowledge sources
- Quick implementation needed

Use Fine-tuning when:
- Want to change model behavior/style
- Need performance optimization
- Training data is stable
- Cost not a concern