Page6/9
Building & Deploying LLM Applications Β· Page 1 of 1
LLM Application Architecture
Building LLM Applications
Basic LLM Application Pattern
User Input
β
Process/Validate Input
β
Send to LLM (with prompt engineering)
β
Process Output
β
Return to User
Using LLM APIs
Most developers use APIs instead of running models locally:
Services:
- OpenAI API (GPT-4, GPT-3.5)
- Anthropic API (Claude)
- Google PaLM API
- Open-source: HuggingFace Inference API
Benefits:
- No need to run models locally (expensive!)
- Latest model versions
- Pay per use
- Handled scaling & infrastructure
Building a Chatbot
Architecture:
1. User sends message
2. Store in conversation history
3. Send history + new message to LLM
4. Store response in history
5. Return response to user
Key: Keep conversation context!
Conversation Memory
Without memory:
User: "My name is Alice"
LLM: "Nice to meet you Alice"
User: "What's my name?"
LLM: "I don't know"
With memory (context window):
User: "My name is Alice"
Context: "My name is Alice"
User: "What's my name?"
Context: "[previous messages], User: 'What's my name?'"
LLM: "Your name is Alice"
LLM Agents (Autonomous Behaviors)
An agent uses an LLM plus tools:
Tools available:
- Calculator
- Web search
- Database query
- Send email
Agent loop:
1. User asks question
2. LLM decides which tool to use
3. Tool executes
4. LLM gets result
5. Returns answer or asks for more tools
Example:
User: "What's the weather and stock price of Apple?"
LLM: "I need weather API and stock API"
β Calls weather API
β Calls stock API
β Combines results: "Weather is sunny, Apple stock is $150"
Cost Optimization
Expensive:
- Sending entire conversation history every request
- Using expensive models for simple tasks
- Running models locally
Optimized:
- Cache common prompts
- Use cheaper models (GPT-3.5 vs GPT-4)
- Use RAG instead of fine-tuning
- Batch requests
- Use open-source models (LLaMA) for simple tasks
Handling Limitations
Hallucinations (Making Stuff Up)
LLM: "Einstein won the Nobel Prize in Physics in 1921"
Reality: True! (just checking you know)
LLM: "Python's creator is Steve Jobs"
Reality: False! (Guido van Rossum) - This is hallucination
Solutions:
1. Fact-check with RAG/knowledge base
2. Ask model for sources
3. Use smaller models trained for factuality
4. Fine-tune on factual data
Token Limits
GPT-4: 8K-128K tokens per request
Claude 3: Up to 200K tokens
Solutions:
- Summarize long conversations
- Keep relevant context only
- Use embeddings to find important parts
Latency
Problem: API calls can be slow
- OpenAI API: 1-10 seconds
- Local models: 100-500ms
Solutions:
- Cache common requests
- Use streaming (show response as it comes)
- Queue requests if needed
Deployment Patterns
Pattern 1: Hosted API
- Send requests to provider (OpenAI, Anthropic)
- Simplest, no infrastructure
- Pay per request
Pattern 2: Self-hosted open model
- Run model on your servers
- Full control, privacy
- Need GPU resources
Pattern 3: Hybrid
- Critical paths use local models
- General queries use APIs
- Best of both
main.py
Loading...
OUTPUT
βΆClick "Run Code" to executeβ¦