6/9
Building & Deploying LLM Applications Β· Page 1 of 1

LLM Application Architecture

Building LLM Applications

Basic LLM Application Pattern

User Input
    ↓
Process/Validate Input
    ↓
Send to LLM (with prompt engineering)
    ↓
Process Output
    ↓
Return to User

Using LLM APIs

Most developers use APIs instead of running models locally:

Services:
- OpenAI API (GPT-4, GPT-3.5)
- Anthropic API (Claude)
- Google PaLM API
- Open-source: HuggingFace Inference API

Benefits:
- No need to run models locally (expensive!)
- Latest model versions
- Pay per use
- Handled scaling & infrastructure

Building a Chatbot

Architecture:
1. User sends message
2. Store in conversation history
3. Send history + new message to LLM
4. Store response in history
5. Return response to user

Key: Keep conversation context!

Conversation Memory

Without memory:
User: "My name is Alice"
LLM: "Nice to meet you Alice"
User: "What's my name?"
LLM: "I don't know"

With memory (context window):
User: "My name is Alice"
Context: "My name is Alice"
User: "What's my name?"
Context: "[previous messages], User: 'What's my name?'"
LLM: "Your name is Alice"

LLM Agents (Autonomous Behaviors)

An agent uses an LLM plus tools:

Tools available:
- Calculator
- Web search
- Database query
- Send email

Agent loop:
1. User asks question
2. LLM decides which tool to use
3. Tool executes
4. LLM gets result
5. Returns answer or asks for more tools

Example:
User: "What's the weather and stock price of Apple?"
LLM: "I need weather API and stock API"
  β†’ Calls weather API
  β†’ Calls stock API
  β†’ Combines results: "Weather is sunny, Apple stock is $150"

Cost Optimization

Expensive:
- Sending entire conversation history every request
- Using expensive models for simple tasks
- Running models locally

Optimized:
- Cache common prompts
- Use cheaper models (GPT-3.5 vs GPT-4)
- Use RAG instead of fine-tuning
- Batch requests
- Use open-source models (LLaMA) for simple tasks

Handling Limitations

Hallucinations (Making Stuff Up)

LLM: "Einstein won the Nobel Prize in Physics in 1921"
Reality: True! (just checking you know)

LLM: "Python's creator is Steve Jobs"
Reality: False! (Guido van Rossum) - This is hallucination

Solutions:
1. Fact-check with RAG/knowledge base
2. Ask model for sources
3. Use smaller models trained for factuality
4. Fine-tune on factual data

Token Limits

GPT-4: 8K-128K tokens per request
Claude 3: Up to 200K tokens

Solutions:
- Summarize long conversations
- Keep relevant context only
- Use embeddings to find important parts

Latency

Problem: API calls can be slow
- OpenAI API: 1-10 seconds
- Local models: 100-500ms

Solutions:
- Cache common requests
- Use streaming (show response as it comes)
- Queue requests if needed

Deployment Patterns

Pattern 1: Hosted API
- Send requests to provider (OpenAI, Anthropic)
- Simplest, no infrastructure
- Pay per request

Pattern 2: Self-hosted open model
- Run model on your servers
- Full control, privacy
- Need GPU resources

Pattern 3: Hybrid
- Critical paths use local models
- General queries use APIs
- Best of both
main.py
Loading...
OUTPUT
β–ΆClick "Run Code" to execute…