Llama.cpp Provider
Run GGUF-quantized LLaMA models locally via the llama.cpp server with zero API costs, complete privacy, and offline capability.

Overview
| Field | Details |
|---|---|
| Preset ID | llamacpp |
| Aliases | None |
| Default Profile Name | llamacpp |
| Default Model | llama3-8b |
| Base URL | http://127.0.0.1:8080 |
| Auth Method | Local (no API key needed) |
| Category | Recommended |
Quick Start
Prerequisites
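A quick way to confirm the usual llama.cpp build dependencies (git, CMake, a C/C++ compiler) are installed:

```shell
# Verify the standard build toolchain is available
git --version
cmake --version
cc --version
```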
Installing llama.cpp
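A typical from-source build follows llama.cpp's standard CMake flow (binary locations can differ between releases):

```shell
# Clone and build llama.cpp from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# The server binary (llama-server) is produced under build/bin/
```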
Download Model
Download a GGUF-quantized model from Hugging Face:
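For example, using the huggingface-cli tool (the repository and file names below are illustrative placeholders; substitute the GGUF repo you actually want):

```shell
# Download a Q4-quantized LLaMA 3 8B GGUF into the default models directory
# (repo and file names are examples, not endorsements)
mkdir -p ~/Downloads/models
huggingface-cli download \
  QuantFactory/Meta-Llama-3-8B-Instruct-GGUF \
  Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
  --local-dir ~/Downloads/models
```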
Model Selection
Popular GGUF models for coding and general tasks:

| Model | Size | Speed | Quality | RAM | Context |
|---|---|---|---|---|---|
| Phi-3 Mini (Q4) | 2.3GB | Very Fast | Good | 4GB | 4K |
| LLaMA 3 8B (Q4) | 4.7GB | Fast | Good | 8GB | 8K |
| LLaMA 3 70B (Q4) | 40GB | Slow | Excellent | 48GB+ | 8K |
| Mistral 7B (Q4) | 4.2GB | Very Fast | Good | 8GB | 8K |
| Qwen3 Coder (Q4) | 5.2GB | Fast | Excellent (coding) | 8GB | 128K |
Configuration
Default Setup
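A sketch of what the default profile might look like in `~/.ccs/config.yaml` (see Storage Locations below); the exact field names are assumptions about the CCS config schema, not confirmed:

```yaml
# ~/.ccs/config.yaml — hypothetical sketch of the llamacpp profile
profiles:
  llamacpp:
    base_url: http://127.0.0.1:8080   # llama.cpp server default
    model: llama3-8b                  # default model name
```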
Custom Configuration
If the llama.cpp server runs on a different host or port, update the profile's base URL (or ANTHROPIC_BASE_URL) to match.

Starting llama.cpp Server
Basic Usage
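Starting the server with a downloaded model might look like this (the model path is an example; `-m`, `-c`, `--host`, and `--port` are standard llama-server options):

```shell
# Serve a GGUF model on the default host/port expected by the llamacpp profile
./llama-server \
  -m ~/Downloads/models/llama3-8b.Q4_K_M.gguf \
  -c 2048 \
  --host 127.0.0.1 \
  --port 8080
```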
Performance Tuning
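As a starting point, GPU offload and context size are the highest-impact knobs (see the Memory Management table below for when each value makes sense):

```shell
# Offload all layers to the GPU and use a larger context window
./llama-server -m ~/Downloads/models/llama3-8b.Q4_K_M.gguf -c 4096 -ngl 99
```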
Running Multiple Models
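Each model needs its own server instance on its own port (model filenames here are examples); point a separate CCS profile at each port:

```shell
# Run two models on separate ports
./llama-server -m ~/Downloads/models/llama3-8b.Q4_K_M.gguf --port 8080 &
./llama-server -m ~/Downloads/models/mistral-7b.Q4_K_M.gguf --port 8081 &
```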
Usage Examples
Basic Chat
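The llama.cpp server exposes an OpenAI-compatible chat endpoint, so a request can be exercised directly with curl (requires a running server from the section above):

```shell
# Send a chat request to the locally running llama.cpp server
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b",
    "messages": [{"role": "user", "content": "Write a haiku about local inference."}]
  }'
```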
Model-Specific Usage
Streaming and Output Control
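Token streaming is requested the same way as with the OpenAI API, by setting "stream": true in the request body (server must be running):

```shell
# Stream tokens as they are generated instead of waiting for the full reply
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-8b", "stream": true,
       "messages": [{"role": "user", "content": "Count to five."}]}'
```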
Troubleshooting
Connection Refused
Symptom: `Error: connect ECONNREFUSED 127.0.0.1:8080`
Causes & Solutions:
- Server not running — Start llama.cpp server in separate terminal
- Wrong port — Check server is on 8080 or update ANTHROPIC_BASE_URL
- Firewall blocked — Allow localhost connections
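A quick diagnostic using the server's built-in health endpoint:

```shell
# Returns {"status":"ok"} when llama-server is up and the model is loaded
curl http://127.0.0.1:8080/health
```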
Model Not Found or Slow Response
Symptom: Invalid model errors or very slow responses

Solutions:
- Verify the path passed with `-m` points to a valid GGUF file
- For slow responses, try a smaller model or lower quantization, or offload layers to the GPU with `-ngl` (see Performance Optimization below)

Out of Memory
Symptom: `VRAM out of memory` or process crash
Solutions:
- Reduce context — limit with `-c 1024` when starting the server
- Switch quantization — use Q4_0 instead of Q5_K_M
- Reduce layers on GPU — use `-ngl 30` instead of `-ngl 99`
- Use smaller model — switch from the 70B to the 8B variant
Port Already in Use
Symptom: `Address already in use`
Solutions:
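Either free the port or move the server to a new one (the PID below is a placeholder to fill in from the lsof output):

```shell
# Find and stop the process occupying port 8080
lsof -i :8080          # note the PID in the output
kill <PID>
# ...or start llama-server on another port and update ANTHROPIC_BASE_URL
./llama-server -m ~/Downloads/models/llama3-8b.Q4_K_M.gguf --port 8081
```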
Performance Optimization
GPU Acceleration
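GPU backends are enabled at build time; the flag below follows llama.cpp's current CMake options (older releases used different flag names):

```shell
# Rebuild with the NVIDIA CUDA backend enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# On Apple Silicon, the Metal backend is enabled by default
```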
Memory Management
| Setting | Impact | Use When |
|---|---|---|
| `-ngl 99` | All layers on GPU | Plenty of VRAM (GPU-only inference) |
| `-ngl 30` | Partial offload | Limited VRAM (mixed CPU/GPU) |
| `-ngl 0` | CPU only | No GPU or testing |
| `-c 1024` | Small context | Limited RAM (<8GB) |
| `-c 2048` | Default context | 8-16GB RAM |
| `-c 4096` | Large context | 16GB+ RAM |
Batch Size and Threads
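Thread count and batch size can be tuned with the standard `-t` and `-b` flags (the values below are reasonable starting points, not universal recommendations):

```shell
# Match threads to physical cores; larger batches speed up prompt processing
./llama-server -m ~/Downloads/models/llama3-8b.Q4_K_M.gguf -t 8 -b 512
```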
Cost Analysis
| Factor | Cost |
|---|---|
| API costs | $0 (free) |
| Hardware | GPU (optional): $200-800, or use CPU |
| Electricity | ~50-150W continuous (varies by model/GPU) |
| Privacy | Complete — data never leaves your machine |
| Availability | Offline-capable once model is downloaded |
Common Questions
Q: Can I use llama.cpp with older hardware?
A: Yes, but it will be slow. CPU-only inference works on any machine. Consider smaller models like Phi-3 (2.3GB), which runs acceptably on older hardware.
Q: How do I update models?
A: Download the new GGUF file and restart the server with `-m /path/to/new-model.gguf`.
Q: Can I run multiple llama.cpp servers for load balancing?
A: Yes, start each on a different port and create separate CCS profiles pointing to each.
Q: Is llama.cpp compatible with Claude API features?
A: Basic chat only. Features like vision, web search, and extended thinking require Claude’s official API.
Storage Locations
| Path | Description |
|---|---|
| `~/Downloads/models/` | Default location for GGUF files |
| `~/.ccs/config.yaml` | CCS profile configuration |
| `llama.cpp/server` | Server binary after build |
Next Steps
API Profiles
Learn how to create and manage API profiles
Ollama Provider
Compare with Ollama (cloud/local models)
Models
Explore all available models and providers
Troubleshooting
Common issues and solutions
