# Llama.cpp Provider

> Local GGUF model inference via llama.cpp server for private, offline AI

Run GGUF-quantized LLaMA models locally via the llama.cpp server with zero API costs, complete privacy, and offline capability.

## Overview

| Field                    | Details                   |
| ------------------------ | ------------------------- |
| **Preset ID**            | `llamacpp`                |
| **Aliases**              | None                      |
| **Default Profile Name** | `llamacpp`                |
| **Default Model**        | `llama3-8b`               |
| **Base URL**             | `http://127.0.0.1:8080`   |
| **Auth Method**          | Local (no API key needed) |
| **Category**             | Recommended               |

## Quick Start

```bash
# 1. Start llama.cpp server (in separate terminal)
./server --host 0.0.0.0 --port 8080 -m /path/to/model.gguf

# 2. Create CCS profile
ccs api create --preset llamacpp
# or use the direct shortcut (creates profile automatically if needed)
ccs llamacpp "explain quantum computing"

# 3. Use the profile
ccs llamacpp "explain quantum computing"
```

## Prerequisites

### Installing llama.cpp

Clone the repository:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

**On macOS (with Metal acceleration):**

```bash
make clean
make -j LLAMA_METAL=1
```

**On Linux (with CUDA):**

```bash
make clean
make -j LLAMA_CUDA=1
```

**On Windows (CPU only):**

```bash
make clean
make -j
```

Download a GGUF-quantized model from [Hugging Face](https://huggingface.co/models?search=gguf):

```bash
# Example: LLaMA 3 8B (Q4_0 quantization - 4-5GB)
wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf

# Or: Phi-3 (smaller, faster)
wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-Q4.gguf
```

Q4\_0 quantization (4-bit) offers the best balance of speed and quality. Q5\_K\_M gives higher quality but is slower.

Verify the build:

```bash
./server --help | head -20
# Should show server options
```

### Model Selection

Popular GGUF models for coding and general tasks:

| Model            | Size  | Speed     | Quality            | RAM   | Context |
| ---------------- | ----- | --------- | ------------------ | ----- | ------- |
| Phi-3 Mini (Q4)  | 2.3GB | Very Fast | Good               | 4GB   | 4K      |
| LLaMA 3 8B (Q4)  | 4.7GB | Fast      | Good               | 8GB   | 8K      |
| LLaMA 3 70B (Q4) | 40GB  | Slow      | Excellent          | 48GB+ | 8K      |
| Mistral 7B (Q4)  | 4.2GB | Very Fast | Good               | 8GB   | 8K      |
| Qwen3 Coder (Q4) | 5.2GB | Fast      | Excellent (coding) | 8GB   | 128K    |

**Recommended for coding:** Qwen3 Coder, LLaMA 3 8B, or Phi-3

## Configuration

### Default Setup

```bash
# Create profile with defaults (localhost:8080)
ccs api create --preset llamacpp

# Profile is created with:
#   Base URL: http://127.0.0.1:8080
#   Model: llama3-8b
#   No API key required
```

### Custom Configuration

If the llama.cpp server runs on a different host or port:

```bash
# Via CLI
ccs api create --preset llamacpp --base-url http://192.168.1.100:8080

# Or manual config in ~/.ccs/config.yaml
profiles:
  llamacpp-custom:
    env:
      ANTHROPIC_BASE_URL: "http://192.168.1.100:8080"
      ANTHROPIC_AUTH_TOKEN: "llamacpp"
      ANTHROPIC_MODEL: "llama3-8b"
      ANTHROPIC_DEFAULT_OPUS_MODEL: "llama3-70b"
      ANTHROPIC_DEFAULT_SONNET_MODEL: "llama3-8b"
      ANTHROPIC_DEFAULT_HAIKU_MODEL: "phi3-mini"
```

## Starting llama.cpp Server

### Basic Usage

```bash
# Start with a specific model
./server -m /path/to/model.gguf

# With default CCS settings (accessible on localhost:8080)
./server -m /path/to/model.gguf --host 0.0.0.0 --port 8080
```

### Performance Tuning

```bash
# Use GPU acceleration (Metal on macOS, CUDA on Linux)
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99  # Offload all layers to GPU

# Use specific number of threads
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -t 8  # Use 8 threads

# Limit context to reduce memory usage
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 2048  # Limit to 2K context (default 2048)
```

### Running Multiple Models

```bash
# Terminal 1: Primary model on :8080
./server -m /path/to/qwen3-coder.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Terminal 2: Alternative model on :8081
./server -m /path/to/llama3-8b.gguf --host 0.0.0.0 --port 8081 -ngl 99

# Terminal 3: Use CCS with either
ccs llamacpp "coding task"  # Uses 8080
ANTHROPIC_BASE_URL=http://localhost:8081 ccs llamacpp "analysis task"
```

## Usage Examples

### Basic Chat

```bash
# Use default model
ccs llamacpp "explain this code"

# Check which model is running
ccs llamacpp "what model are you?"
```

### Model-Specific Usage

```bash
# Override model for this request
ANTHROPIC_MODEL=llama3-70b ccs llamacpp "complex system design"

# Use different quantization variant
ANTHROPIC_MODEL=qwen3-coder:q5 ccs llamacpp "debug this issue"
```

### Streaming and Output Control

```bash
# Default streaming response
ccs llamacpp "write a function"

# Set temperature for creativity
ANTHROPIC_TEMPERATURE=0.8 ccs llamacpp "brainstorm ideas"

# Limit response length
ANTHROPIC_MAX_TOKENS=500 ccs llamacpp "summarize this"
```

## Troubleshooting

### Connection Refused

**Symptom:** `Error: connect ECONNREFUSED 127.0.0.1:8080`

**Causes & Solutions:**

1. **Server not running:** Start the llama.cpp server in a separate terminal
2. **Wrong port:** Check that the server is on 8080, or update ANTHROPIC\_BASE\_URL
3. **Firewall blocked:** Allow localhost connections

```bash
# Verify server is listening
netstat -an | grep 8080
# or
curl http://127.0.0.1:8080/health
```

### Model Not Found or Slow Response

**Symptom:** Invalid model or very slow responses

**Solutions:**

```bash
# Check available models on server
curl http://127.0.0.1:8080/models

# Switch to faster model
ANTHROPIC_MODEL=phi3-mini ccs llamacpp "quick task"

# Restart server with smaller model
./server -m /path/to/smaller-model.gguf
```

### Out of Memory

**Symptom:** `VRAM out of memory` or process crash

**Solutions:**

1. **Reduce context:** Limit with `-c 1024` when starting the server
2. **Switch quantization:** Use Q4\_0 instead of Q5\_K\_M
3. **Reduce layers on GPU:** Use `-ngl 30` instead of `-ngl 99`
4. **Use smaller model:** Switch from the 70B to the 8B variant

```bash
# Start with conservative GPU settings
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 30 \
  -c 1024
```

### Port Already in Use

**Symptom:** `Address already in use`

**Solutions:**

```bash
# Find what's using port 8080
lsof -i :8080

# Kill the process (if it's an old llama.cpp instance),
# using the PID reported by lsof
kill -9 <PID>

# Or use a different port
./server -m /path/to/model.gguf --host 0.0.0.0 --port 8081

# Then configure CCS
ANTHROPIC_BASE_URL=http://127.0.0.1:8081 ccs llamacpp "test"
```

## Performance Optimization

### GPU Acceleration

```bash
# macOS with Metal GPU
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Linux with CUDA
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Disable GPU (CPU only)
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 0
```

### Memory Management

| Setting   | Impact            | Use When                            |
| --------- | ----------------- | ----------------------------------- |
| `-ngl 99` | All layers on GPU | Plenty of VRAM (GPU-only inference) |
| `-ngl 30` | Partial offload   | Limited VRAM (mixed CPU/GPU)        |
| `-ngl 0`  | CPU only          | No GPU or testing                   |
| `-c 1024` | Small context     | Limited RAM (\<8GB)                 |
| `-c 2048` | Default context   | 8-16GB RAM                          |
| `-c 4096` | Large context     | 16GB+ RAM                           |

### Batch Size and Threads

```bash
# For faster inference with multiple users:
# more threads (-t 16), larger batch size for throughput (-b 512),
# GPU acceleration (-ngl 99)
./server -m model.gguf --host 0.0.0.0 --port 8080 \
  -t 16 \
  -b 512 \
  -ngl 99

# For single user, lower latency:
# fewer threads (-t 4), small batch for responsiveness (-b 128)
./server -m model.gguf --host 0.0.0.0 --port 8080 \
  -t 4 \
  -b 128 \
  -ngl 99
```

## Cost Analysis

| Factor           | Cost                                       |
| ---------------- | ------------------------------------------ |
| **API costs**    | \$0 (free)                                 |
| **Hardware**     | GPU (optional): \$200-800, or use CPU      |
| **Electricity**  | \~50-150W continuous (varies by model/GPU) |
| **Privacy**      | Complete: data never leaves your machine   |
| **Availability** | Offline-capable once model is downloaded   |

## Common Questions

**Q: Can I use llama.cpp with older hardware?**
A: Yes, but it will be slow. CPU-only inference works on any machine. Consider a smaller model such as Phi-3 Mini (2.3GB), which runs acceptably on older hardware.

**Q: How do I update models?**
A: Download the new GGUF file and restart the server with `-m /path/to/new-model.gguf`.

**Q: Can I run multiple llama.cpp servers for load balancing?**
A: Yes. Start each on a different port and create a separate CCS profile pointing to each.

**Q: Is llama.cpp compatible with Claude API features?**
A: Basic chat only. Features like vision, web search, and extended thinking require Claude's official API.

## Storage Locations

| Path                  | Description                     |
| --------------------- | ------------------------------- |
| `~/Downloads/models/` | Default location for GGUF files |
| `~/.ccs/config.yaml`  | CCS profile configuration       |
| `llama.cpp/server`    | Binary after build              |

## Next Steps

- Learn how to create and manage API profiles
- Compare with Ollama (cloud/local models)
- Explore all available models and providers
- Common issues and solutions
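
As a closing sketch, the context-size guidance from the Memory Management table can be turned into a small shell helper. The function name `suggest_ctx` is hypothetical and is not part of CCS or llama.cpp. It assumes the table's thresholds (under 8GB, 8-16GB, 16GB and up) and treats exactly 16GB as the large-context case, which the table leaves ambiguous. The script only prints the server command rather than running it.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of CCS or llama.cpp): map available RAM in
# whole GB to a -c value, following the Memory Management table.
# Assumption: exactly 16GB is treated as the "16GB+" row.
suggest_ctx() {
  local ram_gb="$1"
  if [ "$ram_gb" -lt 8 ]; then
    echo 1024   # Limited RAM (<8GB): small context
  elif [ "$ram_gb" -lt 16 ]; then
    echo 2048   # 8-16GB RAM: default context
  else
    echo 4096   # 16GB+ RAM: large context
  fi
}

# Print the resulting command instead of starting the server,
# so the sketch is safe to paste and inspect first.
echo "./server -m /path/to/model.gguf --host 0.0.0.0 --port 8080 -c $(suggest_ctx 12)"
```

Treat the returned values as starting points only: actual memory use also depends on model size, quantization, and how many layers are offloaded with `-ngl`.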