
Llama.cpp Provider

Run GGUF-quantized LLaMA models locally via llama.cpp server with zero API costs, complete privacy, and offline capability.

Overview

| Field | Details |
| --- | --- |
| Preset ID | llamacpp |
| Aliases | None |
| Default Profile Name | llamacpp |
| Default Model | llama3-8b |
| Base URL | http://127.0.0.1:8080 |
| Auth Method | Local (no API key needed) |
| Category | Recommended |

Quick Start

# 1. Start llama.cpp server (in separate terminal)
./server --host 0.0.0.0 --port 8080 -m /path/to/model.gguf

# 2. Create CCS profile (in another terminal)
ccs api create --preset llamacpp

# 3. Use the profile
ccs llamacpp "explain quantum computing"
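
Step 2 fails if the server from step 1 is not yet listening. A small sketch for scripting this startup order (the `wait_for` helper is hypothetical, not part of CCS or llama.cpp):

```shell
# Hypothetical helper: retry a command until it succeeds or tries run out
wait_for() {
  tries=$1; shift
  n=0
  while ! "$@" >/dev/null 2>&1; do
    n=$((n + 1))
    [ "$n" -ge "$tries" ] && return 1
    sleep 1
  done
  return 0
}

# Usage (assumes the server on the default port):
#   wait_for 30 curl -sf http://127.0.0.1:8080/health && ccs api create --preset llamacpp
```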

Prerequisites

Installing llama.cpp

Step 1: Clone Repository

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Step 2: Build Server

On macOS (with Metal acceleration):

make clean
make -j LLAMA_METAL=1

On Linux (with CUDA):

make clean
make -j LLAMA_CUDA=1

On Windows (CPU only):

make clean
make -j

Step 3: Download Model

Download a GGUF-quantized model from Hugging Face:
# Example: LLaMA 3 8B (Q4_0 quantization - 4-5GB)
wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf

# Or: Phi-3 (smaller, faster)
wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-Q4.gguf

Q4_0 quantization (4-bit) offers the best balance of speed and quality; Q5_K_M is higher quality but larger and slower.
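
As a rule of thumb, a quantized model's file size scales with parameter count times bits per weight. A rough sketch of that arithmetic (illustrative only; real GGUF files carry some metadata overhead and mixed-precision layers):

```shell
# Rough GGUF size estimate: billions of params * bits per weight / 8 = GB
gguf_size_gb() {
  echo $(( $1 * $2 / 8 ))
}

gguf_size_gb 8 4    # 8B model at Q4: prints 4 (~4 GB)
gguf_size_gb 70 4   # 70B model at Q4: prints 35 (~35 GB)
gguf_size_gb 8 5    # 8B model at Q5: prints 5 (~5 GB)
```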

Step 4: Verify Installation

./server --help | head -20
# Should show server options

Model Selection

Popular GGUF models for coding and general tasks:
| Model | Size | Speed | Quality | RAM | Context |
| --- | --- | --- | --- | --- | --- |
| Phi-3 Mini (Q4) | 2.3GB | Very Fast | Good | 4GB | 4K |
| LLaMA 3 8B (Q4) | 4.7GB | Fast | Good | 8GB | 8K |
| LLaMA 3 70B (Q4) | 40GB | Slow | Excellent | 48GB+ | 8K |
| Mistral 7B (Q4) | 4.2GB | Very Fast | Good | 8GB | 8K |
| Qwen3 Coder (Q4) | 5.2GB | Fast | Excellent (coding) | 8GB | 128K |

Recommended for coding: Qwen3 Coder, LLaMA 3 8B, or Phi-3

Configuration

Default Setup

# Create profile with defaults (localhost:8080)
ccs api create --preset llamacpp

# Profile is created with:
# Base URL: http://127.0.0.1:8080
# Model: llama3-8b
# No API key required

Custom Configuration

If llama.cpp server runs on different host/port:
# Via CLI
ccs api create --preset llamacpp --base-url http://192.168.1.100:8080

# Or manual config in ~/.ccs/config.yaml
profiles:
  llamacpp-custom:
    env:
      ANTHROPIC_BASE_URL: "http://192.168.1.100:8080"
      ANTHROPIC_AUTH_TOKEN: "llamacpp"
      ANTHROPIC_MODEL: "llama3-8b"
      ANTHROPIC_DEFAULT_OPUS_MODEL: "llama3-70b"
      ANTHROPIC_DEFAULT_SONNET_MODEL: "llama3-8b"
      ANTHROPIC_DEFAULT_HAIKU_MODEL: "phi3-mini"

Starting llama.cpp Server

Basic Usage

# Start with a specific model
./server -m /path/to/model.gguf

# With default CCS settings (accessible on localhost:8080)
./server -m /path/to/model.gguf --host 0.0.0.0 --port 8080

Performance Tuning

# Use GPU acceleration (Metal on macOS, CUDA on Linux)
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99  # Offload all layers to GPU

# Use specific number of threads
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -t 8  # Use 8 threads

# Limit context to reduce memory usage
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 2048  # Cap the context window at 2048 tokens
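
Why -c matters: the KV cache grows linearly with context length. A back-of-the-envelope sketch (the layer count and per-layer KV dimension below are illustrative for an 8B GQA model; read your model's actual values from its GGUF metadata):

```shell
# KV cache size: 2 tensors (K and V) * n_layers * kv_dim * 2 bytes (fp16) * n_ctx
kv_cache_mb() {
  layers=$1; kv_dim=$2; ctx=$3
  echo $(( 2 * layers * kv_dim * 2 * ctx / 1048576 ))
}

kv_cache_mb 32 1024 2048   # assumed 8B-class model at 2K context: prints 256 (MB)
kv_cache_mb 32 1024 8192   # same model at 8K context: prints 1024 (MB)
```

Quadrupling the context quadruples the cache, which is why trimming -c is the first lever when memory is tight.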

Running Multiple Models

# Terminal 1: Primary model on :8080
./server -m /path/to/qwen3-coder.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Terminal 2: Alternative model on :8081
./server -m /path/to/llama3-8b.gguf --host 0.0.0.0 --port 8081 -ngl 99

# Terminal 3: Use CCS with either
ccs llamacpp "coding task"  # Uses 8080
ANTHROPIC_BASE_URL=http://localhost:8081 ccs llamacpp "analysis task"
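
Instead of overriding ANTHROPIC_BASE_URL per call, each server can get its own profile in ~/.ccs/config.yaml. A sketch (the profile names here are made up; the env keys mirror the Custom Configuration example above):

```yaml
profiles:
  llamacpp-coder:            # Qwen3 Coder on :8080
    env:
      ANTHROPIC_BASE_URL: "http://127.0.0.1:8080"
      ANTHROPIC_AUTH_TOKEN: "llamacpp"
      ANTHROPIC_MODEL: "qwen3-coder"
  llamacpp-llama:            # LLaMA 3 8B on :8081
    env:
      ANTHROPIC_BASE_URL: "http://127.0.0.1:8081"
      ANTHROPIC_AUTH_TOKEN: "llamacpp"
      ANTHROPIC_MODEL: "llama3-8b"
```

Then `ccs llamacpp-coder "coding task"` and `ccs llamacpp-llama "analysis task"` select a backend without touching environment variables.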

Usage Examples

Basic Chat

# Use default model
ccs llamacpp "explain this code"

# Check which model is running
ccs llamacpp "what model are you?"

Model-Specific Usage

# Override model for this request
ANTHROPIC_MODEL=llama3-70b ccs llamacpp "complex system design"

# Use different quantization variant
ANTHROPIC_MODEL=qwen3-coder:q5 ccs llamacpp "debug this issue"

Streaming and Output Control

# Default streaming response
ccs llamacpp "write a function"

# Set temperature for creativity
ANTHROPIC_TEMPERATURE=0.8 ccs llamacpp "brainstorm ideas"

# Limit response length
ANTHROPIC_MAX_TOKENS=500 ccs llamacpp "summarize this"

Troubleshooting

Connection Refused

Symptom: Error: connect ECONNREFUSED 127.0.0.1:8080

Causes & Solutions:
  1. Server not running — Start llama.cpp server in separate terminal
  2. Wrong port — Check server is on 8080 or update ANTHROPIC_BASE_URL
  3. Firewall blocked — Allow localhost connections
# Verify server is listening
netstat -an | grep 8080
# or
curl http://127.0.0.1:8080/health

Model Not Found or Slow Response

Symptom: Invalid model or very slow responses

Solutions:
# Check available models on server
curl http://127.0.0.1:8080/v1/models

# Switch to faster model
ANTHROPIC_MODEL=phi3-mini ccs llamacpp "quick task"

# Restart server with smaller model
./server -m /path/to/smaller-model.gguf

Out of Memory

Symptom: VRAM out of memory or process crash

Solutions:
  1. Reduce context — Limit with -c 1024 when starting server
  2. Switch quantization — Use Q4_0 instead of Q5_K_M
  3. Reduce layers on GPU — Use -ngl 30 instead of -ngl 99
  4. Use smaller model — Switch from 70B to 8B variant
# Start with conservative GPU settings
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 30 \
  -c 1024
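
The -ngl trade-off can be estimated: roughly, the offloaded fraction of layers times the model's file size lands in VRAM. A sketch of that arithmetic (approximate; real usage adds the KV cache and compute buffers on top):

```shell
# Approximate VRAM for partial offload: model_gb * ngl / total_layers
offload_vram_gb() {
  model_gb=$1; ngl=$2; total_layers=$3
  echo $(( model_gb * ngl / total_layers ))
}

offload_vram_gb 40 30 80   # 70B Q4 (~40GB file), 30 of 80 layers: prints 15 (GB)
offload_vram_gb 5 32 32    # 8B Q4, all 32 layers offloaded: prints 5 (GB)
```

Values of -ngl above the model's layer count (like the conventional -ngl 99) are simply clamped to "all layers".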

Port Already in Use

Symptom: Address already in use

Solutions:
# Find what's using port 8080
lsof -i :8080

# Kill the process (if it's old llama.cpp instance)
kill -9 <PID>

# Or use a different port
./server -m /path/to/model.gguf --host 0.0.0.0 --port 8081

# Then configure CCS
ANTHROPIC_BASE_URL=http://127.0.0.1:8081 ccs llamacpp "test"

Performance Optimization

GPU Acceleration

# macOS with Metal GPU
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Linux with CUDA
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Disable GPU (CPU only)
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 0

Memory Management

| Setting | Impact | Use When |
| --- | --- | --- |
| -ngl 99 | All layers on GPU | Plenty of VRAM (GPU-only inference) |
| -ngl 30 | Partial offload | Limited VRAM (mixed CPU/GPU) |
| -ngl 0 | CPU only | No GPU or testing |
| -c 1024 | Small context | Limited RAM (<8GB) |
| -c 2048 | Default context | 8-16GB RAM |
| -c 4096 | Large context | 16GB+ RAM |

Batch Size and Threads

# For faster inference with multiple users:
#   -t 16    use more threads
#   -b 512   larger batch size for throughput
#   -ngl 99  GPU acceleration
./server -m model.gguf --host 0.0.0.0 --port 8080 \
  -t 16 -b 512 -ngl 99

# For a single user, lower latency:
#   -t 4     fewer threads
#   -b 128   small batch for responsiveness
./server -m model.gguf --host 0.0.0.0 --port 8080 \
  -t 4 -b 128 -ngl 99

Cost Analysis

| Factor | Cost |
| --- | --- |
| API costs | $0 (free) |
| Hardware | GPU (optional): $200-800, or use CPU |
| Electricity | ~50-150W continuous (varies by model/GPU) |
| Privacy | Complete (data never leaves your machine) |
| Availability | Offline-capable once model is downloaded |

Common Questions

Q: Can I use llama.cpp with older hardware?
A: Yes, but it will be slow. CPU-only inference works on any machine. Consider smaller models like Phi-3 (2.3GB), which runs acceptably on older hardware.

Q: How do I update models?
A: Download new GGUF files and restart the server with -m /path/to/new-model.gguf.

Q: Can I run multiple llama.cpp servers for load balancing?
A: Yes, start each on a different port and create separate CCS profiles pointing to each.

Q: Is llama.cpp compatible with Claude API features?
A: Basic chat only. Features like vision, web search, and extended thinking require Claude's official API.
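
The load-balancing answer can also be scripted: alternate ANTHROPIC_BASE_URL between two running servers. A minimal round-robin sketch (the `next_backend` helper and the port layout are assumptions, not a CCS feature):

```shell
# Round-robin between two local llama.cpp servers on :8080 and :8081
i=0
next_backend() {
  BACKEND="http://127.0.0.1:$((8080 + i % 2))"
  i=$((i + 1))
}

# Usage:
#   next_backend
#   ANTHROPIC_BASE_URL=$BACKEND ccs llamacpp "task"
```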

Storage Locations

| Path | Description |
| --- | --- |
| ~/Downloads/models/ | Default location for GGUF files |
| ~/.ccs/config.yaml | CCS profile configuration |
| llama.cpp/server | Binary after build |

Next Steps

API Profiles

Learn how to create and manage API profiles

Ollama Provider

Compare with Ollama (cloud/local models)

Models

Explore all available models and providers

Troubleshooting

Common issues and solutions