
Llama.cpp Provider

Run GGUF-quantized LLaMA models locally via llama.cpp server with zero API costs, complete privacy, and offline capability.

Overview

| Field | Details |
| --- | --- |
| Preset ID | llamacpp |
| Aliases | None |
| Default Profile Name | llamacpp |
| Default Model | llama3-8b |
| Base URL | http://127.0.0.1:8080 |
| Auth Method | Local (no API key needed) |
| Category | Recommended |

Quick Start

# 1. Start llama.cpp server (in separate terminal)
./server --host 0.0.0.0 --port 8080 -m /path/to/model.gguf

# 2. Create CCS profile (in another terminal)
ccs api create --preset llamacpp

# 3. Use the profile
ccs llamacpp "explain quantum computing"
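
Step 2 fails if the server from step 1 is not yet listening. A small sketch for scripting this startup order (the `wait_for` helper is hypothetical, not part of CCS or llama.cpp):

```shell
# Hypothetical helper: retry a command until it succeeds or tries run out
wait_for() {
  tries=$1; shift
  n=0
  while ! "$@" >/dev/null 2>&1; do
    n=$((n + 1))
    [ "$n" -ge "$tries" ] && return 1
    sleep 1
  done
  return 0
}

# Usage (assumes the server on the default port):
#   wait_for 30 curl -sf http://127.0.0.1:8080/health && ccs api create --preset llamacpp
```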

Prerequisites

Installing llama.cpp

Step 1: Clone Repository

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Step 2: Build Server

On macOS (with Metal acceleration):

make clean
make -j LLAMA_METAL=1

On Linux (with CUDA):

make clean
make -j LLAMA_CUDA=1

On Windows (CPU only):

make clean
make -j

Step 3: Download Model

Download a GGUF-quantized model from Hugging Face:
# Example: LLaMA 3 8B (Q4_0 quantization - 4-5GB)
wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf

# Or: Phi-3 (smaller, faster)
wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-Q4.gguf

Q4_0 quantization (4-bit) offers the best balance of speed and quality; Q5_K_M is higher quality but larger and slower.
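
As a rule of thumb, a quantized model's file size scales with parameter count times bits per weight. A rough sketch of that arithmetic (illustrative only; real GGUF files carry some metadata overhead and mixed-precision layers):

```shell
# Rough GGUF size estimate: billions of params * bits per weight / 8 = GB
gguf_size_gb() {
  echo $(( $1 * $2 / 8 ))
}

gguf_size_gb 8 4    # 8B model at Q4: prints 4 (~4 GB)
gguf_size_gb 70 4   # 70B model at Q4: prints 35 (~35 GB)
gguf_size_gb 8 5    # 8B model at Q5: prints 5 (~5 GB)
```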

Step 4: Verify Installation

./server --help | head -20
# Should show server options

Model Selection

Popular GGUF models for coding and general tasks:
| Model | Size | Speed | Quality | RAM | Context |
| --- | --- | --- | --- | --- | --- |
| Phi-3 Mini (Q4) | 2.3GB | Very Fast | Good | 4GB | 4K |
| LLaMA 3 8B (Q4) | 4.7GB | Fast | Good | 8GB | 8K |
| LLaMA 3 70B (Q4) | 40GB | Slow | Excellent | 48GB+ | 8K |
| Mistral 7B (Q4) | 4.2GB | Very Fast | Good | 8GB | 8K |
| Qwen3 Coder (Q4) | 5.2GB | Fast | Excellent (coding) | 8GB | 128K |

Recommended for coding: Qwen3 Coder, LLaMA 3 8B, or Phi-3

Configuration

Default Setup

# Create profile with defaults (localhost:8080)
ccs api create --preset llamacpp

# Profile is created with:
# Base URL: http://127.0.0.1:8080
# Model: llama3-8b
# No API key required

Custom Configuration

If llama.cpp server runs on different host/port:
# Via CLI
ccs api create --preset llamacpp --base-url http://192.168.1.100:8080

# Or manual config in ~/.ccs/config.yaml
profiles:
  llamacpp-custom:
    env:
      ANTHROPIC_BASE_URL: "http://192.168.1.100:8080"
      ANTHROPIC_AUTH_TOKEN: "llamacpp"
      ANTHROPIC_MODEL: "llama3-8b"
      ANTHROPIC_DEFAULT_OPUS_MODEL: "llama3-70b"
      ANTHROPIC_DEFAULT_SONNET_MODEL: "llama3-8b"
      ANTHROPIC_DEFAULT_HAIKU_MODEL: "phi3-mini"

Starting llama.cpp Server

Basic Usage

# Start with a specific model
./server -m /path/to/model.gguf

# With default CCS settings (accessible on localhost:8080)
./server -m /path/to/model.gguf --host 0.0.0.0 --port 8080

Performance Tuning

# Use GPU acceleration (Metal on macOS, CUDA on Linux)
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99  # Offload all layers to GPU

# Use specific number of threads
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -t 8  # Use 8 threads

# Limit context to reduce memory usage
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 2048  # Cap the context window at 2048 tokens
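
Why -c matters: the KV cache grows linearly with context length. A back-of-the-envelope sketch (the layer count and per-layer KV dimension below are illustrative for an 8B GQA model; read your model's actual values from its GGUF metadata):

```shell
# KV cache size: 2 tensors (K and V) * n_layers * kv_dim * 2 bytes (fp16) * n_ctx
kv_cache_mb() {
  layers=$1; kv_dim=$2; ctx=$3
  echo $(( 2 * layers * kv_dim * 2 * ctx / 1048576 ))
}

kv_cache_mb 32 1024 2048   # assumed 8B-class model at 2K context: prints 256 (MB)
kv_cache_mb 32 1024 8192   # same model at 8K context: prints 1024 (MB)
```

Quadrupling the context quadruples the cache, which is why trimming -c is the first lever when memory is tight.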

Running Multiple Models

# Terminal 1: Primary model on :8080
./server -m /path/to/qwen3-coder.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Terminal 2: Alternative model on :8081
./server -m /path/to/llama3-8b.gguf --host 0.0.0.0 --port 8081 -ngl 99

# Terminal 3: Use CCS with either
ccs llamacpp "coding task"  # Uses 8080
ANTHROPIC_BASE_URL=http://localhost:8081 ccs llamacpp "analysis task"
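
Instead of overriding ANTHROPIC_BASE_URL per call, each server can get its own profile in ~/.ccs/config.yaml. A sketch (the profile names here are made up; the env keys mirror the Custom Configuration example above):

```yaml
profiles:
  llamacpp-coder:            # Qwen3 Coder on :8080
    env:
      ANTHROPIC_BASE_URL: "http://127.0.0.1:8080"
      ANTHROPIC_AUTH_TOKEN: "llamacpp"
      ANTHROPIC_MODEL: "qwen3-coder"
  llamacpp-llama:            # LLaMA 3 8B on :8081
    env:
      ANTHROPIC_BASE_URL: "http://127.0.0.1:8081"
      ANTHROPIC_AUTH_TOKEN: "llamacpp"
      ANTHROPIC_MODEL: "llama3-8b"
```

Then `ccs llamacpp-coder "coding task"` and `ccs llamacpp-llama "analysis task"` select a backend without touching environment variables.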

Usage Examples

Basic Chat

# Use default model
ccs llamacpp "explain this code"

# Check which model is running
ccs llamacpp "what model are you?"

Model-Specific Usage

# Override model for this request
ANTHROPIC_MODEL=llama3-70b ccs llamacpp "complex system design"

# Use different quantization variant
ANTHROPIC_MODEL=qwen3-coder:q5 ccs llamacpp "debug this issue"

Streaming and Output Control

# Default streaming response
ccs llamacpp "write a function"

# Set temperature for creativity
ANTHROPIC_TEMPERATURE=0.8 ccs llamacpp "brainstorm ideas"

# Limit response length
ANTHROPIC_MAX_TOKENS=500 ccs llamacpp "summarize this"

Troubleshooting

Connection Refused

Symptom: Error: connect ECONNREFUSED 127.0.0.1:8080

Causes & Solutions:
  1. Server not running — Start llama.cpp server in separate terminal
  2. Wrong port — Check server is on 8080 or update ANTHROPIC_BASE_URL
  3. Firewall blocked — Allow localhost connections
# Verify server is listening
netstat -an | grep 8080
# or
curl http://127.0.0.1:8080/health

Model Not Found or Slow Response

Symptom: Invalid model or very slow responses

Solutions:
# Check available models on server
curl http://127.0.0.1:8080/v1/models

# Switch to faster model
ANTHROPIC_MODEL=phi3-mini ccs llamacpp "quick task"

# Restart server with smaller model
./server -m /path/to/smaller-model.gguf

Out of Memory

Symptom: VRAM out of memory or process crash

Solutions:
  1. Reduce context — Limit with -c 1024 when starting server
  2. Switch quantization — Use Q4_0 instead of Q5_K_M
  3. Reduce layers on GPU — Use -ngl 30 instead of -ngl 99
  4. Use smaller model — Switch from 70B to 8B variant
# Start with conservative GPU settings
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 30 \
  -c 1024
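
The -ngl trade-off can be estimated: roughly, the offloaded fraction of layers times the model's file size lands in VRAM. A sketch of that arithmetic (approximate; real usage adds the KV cache and compute buffers on top):

```shell
# Approximate VRAM for partial offload: model_gb * ngl / total_layers
offload_vram_gb() {
  model_gb=$1; ngl=$2; total_layers=$3
  echo $(( model_gb * ngl / total_layers ))
}

offload_vram_gb 40 30 80   # 70B Q4 (~40GB file), 30 of 80 layers: prints 15 (GB)
offload_vram_gb 5 32 32    # 8B Q4, all 32 layers offloaded: prints 5 (GB)
```

Values of -ngl above the model's layer count (like the conventional -ngl 99) are simply clamped to "all layers".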

Port Already in Use

Symptom: Address already in use

Solutions:
# Find what's using port 8080
lsof -i :8080

# Kill the process (if it's old llama.cpp instance)
kill -9 <PID>

# Or use a different port
./server -m /path/to/model.gguf --host 0.0.0.0 --port 8081

# Then configure CCS
ANTHROPIC_BASE_URL=http://127.0.0.1:8081 ccs llamacpp "test"

Performance Optimization

GPU Acceleration

# macOS with Metal GPU
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Linux with CUDA
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Disable GPU (CPU only)
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 0

Memory Management

| Setting | Impact | Use When |
| --- | --- | --- |
| -ngl 99 | All layers on GPU | Plenty of VRAM (GPU-only inference) |
| -ngl 30 | Partial offload | Limited VRAM (mixed CPU/GPU) |
| -ngl 0 | CPU only | No GPU or testing |
| -c 1024 | Small context | Limited RAM (<8GB) |
| -c 2048 | Default context | 8-16GB RAM |
| -c 4096 | Large context | 16GB+ RAM |

Batch Size and Threads

# For faster inference with multiple users:
#   -t 16    use more threads
#   -b 512   larger batch size for throughput
#   -ngl 99  GPU acceleration
./server -m model.gguf --host 0.0.0.0 --port 8080 \
  -t 16 -b 512 -ngl 99

# For a single user, lower latency:
#   -t 4     fewer threads
#   -b 128   small batch for responsiveness
./server -m model.gguf --host 0.0.0.0 --port 8080 \
  -t 4 -b 128 -ngl 99

Cost Analysis

| Factor | Cost |
| --- | --- |
| API costs | $0 (free) |
| Hardware | GPU (optional): $200-800, or use CPU |
| Electricity | ~50-150W continuous (varies by model/GPU) |
| Privacy | Complete (data never leaves your machine) |
| Availability | Offline-capable once model is downloaded |

Common Questions

Q: Can I use llama.cpp with older hardware?
A: Yes, but it will be slow. CPU-only inference works on any machine. Consider smaller models like Phi-3 (2.3GB), which runs acceptably on older hardware.

Q: How do I update models?
A: Download new GGUF files and restart the server with -m /path/to/new-model.gguf.

Q: Can I run multiple llama.cpp servers for load balancing?
A: Yes, start each on a different port and create separate CCS profiles pointing to each.

Q: Is llama.cpp compatible with Claude API features?
A: Basic chat only. Features like vision, web search, and extended thinking require Claude's official API.
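
The load-balancing answer can also be scripted: alternate ANTHROPIC_BASE_URL between two running servers. A minimal round-robin sketch (the `next_backend` helper and the port layout are assumptions, not a CCS feature):

```shell
# Round-robin between two local llama.cpp servers on :8080 and :8081
i=0
next_backend() {
  BACKEND="http://127.0.0.1:$((8080 + i % 2))"
  i=$((i + 1))
}

# Usage:
#   next_backend
#   ANTHROPIC_BASE_URL=$BACKEND ccs llamacpp "task"
```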

Storage Locations

| Path | Description |
| --- | --- |
| ~/Downloads/models/ | Default location for GGUF files |
| ~/.ccs/config.yaml | CCS profile configuration |
| llama.cpp/server | Binary after build |

Next Steps

API Profiles

Learn how to create and manage API profiles

Ollama Provider

Compare with Ollama (cloud/local models)

Models

Explore all available models and providers

Troubleshooting

Common issues and solutions