# Llama.cpp Provider

> Local GGUF model inference via llama.cpp server for private, offline AI

Run GGUF-quantized LLaMA models locally via llama.cpp server with zero API costs, complete privacy, and offline capability.

## Overview

| Field                    | Details                   |
| ------------------------ | ------------------------- |
| **Preset ID**            | `llamacpp`                |
| **Aliases**              | None                      |
| **Default Profile Name** | `llamacpp`                |
| **Default Model**        | `llama3-8b`               |
| **Base URL**             | `http://127.0.0.1:8080`   |
| **Auth Method**          | Local (no API key needed) |
| **Category**             | Recommended               |

## Quick Start

```bash theme={null}
# 1. Start the llama.cpp server (in a separate terminal)
./server --host 0.0.0.0 --port 8080 -m /path/to/model.gguf

# 2. Create a CCS profile (optional: the shortcut in step 3 creates it automatically if needed)
ccs api create --preset llamacpp

# 3. Use the profile
ccs llamacpp "explain quantum computing"
```
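
If step 3 runs before the server has finished loading the model, the request fails with a connection error. The following is a minimal sketch of a readiness check, assuming `curl` is installed; `wait_for_server` is a hypothetical helper (not part of CCS or llama.cpp) that polls the same `/health` endpoint used in the troubleshooting section.

```bash
# wait_for_server BASE_URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: returns 0 once BASE_URL/health answers successfully,
# or 1 when all attempts have failed.
wait_for_server() {
  base_url=$1
  attempts=${2:-30}   # how many times to probe
  delay=${3:-1}       # seconds to wait between probes
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP error codes, so "not ready yet" keeps polling
    if curl -fsS --max-time 2 "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt follows
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (requires a running server):
#   wait_for_server http://127.0.0.1:8080 && ccs llamacpp "explain quantum computing"
```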

## Prerequisites

### Installing llama.cpp

<Steps>
  <Step title="Clone Repository">
    ```bash theme={null}
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    ```
  </Step>

  <Step title="Build Server">
    **On macOS (with Metal acceleration):**

    ```bash theme={null}
    make clean
    make -j LLAMA_METAL=1
    ```

    **On Linux (with CUDA):**

    ```bash theme={null}
    make clean
    make -j LLAMA_CUDA=1
    ```

    **On Windows (CPU only):**

    ```bash theme={null}
    make clean
    make -j
    ```
  </Step>

  <Step title="Download Model">
    Download a GGUF-quantized model from [Hugging Face](https://huggingface.co/models?search=gguf):

    ```bash theme={null}
    # Example: LLaMA 3 8B (Q4_0 quantization - 4-5GB)
    wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf

    # Or: Phi-3 (smaller, faster)
    wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
    ```

    <Tip>
      Q4\_0 quantization (4-bit) offers the best balance of speed and quality. Q5\_K\_M gives higher quality but is larger and slower.
    </Tip>
  </Step>

  <Step title="Verify Installation">
    ```bash theme={null}
    ./server --help | head -20
    # Should show server options
    ```
  </Step>
</Steps>
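
An interrupted download, or an HTML error page saved in place of the model, produces a file the server cannot load. As a quick sanity check after the download step, the sketch below tests for the 4-byte ASCII magic `GGUF` at the start of the file. `is_gguf` is a hypothetical helper, and passing the check does not prove the rest of the file is intact.

```bash
# is_gguf FILE
# Hypothetical helper: succeeds when FILE exists and starts with the
# 4-byte magic "GGUF". It does not validate the rest of the file.
is_gguf() {
  [ -f "$1" ] || return 1
  # Read the first four bytes; strip NUL bytes so the shell can hold the value
  magic=$(head -c 4 "$1" | tr -d '\000')
  [ "$magic" = "GGUF" ]
}

# Usage:
#   is_gguf Meta-Llama-3-8B-Instruct.Q4_0.gguf && echo "looks like a GGUF file"
```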

### Model Selection

Popular GGUF models for coding and general tasks:

| Model            | Size  | Speed     | Quality            | RAM   | Context |
| ---------------- | ----- | --------- | ------------------ | ----- | ------- |
| Phi-3 Mini (Q4)  | 2.3GB | Very Fast | Good               | 4GB   | 4K      |
| LLaMA 3 8B (Q4)  | 4.7GB | Fast      | Good               | 8GB   | 8K      |
| LLaMA 3 70B (Q4) | 40GB  | Slow      | Excellent          | 48GB+ | 8K      |
| Mistral 7B (Q4)  | 4.2GB | Very Fast | Good               | 8GB   | 8K      |
| Qwen3 Coder (Q4) | 5.2GB | Fast      | Excellent (coding) | 8GB   | 128K    |

**Recommended for coding:** Qwen3 Coder, LLaMA 3 8B, or Phi-3
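
The RAM column of the table can be turned into a simple selection rule. The sketch below is illustrative only: `suggest_model` is a hypothetical helper, the thresholds are copied from the table, and it prefers the coding-oriented model wherever two rows need the same amount of RAM.

```bash
# suggest_model RAM_GB
# Hypothetical helper: prints the largest model from the table above that
# fits in RAM_GB (a whole number of gigabytes), or fails if none fits.
suggest_model() {
  ram_gb=$1
  if [ "$ram_gb" -ge 48 ]; then
    echo "LLaMA 3 70B (Q4)"
  elif [ "$ram_gb" -ge 8 ]; then
    echo "Qwen3 Coder (Q4)"
  elif [ "$ram_gb" -ge 4 ]; then
    echo "Phi-3 Mini (Q4)"
  else
    echo "no model in the table fits ${ram_gb}GB of RAM" >&2
    return 1
  fi
}

suggest_model 16   # prints: Qwen3 Coder (Q4)
```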

## Configuration

### Default Setup

```bash theme={null}
# Create profile with defaults (localhost:8080)
ccs api create --preset llamacpp

# Profile is created with:
# Base URL: http://127.0.0.1:8080
# Model: llama3-8b
# No API key required
```

### Custom Configuration

If the llama.cpp server runs on a different host or port:

```bash theme={null}
# Via CLI
ccs api create --preset llamacpp --base-url http://192.168.1.100:8080
```

Or configure the profile manually in `~/.ccs/config.yaml`:

```yaml theme={null}
profiles:
  llamacpp-custom:
    env:
      ANTHROPIC_BASE_URL: "http://192.168.1.100:8080"
      ANTHROPIC_AUTH_TOKEN: "llamacpp"
      ANTHROPIC_MODEL: "llama3-8b"
      ANTHROPIC_DEFAULT_OPUS_MODEL: "llama3-70b"
      ANTHROPIC_DEFAULT_SONNET_MODEL: "llama3-8b"
      ANTHROPIC_DEFAULT_HAIKU_MODEL: "phi3-mini"
```
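
When several machines or ports are involved, generating the profile text avoids copy-paste mistakes. The sketch below prints a profile in the same shape as the manual config example. `render_profile` is a hypothetical helper, the key names are copied from that example rather than validated against CCS, and the output is meant to be reviewed and merged into `~/.ccs/config.yaml` by hand.

```bash
# render_profile NAME BASE_URL [MODEL]
# Hypothetical helper: prints a CCS profile as YAML. The structure mirrors
# the manual config example and is an assumption, not a CCS API.
render_profile() {
  name=$1
  base_url=$2
  model=${3:-llama3-8b}
  cat <<EOF
profiles:
  $name:
    env:
      ANTHROPIC_BASE_URL: "$base_url"
      ANTHROPIC_AUTH_TOKEN: "llamacpp"
      ANTHROPIC_MODEL: "$model"
EOF
}

# Print a profile for a server on another machine
render_profile llamacpp-custom http://192.168.1.100:8080
```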

## Starting llama.cpp Server

### Basic Usage

```bash theme={null}
# Start with a specific model
./server -m /path/to/model.gguf

# Match the CCS default port 8080 (--host 0.0.0.0 listens on all network interfaces; use --host 127.0.0.1 for local-only access)
./server -m /path/to/model.gguf --host 0.0.0.0 --port 8080
```

### Performance Tuning

```bash theme={null}
# Use GPU acceleration (Metal on macOS, CUDA on Linux)
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99  # Offload all layers to GPU

# Use specific number of threads
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -t 8  # Use 8 threads

# Limit context to reduce memory usage
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 2048  # Limit to 2K context (default 2048)
```
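
The `-t 8` value above is only an example; a sensible thread count depends on the machine. As a starting point to tune from, the sketch below reads the number of online CPUs with `getconf` and falls back to 4 when it cannot be detected. `default_threads` is a hypothetical helper, and the fallback value is an arbitrary assumption.

```bash
# default_threads
# Hypothetical helper: prints the number of online CPUs, or 4 if the
# count cannot be determined.
default_threads() {
  n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=""
  case $n in
    ''|*[!0-9]*) n=4 ;;   # empty or non-numeric: use the fallback
  esac
  echo "$n"
}

# Print (rather than run) a server command using the detected count
echo "./server -m /path/to/model.gguf --host 127.0.0.1 --port 8080 -t $(default_threads)"
```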

### Running Multiple Models

```bash theme={null}
# Terminal 1: Primary model on :8080
./server -m /path/to/qwen3-coder.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Terminal 2: Alternative model on :8081
./server -m /path/to/llama3-8b.gguf --host 0.0.0.0 --port 8081 -ngl 99

# Terminal 3: Use CCS with either
ccs llamacpp "coding task"  # Uses 8080
ANTHROPIC_BASE_URL=http://localhost:8081 ccs llamacpp "analysis task"
```
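
With two servers running, the per-command `ANTHROPIC_BASE_URL` override can be wrapped in a small lookup so the port for each task is written down in one place. This is a sketch under the two-server layout above; `base_url_for` and the task labels `coding` and `analysis` are hypothetical.

```bash
# base_url_for TASK
# Hypothetical helper: maps a task label to the server hosting the matching
# model. Ports follow the two-server example above.
base_url_for() {
  case $1 in
    coding)   echo "http://127.0.0.1:8080" ;;   # qwen3-coder
    analysis) echo "http://127.0.0.1:8081" ;;   # llama3-8b
    *)        echo "unknown task: $1" >&2; return 1 ;;
  esac
}

# Usage (requires both servers to be running):
#   ANTHROPIC_BASE_URL=$(base_url_for analysis) ccs llamacpp "analysis task"
```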

## Usage Examples

### Basic Chat

```bash theme={null}
# Use default model
ccs llamacpp "explain this code"

# Check which model is running
ccs llamacpp "what model are you?"
```

### Model-Specific Usage

```bash theme={null}
# Override model for this request
ANTHROPIC_MODEL=llama3-70b ccs llamacpp "complex system design"

# Use different quantization variant
ANTHROPIC_MODEL=qwen3-coder:q5 ccs llamacpp "debug this issue"
```

### Streaming and Output Control

```bash theme={null}
# Default streaming response
ccs llamacpp "write a function"

# Set temperature for creativity
ANTHROPIC_TEMPERATURE=0.8 ccs llamacpp "brainstorm ideas"

# Limit response length
ANTHROPIC_MAX_TOKENS=500 ccs llamacpp "summarize this"
```

## Troubleshooting

### Connection Refused

**Symptom:** `Error: connect ECONNREFUSED 127.0.0.1:8080`

**Causes & Solutions:**

1. **Server not running** — Start llama.cpp server in separate terminal
2. **Wrong port** — Check server is on 8080 or update ANTHROPIC\_BASE\_URL
3. **Firewall blocked** — Allow localhost connections

```bash theme={null}
# Verify server is listening
netstat -an | grep 8080
# or
curl http://127.0.0.1:8080/health
```
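
For the wrong-port case, it helps to check the port CCS is actually configured for rather than assuming 8080. The sketch below extracts the port from a base URL using shell parameter expansion. `port_from_url` is a hypothetical helper that handles plain `http(s)://host[:port][/path]` URLs only (no IPv6 literals or credentials).

```bash
# port_from_url URL
# Hypothetical helper: prints the port of an http(s) URL, defaulting to
# 80 for http and 443 for https when no port is given.
port_from_url() {
  url=$1
  case $url in
    https://*) default=443 ;;
    *)         default=80 ;;
  esac
  hostport=${url#*://}       # drop the scheme
  hostport=${hostport%%/*}   # drop any path
  case $hostport in
    *:*) echo "${hostport##*:}" ;;
    *)   echo "$default" ;;
  esac
}

# Report the port CCS would use, falling back to the preset default URL
port=$(port_from_url "${ANTHROPIC_BASE_URL:-http://127.0.0.1:8080}")
echo "CCS expects the server on port $port"
```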

### Model Not Found or Slow Response

**Symptom:** Invalid model or very slow responses

**Solutions:**

```bash theme={null}
# Check available models on server
curl http://127.0.0.1:8080/models

# Switch to faster model
ANTHROPIC_MODEL=phi3-mini ccs llamacpp "quick task"

# Restart server with smaller model
./server -m /path/to/smaller-model.gguf
```

### Out of Memory

**Symptom:** `VRAM out of memory` or process crash

**Solutions:**

1. **Reduce context** — Limit with `-c 1024` when starting server
2. **Switch quantization** — Use Q4\_0 instead of Q5\_K\_M
3. **Reduce layers on GPU** — Use `-ngl 30` instead of `-ngl 99`
4. **Use smaller model** — Switch from 70B to 8B variant

```bash theme={null}
# Start with conservative GPU settings
./server -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 30 \
  -c 1024
```

### Port Already in Use

**Symptom:** `Address already in use`

**Solutions:**

```bash theme={null}
# Find what's using port 8080
lsof -i :8080

# Kill the process (if it's old llama.cpp instance)
kill -9 <PID>

# Or use a different port
./server -m /path/to/model.gguf --host 0.0.0.0 --port 8081

# Then configure CCS
ANTHROPIC_BASE_URL=http://127.0.0.1:8081 ccs llamacpp "test"
```
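
Instead of picking the next port by hand, the search can be scripted. The sketch below walks upward from a starting port until a probe reports the port as free. Both function names are hypothetical; the default probe relies on `lsof`, and if `lsof` is missing or cannot see the owning process, a busy port will look free.

```bash
# port_in_use PORT
# Hypothetical helper: succeeds when lsof reports a TCP listener on PORT.
# If lsof is not installed, every port appears free.
port_in_use() {
  lsof -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# first_free_port START [PROBE]
# Prints the first port >= START that PROBE reports as free, checking at
# most 20 ports. PROBE defaults to port_in_use.
first_free_port() {
  port=$1
  probe=${2:-port_in_use}
  last=$((port + 19))
  while [ "$port" -le "$last" ]; do
    if ! "$probe" "$port"; then
      echo "$port"
      return 0
    fi
    port=$((port + 1))
  done
  return 1
}

# Usage:
#   port=$(first_free_port 8080)
#   ./server -m /path/to/model.gguf --host 127.0.0.1 --port "$port"
```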

## Performance Optimization

### GPU Acceleration

```bash theme={null}
# macOS with Metal GPU
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Linux with CUDA
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99

# Disable GPU (CPU only)
./server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 0
```

### Memory Management

| Setting   | Impact            | Use When                            |
| --------- | ----------------- | ----------------------------------- |
| `-ngl 99` | All layers on GPU | Plenty of VRAM (GPU-only inference) |
| `-ngl 30` | Partial offload   | Limited VRAM (mixed CPU/GPU)        |
| `-ngl 0`  | CPU only          | No GPU or testing                   |
| `-c 1024` | Small context     | Limited RAM (\<8GB)                 |
| `-c 2048` | Default context   | 8-16GB RAM                          |
| `-c 4096` | Large context     | 16GB+ RAM                           |

### Batch Size and Threads

```bash theme={null}
# For faster inference with multiple users:
# more threads (-t), a larger batch size for throughput (-b), GPU acceleration (-ngl)
./server -m model.gguf --host 0.0.0.0 --port 8080 \
  -t 16 \
  -b 512 \
  -ngl 99

# For a single user and lower latency:
# fewer threads (-t), a smaller batch for responsiveness (-b)
./server -m model.gguf --host 0.0.0.0 --port 8080 \
  -t 4 \
  -b 128 \
  -ngl 99
```
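
The two tuning sets above can be kept as named presets so a launch script switches between them with one argument. This is a sketch only: `server_flags` and the preset names `throughput` and `latency` are hypothetical, and the flag values are simply the ones from the examples above.

```bash
# server_flags PRESET
# Hypothetical helper: prints the tuning flags for a named workload.
server_flags() {
  case $1 in
    throughput) echo "-t 16 -b 512 -ngl 99" ;;   # multiple users
    latency)    echo "-t 4 -b 128 -ngl 99" ;;    # single user
    *)          echo "usage: server_flags throughput|latency" >&2; return 1 ;;
  esac
}

# Print the full command for review instead of running it
echo "./server -m model.gguf --host 127.0.0.1 --port 8080 $(server_flags latency)"
```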

## Cost Analysis

| Factor           | Cost                                       |
| ---------------- | ------------------------------------------ |
| **API costs**    | \$0 (free)                                 |
| **Hardware**     | GPU (optional): \$200-800, or use CPU      |
| **Electricity**  | \~50-150W continuous (varies by model/GPU) |
| **Privacy**      | Complete — data never leaves your machine  |
| **Availability** | Offline-capable once model is downloaded   |
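
To put the electricity row in concrete terms, continuous draw converts to energy as watts × hours ÷ 1000. The sketch below computes 30-day usage for the two ends of the range in the table. Both function names are hypothetical, and the price per kWh passed to `monthly_cost` is whatever your utility charges (the 0.15 in the usage comment is a made-up figure).

```bash
# monthly_kwh WATTS [HOURS_PER_DAY]
# Energy used over 30 days, in kWh (defaults to running 24 hours a day).
monthly_kwh() {
  awk -v w="$1" -v h="${2:-24}" 'BEGIN { printf "%.1f\n", w * h * 30 / 1000 }'
}

# monthly_cost WATTS RATE_PER_KWH [HOURS_PER_DAY]
# Cost over 30 days in the currency of RATE_PER_KWH.
monthly_cost() {
  awk -v w="$1" -v r="$2" -v h="${3:-24}" \
    'BEGIN { printf "%.2f\n", w * h * 30 / 1000 * r }'
}

echo "50W continuous:  $(monthly_kwh 50) kWh per 30 days"    # 36.0
echo "150W continuous: $(monthly_kwh 150) kWh per 30 days"   # 108.0

# Usage with an assumed rate of 0.15 per kWh:
#   monthly_cost 150 0.15
```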

## Common Questions

**Q: Can I use llama.cpp with older hardware?**

A: Yes, but it will be slow. CPU-only inference works on any machine. Consider a smaller model such as Phi-3 Mini (2.3GB), which runs acceptably on older hardware.

**Q: How do I update models?**

A: Download the new GGUF file and restart the server with `-m /path/to/new-model.gguf`.

**Q: Can I run multiple llama.cpp servers for load balancing?**

A: Yes, start each on a different port and create separate CCS profiles pointing to each.

**Q: Is llama.cpp compatible with Claude API features?**

A: Basic chat only. Features like vision, web search, and extended thinking require Claude's official API.

## Storage Locations

| Path                  | Description                     |
| --------------------- | ------------------------------- |
| `~/Downloads/models/` | Default location for GGUF files |
| `~/.ccs/config.yaml`  | CCS profile configuration       |
| `llama.cpp/server`    | Binary after build              |

## Next Steps

<CardGroup cols={2}>
  <Card title="API Profiles" icon="server" href="/providers/concepts/api-profiles">
    Learn how to create and manage API profiles
  </Card>

  <Card title="Ollama Provider" icon="boxes" href="/providers/api/ollama">
    Compare with Ollama (cloud/local models)
  </Card>

  <Card title="Models and Providers" icon="sparkles" href="/providers/concepts/overview">
    Explore all available models and providers
  </Card>

  <Card title="Troubleshooting" icon="wrench" href="/reference/troubleshooting">
    Common issues and solutions
  </Card>
</CardGroup>
