```bash
# 1. Start llama.cpp server (in separate terminal)
./server --host 0.0.0.0 --port 8080 -m /path/to/model.gguf

# 2. Create CCS profile
ccs api create --preset llamacpp
# or use the direct shortcut (creates profile automatically if needed)
ccs llamacpp "explain quantum computing"

# 3. Use the profile
ccs llamacpp "explain quantum computing"
```
```bash
# Create profile with defaults (localhost:8080)
ccs api create --preset llamacpp

# Profile is created with:
#   Base URL: http://127.0.0.1:8080
#   Model:    llama3-8b
#   No API key required
```
```bash
# Start with a specific model
./server -m /path/to/model.gguf

# With default CCS settings (accessible on localhost:8080)
./server -m /path/to/model.gguf --host 0.0.0.0 --port 8080
```
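Once the server is up, it helps to confirm it is reachable before pointing CCS at it. A minimal sketch, assuming the server exposes llama.cpp's standard `/health` endpoint and OpenAI-compatible chat completions on the default port:

```bash
# Check the server is alive (llama.cpp's built-in health endpoint)
curl http://127.0.0.1:8080/health

# Send a minimal OpenAI-compatible chat request to verify inference works
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'
```

If both calls succeed, the CCS profile should work without further server-side changes.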
```bash
# Override model for this request
ANTHROPIC_MODEL=llama3-70b ccs llamacpp "complex system design"

# Use different quantization variant
ANTHROPIC_MODEL=qwen3-coder:q5 ccs llamacpp "debug this issue"
```
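To keep an override for an entire shell session rather than a single command, you can export the variable instead of prefixing each call. This is plain shell behavior using the same `ANTHROPIC_MODEL` variable shown above:

```bash
# Pin the model override for the whole shell session
export ANTHROPIC_MODEL=llama3-70b
ccs llamacpp "complex system design"
ccs llamacpp "follow-up question"

# Remove the override when done
unset ANTHROPIC_MODEL
```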
**Symptom:** Invalid model or very slow responses

**Solutions:**
```bash
# Check available models on server
curl http://127.0.0.1:8080/models

# Switch to faster model
ANTHROPIC_MODEL=phi3-mini ccs llamacpp "quick task"

# Restart server with smaller model
./server -m /path/to/smaller-model.gguf
```
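If the raw JSON from the models endpoint is hard to scan, a small filter makes the model ids easier to read. A sketch assuming the response follows the OpenAI-style format with a `data` array and that `jq` is installed:

```bash
# Print only the model ids from the models endpoint
curl -s http://127.0.0.1:8080/models | jq -r '.data[].id'
```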
```bash
# Find what's using port 8080
lsof -i :8080

# Kill the process (if it's an old llama.cpp instance)
kill -9 <PID>

# Or use a different port
./server -m /path/to/model.gguf --host 0.0.0.0 --port 8081

# Then configure CCS
ANTHROPIC_BASE_URL=http://127.0.0.1:8081 ccs llamacpp "test"
```
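Before killing anything, it can be worth checking whether the process on port 8080 is already a working llama.cpp server; if it is, you can simply reuse it. A sketch assuming the standard `/health` endpoint:

```bash
# If the existing process already answers as a llama.cpp server, reuse it instead of killing it
if curl -sf http://127.0.0.1:8080/health > /dev/null; then
  echo "A llama.cpp server is already running on 8080; reusing it"
else
  echo "Port 8080 is occupied by something else; free it or pick another port"
fi
```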
**Q: Can I use llama.cpp with older hardware?**
A: Yes, but it will be slow. CPU-only inference works on any machine. Consider smaller models such as Phi-3 (2.3 GB), which runs acceptably on older hardware.

**Q: How do I update models?**
A: Download new GGUF files and restart the server with `-m /path/to/new-model.gguf`.

**Q: Can I run multiple llama.cpp servers for load balancing?**
A: Yes, start each on a different port and create separate CCS profiles pointing to each (see the sketch below).

**Q: Is llama.cpp compatible with Claude API features?**
A: Basic chat only. Features like vision, web search, and extended thinking require Claude's official API.
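A minimal sketch of the multi-server setup mentioned above. Since the flags for creating a profile with a custom base URL are not shown here, this uses the `ANTHROPIC_BASE_URL` override from the port-conflict section instead; the model paths are placeholders:

```bash
# Run two llama.cpp instances on separate ports
./server -m /path/to/model-a.gguf --host 0.0.0.0 --port 8080 &
./server -m /path/to/model-b.gguf --host 0.0.0.0 --port 8081 &

# Point CCS at either instance via the base URL override
ANTHROPIC_BASE_URL=http://127.0.0.1:8080 ccs llamacpp "task for instance A"
ANTHROPIC_BASE_URL=http://127.0.0.1:8081 ccs llamacpp "task for instance B"
```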