Shimmy benchmarks focus on real‑world performance over theoretical peaks. Since users care about loading models, generating text under memory constraints, and stable throughput across configurations, that's what we measure.
Benchmark Ethos: Every listed benchmark includes commands to reproduce it and the exact system config. We're betting on transparency and reproducibility over cherry‑picked marketing numbers.
| Model | Size | Load Time | First Token | Tokens/sec | Memory |
|---|---|---|---|---|---|
| phi3-mini | 2.4GB | 1.2s | 220ms | 24.3 | 3.1GB |
| llama-7b | 4.1GB | 2.8s | 410ms | 15.7 | 5.9GB |
| mistral-7b | 4.1GB | 2.6s | 380ms | 18.2 | 5.7GB |
System: Intel i7‑12700K, 32GB RAM, NVMe SSD. Measured with `shimmy bench`.
Update this table when adding new models or measuring new systems. Keep the config line accurate.
```bash
# Download or symlink a model to test
export SHIMMY_BASE_GGUF=/path/to/model.gguf

# Run standard benchmark (load + generation)
cargo run --features llama --release -- bench phi3

# Extended benchmark with memory profiling
cargo run --features llama --release -- bench phi3 --extended

# Save results to JSON
cargo run --features llama --release -- bench phi3 --output benchmark.json
```
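If you want to post-process the saved results, `jq` works well. The field names below are assumptions about the output schema, not documented ones; inspect your own benchmark.json first:

```bash
# Pretty-print the full results file
jq '.' benchmark.json

# Pull out headline numbers (field names are illustrative; adjust
# after inspecting the actual schema)
jq '{load: .load_time, tps: .tokens_per_sec}' benchmark.json
```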
To measure end-to-end latency against the running server:

```bash
# Terminal 1: Start the server
cargo run --features llama --release -- serve --bind 127.0.0.1:11435

# Terminal 2: Time the first response
time curl -X POST http://127.0.0.1:11435/api/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "phi3",
    "prompt": "Count to 10",
    "max_tokens": 50,
    "stream": false
  }'
```
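Note that `time` here measures the complete response, not time-to-first-token. With streaming enabled you can approximate first-token latency by stopping at the first streamed chunk. This is a rough sketch that assumes the endpoint emits newline-delimited chunks when `"stream": true` is set; verify against your Shimmy version:

```bash
# Approximate time-to-first-token: exit after the first streamed line
time curl -sN -X POST http://127.0.0.1:11435/api/generate \
  -H 'Content-Type: application/json' \
  -d '{"model":"phi3","prompt":"Count to 10","max_tokens":50,"stream":true}' \
  | head -n 1 > /dev/null
```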
To test throughput under concurrent load:

```bash
# 5 concurrent requests, 100 tokens each
for i in {1..5}; do
  curl -s -X POST http://127.0.0.1:11435/api/generate \
    -H 'Content-Type: application/json' \
    -d '{"model":"phi3","prompt":"Generate a short poem","max_tokens":100}' &
done
wait
```
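To see how latency degrades under load, curl's built-in write-out timers record per-request numbers without extra tooling; `time_starttransfer` approximates time to first byte and `time_total` covers the full response:

```bash
# Per-request timing under concurrency: TTFB and total, in seconds
for i in {1..5}; do
  curl -s -o /dev/null -w '%{time_starttransfer} %{time_total}\n' \
    -X POST http://127.0.0.1:11435/api/generate \
    -H 'Content-Type: application/json' \
    -d '{"model":"phi3","prompt":"Generate a short poem","max_tokens":100}' &
done
wait
```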
Reference system configs:

Config A (i7‑12700K):
- CPU: Intel i7‑12700K (P‑cores: 8, E‑cores: 4)
- RAM: 32GB DDR4‑3200
- Storage: 1TB NVMe (Samsung 980 Pro)
- OS: Ubuntu 22.04
- Rust: 1.75.0
Config B (Apple M2 Pro):
- CPU: Apple M2 Pro (10‑core CPU, 16‑core GPU)
- RAM: 16GB unified memory
- Storage: 512GB SSD
- OS: macOS 14.1
- Rust: 1.74.0
Config C (AMD EPYC 7713):
- CPU: AMD EPYC 7713 (64‑core)
- RAM: 128GB DDR4‑3200
- Storage: 2TB NVMe RAID
- OS: Ubuntu 22.04 Server
- Rust: 1.75.0
If you benchmark on a system not listed here, add your configuration when contributing results.
Detailed results (phi3-mini):

Config A (i7‑12700K):
- Load time: 1.24s ± 0.06s
- First token: 218ms ± 15ms
- Sustained tokens/sec: 24.3 ± 1.2
- Memory usage: 3.1GB peak
Commands to reproduce:
```bash
export SHIMMY_BASE_GGUF=./models/phi3-mini-4k-instruct-q4_0.gguf
cargo run --features llama --release -- bench phi3 --runs 5
```
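The ± figures above reflect variation across the five runs. If you collect raw per-run numbers yourself, a short awk one-liner computes the same statistics; `runs.txt` is a hypothetical file with one tokens/sec value per line:

```bash
# Mean and sample standard deviation of one value per line
awk '{ s += $1; ss += $1 * $1; n++ }
     END { m = s / n; printf "mean=%.2f sd=%.2f\n", m, sqrt((ss - n * m * m) / (n - 1)) }' runs.txt
```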
Testing memory usage across model sizes on Config A:

| Model File Size | Peak Memory | Overhead |
|---|---|---|
| 2GB | 3.1GB | +55% |
| 4GB | 5.9GB | +48% |
| 7GB | 9.2GB | +31% |
Relative memory overhead decreases as model size grows.
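To spot-check peak memory independently of `--extended`, Linux exposes a process's peak resident set size as VmHWM. This sketch assumes a single shimmy process that pgrep can find:

```bash
# Peak resident memory (high-water mark) of the running shimmy process
PID=$(pgrep -f shimmy | head -n 1)
grep VmHWM "/proc/$PID/status"
```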
To contribute results:
- Run the benchmarks on your system.
- Note the exact commands used, so others can reproduce your numbers.
- Include system specs (CPU, RAM, storage, OS).
- Submit a PR adding your results to this file.
We accept all legitimate results, even if they show Shimmy performing poorly — transparency builds trust.
Performance notes:
- First run is slower while the filesystem cache warms (see the cold-cache sketch after this list).
- Memory usage stabilizes after ~30 seconds of generation.
- Token throughput decreases with longer contexts (expected).
- SSD vs HDD: ~3x difference in model load times.
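To reproduce a true cold-cache load time on Linux, flush the page cache between runs (requires root); a minimal sketch:

```bash
# Drop the page cache so the next load reads the model from disk
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
cargo run --features llama --release -- bench phi3
```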
Planned benchmarks:
- CUDA/ROCm GPU acceleration comparison
- Multi‑model serving overhead
- Long context performance (8k+ tokens)
- LoRA adapter switching latency
- Cross‑platform binary size comparison
Last updated: 2024‑01‑15. Keep this date current when adding results.