Skip to content

Latest commit

 

History

History
145 lines (106 loc) · 4.21 KB

File metadata and controls

145 lines (106 loc) · 4.21 KB

Benchmarks (Reproducible)

Shimmy benchmarks focus on real‑world performance over theoretical peaks. Since users care about loading models, generating text under memory constraints, and stable throughput across configurations, that's what we measure.

Benchmark Ethos: Every listed benchmark includes commands to reproduce it and the exact system config. We're betting on transparency and reproducibility over cherry‑picked marketing numbers.

Quick Performance Table

Model Size Load Time First Token Tokens/sec Memory
phi3-mini 2.4GB 1.2s 220ms 24.3 3.1GB
llama-7b 4.1GB 2.8s 410ms 15.7 5.9GB
mistral-7b 4.1GB 2.6s 380ms 18.2 5.7GB

System: Intel i7‑12700K, 32GB RAM, NVMe SSD. Measured with shimmy bench.

Update this table when adding new models or measuring new systems. Keep the config line accurate.

How to Run Benchmarks

1. Built‑in Benchmark Suite

# Download or symlink a model to test
export SHIMMY_BASE_GGUF=/path/to/model.gguf

# Run standard benchmark (load + generation)
cargo run --features llama --release -- bench phi3

# Extended benchmark with memory profiling
cargo run --features llama --release -- bench phi3 --extended

# Save results to JSON
cargo run --features llama --release -- bench phi3 --output benchmark.json

2. Manual Load + Generation Test

# Terminal 1: Start server
cargo run --features llama --release -- serve --bind 127.0.0.1:11435

# Terminal 2: Time first response
time curl -X POST http://127.0.0.1:11435/api/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "phi3",
    "prompt": "Count to 10",
    "max_tokens": 50,
    "stream": false
  }'

3. Stress Test (Multiple Concurrent)

# 5 concurrent requests, 100 tokens each
for i in {1..5}; do
  (curl -X POST http://127.0.0.1:11435/api/generate \
    -H 'Content-Type: application/json' \
    -d '{"model":"phi3","prompt":"Generate a short poem","max_tokens":100}' &)
done
wait

System Configurations Tested

Config A: Development Laptop

  • CPU: Intel i7‑12700K (P‑cores: 8, E‑cores: 4)
  • RAM: 32GB DDR4‑3200
  • Storage: 1TB NVMe (Samsung 980 Pro)
  • OS: Ubuntu 22.04
  • Rust: 1.75.0

Config B: M2 MacBook Pro

  • CPU: Apple M2 Pro (10‑core CPU, 16‑core GPU)
  • RAM: 16GB unified memory
  • Storage: 512GB SSD
  • OS: macOS 14.1
  • Rust: 1.74.0

Config C: Production Server

  • CPU: AMD EPYC 7713 (64‑core)
  • RAM: 128GB DDR4‑3200
  • Storage: 2TB NVMe RAID
  • OS: Ubuntu 22.04 Server
  • Rust: 1.75.0

Add your config if you run benchmarks and contribute results.

Benchmark Results Archive

2024‑01‑15: phi3‑mini Performance

Config A (i7‑12700K):

  • Load time: 1.24s ± 0.06s
  • First token: 218ms ± 15ms
  • Sustained tokens/sec: 24.3 ± 1.2
  • Memory usage: 3.1GB peak

Commands to reproduce:

export SHIMMY_BASE_GGUF=./models/phi3-mini-4k-instruct-q4_0.gguf
cargo run --features llama --release -- bench phi3 --runs 5

2024‑01‑12: Memory Scaling Test

Testing memory usage across model sizes on Config A:

Model Size Memory Peak Overhead
2GB 3.1GB +55%
4GB 5.9GB +48%
7GB 9.2GB +31%

Shows decreasing overhead percentage as models scale.

Adding Your Benchmark

  1. Run the test with your system config documented.
  2. Note exact commands used to reproduce.
  3. Include system specs (CPU, RAM, storage, OS).
  4. Submit PR adding results to this file.

We accept all legitimate results, even if they show Shimmy performing poorly — transparency builds trust.

Known Performance Notes

  • First run slower due to filesystem cache warming.
  • Memory usage stabilizes after ~30 seconds of generation.
  • Token throughput decreases with longer contexts (expected).
  • SSD vs HDD: ~3x difference in model load times.

Future Benchmarks

  • CUDA/ROCm GPU acceleration comparison
  • Multi‑model serving overhead
  • Long context performance (8k+ tokens)
  • LoRA adapter switching latency
  • Cross‑platform binary size comparison

Last updated: 2024‑01‑15. Keep this date current when adding results.