A comprehensive WPF (Windows Presentation Foundation) wrapper for llama.cpp, providing an intuitive graphical interface for managing and interacting with local large language models.
-
Multi-Variant Support: Download and manage multiple llama.cpp variants:
- CPU (AVX/AVX2/AVX-512)
- CUDA (NVIDIA GPU)
- Vulkan (cross-platform GPU)
- HIP/ROCm (AMD GPU)
- SYCL (Intel GPU)
-
Server Management:
- Start/stop local llama.cpp server instances
- Real-time server logs and monitoring
- Loading indicator while the model initializes
- Comprehensive configurable server parameters (see Server Parameters)
-
Web-Based Chat:
- Automatically downloads llama.cpp's built-in WebUI
- Launch Chat Client button opens the chat interface in your default browser once the server is ready
- Full-featured web chat with streaming responses, conversation history, and model settings
-
Theme Support:
- Toggle between dark and light themes at any time
-
Settings Persistence:
- All server configuration and preferences are saved automatically
- Option to suppress the startup welcome screen
-
Automatic Updates:
- Check for the latest llama.cpp releases from GitHub
- One-click download and installation
- Download progress display with cancel support
- Version tracking for installed variants
- Windows 10/11
- .NET 8.0 Runtime or SDK
- A GGUF format model file (can be downloaded from Hugging Face)
- Clone the repository:
git clone https://github.com/dmeldrum6/Llama-Forge.git
cd Llama-Forge- Build the project:
cd LlamaForge
dotnet build- Run the application:
dotnet runOn first launch, a welcome screen walks you through the five setup steps. You can disable it via the "Don't show this screen again" checkbox.
- Launch Llama Forge
- Navigate to the Download / Update tab
- Select your preferred variant:
- CUDA — NVIDIA GPU (recommended if you have an NVIDIA card)
- Vulkan — Cross-platform GPU acceleration
- HIP/ROCm — AMD GPU
- SYCL — Intel GPU
- CPU — CPU-only execution
- Click Check for Updates to see the latest available version
- Click Download Selected to download and install
Download a GGUF model file. Some popular options:
Recommended for testing: small models like Phi-3-mini-4k-instruct-q4.gguf or Llama-3.2-1B-Instruct-Q4_K_M.gguf.
- Navigate to the Server tab
- Select the llama.cpp variant you downloaded
- Click Browse... and select your GGUF model file
- Adjust settings as needed (see Server Parameters below)
- Click Start Server
- A loading indicator will appear while the model initializes — wait for it to complete
Once the model finishes loading, the Launch Chat Client button in the Server tab becomes active. Click it to open llama.cpp's built-in web chat interface in your default browser.
| Parameter | Default | Description |
|---|---|---|
| Model Path | (empty) | Path to your GGUF model file |
| Host | 127.0.0.1 |
Server listen address |
| Port | 8080 |
Server port (1–65535) |
| Context Size | 2048 |
Maximum context length in tokens |
| Threads | 4 |
CPU threads for prompt processing |
| Batch Size | 512 |
Prompt processing batch size |
| Batch Threads | 4 |
Threads used for batching |
| Parallel Slots | 1 |
Number of parallel request slots |
| Continuous Batching | false |
Enable continuous batching |
| GPU Layers | 0 |
Model layers to offload to GPU (0 = CPU only) |
| Memory Lock | false |
Lock model in RAM to prevent swapping |
| Disable Memory Mapping | false |
Disable mmap for model loading |
| Model Alias | (empty) | Alias name reported by the API |
| API Key | (empty) | Optional API key for server access |
| Timeout | 600 |
Request timeout in seconds |
| Enable Embeddings | false |
Expose the embeddings endpoint |
| System Prompt | You are a helpful assistant |
Default system prompt for chat |
| Temperature | 0.7 |
Sampling temperature (0.0–2.0) |
| Max Tokens | 2048 |
Maximum tokens per response |
| Max Chat History | 20 |
Number of past messages sent as context |
| Verbose Logging | false |
Enable detailed server log output |
| Additional Args | (empty) | Extra llama.cpp command-line arguments |
- Download the variant that matches your GPU (CUDA → NVIDIA, HIP/ROCm → AMD, Vulkan → any modern GPU, SYCL → Intel)
- Set GPU Layers to a value greater than
0- Start with
32and increase until you run out of VRAM - More layers = faster inference but higher VRAM usage
- Start with
- Ensure the appropriate drivers are installed for your GPU
Click Auto-Detect next to the Threads field to automatically set the value to your logical CPU core count.
LlamaForge/
├── Controls/ # Custom UI controls
│ └── MessageContentControl.xaml # Content rendering control (text and syntax-highlighted code blocks)
├── Helpers/ # Utility classes
│ ├── InverseBooleanConverter.cs
│ └── InverseBooleanToVisibilityConverter.cs
├── Models/ # Data models
│ ├── AppSettings.cs # Persisted application settings
│ ├── ChatMessage.cs # Chat message representation
│ ├── DownloadableVariant.cs # Downloadable release asset info
│ ├── GitHubRelease.cs # GitHub API release model
│ ├── LlamaVariant.cs # Variant type definitions
│ ├── ModelInfo.cs # Model metadata from llama.cpp API
│ └── ServerConfig.cs # Full server configuration
├── Services/ # Core services
│ ├── GitHubService.cs # GitHub API integration & binary management
│ ├── LlamaChatClient.cs # Chat API client (streaming)
│ ├── LlamaServerManager.cs # Server process lifecycle management
│ └── SettingsService.cs # Load/save settings from disk
├── ViewModels/ # MVVM view models
│ └── MainViewModel.cs # Central application state and commands
├── App.xaml # Application entry point & startup screen logic
├── MainWindow.xaml # Main application window
├── StartupScreen.xaml # First-run welcome/onboarding screen
└── LlamaForge.csproj # Project file
Llama Forge follows the MVVM (Model-View-ViewModel) pattern:
- Models: Data structures for chat messages, server configuration, settings, and model metadata
- Services: Business logic — GitHub API calls, server process management, chat streaming, settings persistence
- ViewModels: Bridges views and services; manages all UI state and commands
- Controls: Custom WPF controls (e.g.,
MessageContentControlfor rendering text with syntax-highlighted code blocks)
- Settings:
%LocalAppData%\LlamaForge\settings.json - Downloaded llama.cpp variants:
%LocalAppData%\LlamaForge\llama-cpp\<variant>\ - WebUI files: stored alongside the llama.cpp variant binaries
- Verify the correct variant is downloaded (check the Download / Update tab)
- Confirm the model file path is valid
- Check Server Logs for specific error messages
- Ensure the configured port is not already in use
- A loading indicator appears after the server process starts; wait for it to complete before the button becomes active
- If it takes unusually long, check the Server Logs tab for errors during model loading
- Confirm you downloaded the correct variant (CUDA/HIP/Vulkan/SYCL)
- Set GPU Layers to a value greater than
0 - Verify GPU drivers are installed and up to date
- Review Server Logs for GPU detection messages
- Check your internet connection
- Verify that GitHub is reachable
- Try a different variant
- Use the Cancel button and retry
- Increase GPU Layers if you have a GPU
- Use a smaller or more quantized model (e.g., Q4 instead of Q8)
- Reduce Context Size
- Click Auto-Detect to optimize your thread count
| Package | Version | Purpose |
|---|---|---|
Newtonsoft.Json |
13.0.3 | JSON serialization |
CommunityToolkit.Mvvm |
8.2.2 | MVVM helpers and relay commands |
AvalonEdit |
6.3.0.90 | Syntax-highlighted code block rendering |
Llama Forge integrates with llama.cpp through:
- Process Management: Spawns
llama-server.exeas a managed child process - HTTP API: Communicates via llama.cpp's HTTP API:
GET /health— Server health check and model load statusPOST /v1/chat/completions— OpenAI-compatible streaming chat endpointGET /v1/models— Retrieve loaded model information
Phase 1 (Current):
- WPF UI with dark/light theme support
- Server management with real-time logs
- Web-based chat via llama.cpp's built-in WebUI (launched from the app)
- Multi-variant support (CPU, CUDA, Vulkan, ROCm, SYCL)
- Automatic llama.cpp updates with progress and cancel
- Settings persistence
- Startup welcome screen
- WebUI integration with browser launch
- Loading indicator during model initialization
- Model metadata display (architecture, parameter count, etc.)
- Configurable system prompt, temperature, and chat history
Phase 2 (Future):
- Model download manager (download GGUF files from within the app)
- Preset server configurations
- Multiple simultaneous server instances
- Model quantization tools
This project is licensed under the MIT License — see the LICENSE file for details.
- llama.cpp — The C++ LLM inference engine powering this app
- Georgi Gerganov — Creator of llama.cpp
- The open-source AI community
Note: Llama Forge is a GUI wrapper. All AI inference is performed by llama.cpp. Model quality and performance depend on the underlying llama.cpp implementation and the models you use.