# flutter_llama
Flutter plugin for running LLM inference with llama.cpp and GGUF models on Android, iOS, and macOS.
**Version 1.0.0 - production-ready with full GPU acceleration!**
## Features

- High-performance LLM inference using llama.cpp
- Native support for Android, iOS, and macOS
- Full GPU acceleration:
  - Metal for iOS/macOS (3-10x faster)
  - Vulkan for Android (4-8x faster)
  - OpenCL fallback (2-5x faster)
  - Automatic GPU detection and fallback
- Streaming and blocking text generation
- Full control over generation parameters
- GGUF model format (industry standard)
- Easy to integrate and production-ready
- Covered by 71 unit tests
## Installation

Add this to your package's `pubspec.yaml` file:

```yaml
dependencies:
  flutter_llama:
    path: ../flutter_llama # Adjust path as needed
```

Then run:

```bash
flutter pub get
```
## Usage

### 1. Load a Model

We recommend using the braindler model from Ollama - a compact and efficient model well suited to mobile devices.
```dart
import 'package:flutter_llama/flutter_llama.dart';

final llama = FlutterLlama.instance;

// Using braindler Q4_K_S quantization (88MB - optimal balance)
final config = LlamaConfig(
  modelPath: '/path/to/braindler-q4_k_s.gguf', // braindler from ollama.com/nativemind/braindler
  nThreads: 4,
  nGpuLayers: 0, // 0 = CPU only, -1 = all layers on GPU
  contextSize: 2048,
  batchSize: 512,
  useGpu: true,
  verbose: false,
);

final success = await llama.loadModel(config);
if (success) {
  print('Braindler model loaded successfully!');
}
```
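On a real device the model is rarely at a fixed absolute path. As a minimal sketch of one way to obtain a valid `modelPath` (this helper is not part of flutter_llama; it assumes the `path_provider` package and a GGUF shipped as a Flutter asset, which only makes sense for small quantizations):

```dart
import 'dart:io';

import 'package:flutter/services.dart' show rootBundle;
import 'package:path_provider/path_provider.dart';

// Hypothetical helper: copies a GGUF bundled under assets/ into the app's
// documents directory so the native llama.cpp code can open it from a
// real file path.
Future<String> copyModelFromAssets(String assetName) async {
  final docs = await getApplicationDocumentsDirectory();
  final file = File('${docs.path}/$assetName');
  if (!await file.exists()) {
    final data = await rootBundle.load('assets/$assetName');
    await file.writeAsBytes(data.buffer.asUint8List(), flush: true);
  }
  return file.path;
}
```

Downloading the model on first launch is the usual alternative when the GGUF is too large to ship inside the app bundle.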
### 2. Generate Text (Blocking)
```dart
final params = GenerationParams(
  prompt: 'Hello, how are you?',
  temperature: 0.8,
  topP: 0.95,
  topK: 40,
  maxTokens: 512,
  repeatPenalty: 1.1,
);

try {
  final response = await llama.generate(params);
  print('Generated: ${response.text}');
  print('Tokens: ${response.tokensGenerated}');
  print('Speed: ${response.tokensPerSecond.toStringAsFixed(2)} tok/s');
} catch (e) {
  print('Error: $e');
}
```
### 3. Generate Text (Streaming)
```dart
final params = GenerationParams(
  prompt: 'Tell me a story',
  maxTokens: 1000,
);

try {
  await for (final token in llama.generateStream(params)) {
    print(token); // Print each token as it's generated
  }
} catch (e) {
  print('Error: $e');
}
```
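If the UI needs to abandon a long generation, you can drive the same stream with a `StreamSubscription` instead of `await for`. A sketch, assuming `generateStream` returns an ordinary Dart `Stream<String>` (whether cancelling also stops the native-side computation depends on the plugin implementation):

```dart
import 'dart:async';

final buffer = StringBuffer();
final StreamSubscription<String> sub = llama.generateStream(params).listen(
  buffer.write, // accumulate tokens for display
  onError: (e) => print('Error: $e'),
  onDone: () => print('Full response: $buffer'),
);

// Later, e.g. when the user taps "Stop" or leaves the screen:
await sub.cancel();
```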
### 4. Get Model Info
```dart
final info = await llama.getModelInfo();
if (info != null) {
  print('Model path: ${info['modelPath']}');
  print('Parameters: ${info['nParams']}');
  print('Layers: ${info['nLayers']}');
  print('Context size: ${info['contextSize']}');
}
```
### 5. Unload Model

```dart
await llama.unloadModel();
print('Model unloaded');
```
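In an app this typically belongs in widget teardown. A minimal sketch inside a `State` class (note that `dispose` cannot be `async`, so the future is deliberately not awaited):

```dart
@override
void dispose() {
  // Frees the native model memory; fire-and-forget from dispose().
  FlutterLlama.instance.unloadModel();
  super.dispose();
}
```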
## Configuration Options

### LlamaConfig

- `modelPath` (String, required): Path to the GGUF model file
- `nThreads` (int, default: 4): Number of CPU threads to use
- `nGpuLayers` (int, default: 0): Number of layers to offload to GPU (0 = CPU only, -1 = all)
- `contextSize` (int, default: 2048): Context size in tokens
- `batchSize` (int, default: 512): Batch size for processing
- `useGpu` (bool, default: true): Enable GPU acceleration
- `verbose` (bool, default: false): Enable verbose logging
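For devices where GPU acceleration is unavailable or unreliable (see Limitations below), these fields give a CPU-only setup; a sketch with illustrative values:

```dart
final cpuConfig = LlamaConfig(
  modelPath: modelPath,
  nThreads: 4,    // see the threading tips under Performance Tips
  nGpuLayers: 0,  // keep every layer on the CPU
  useGpu: false,  // skip GPU detection entirely
);
```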
### GenerationParams

- `prompt` (String, required): The prompt for text generation
- `temperature` (double, default: 0.8): Sampling temperature (0.0 - 2.0)
- `topP` (double, default: 0.95): Top-P sampling (0.0 - 1.0)
- `topK` (int, default: 40): Top-K sampling
- `maxTokens` (int, default: 512): Maximum tokens to generate
- `repeatPenalty` (double, default: 1.1): Penalty for repeating tokens
- `stopSequences` (List<String>): Sequences that end generation early when produced
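As a combined example, a chat-style request might look like the sketch below; the `stopSequences` entry in the source docs is truncated, so its exact type and behavior here are assumptions:

```dart
final params = GenerationParams(
  prompt: 'User: What is the GGUF format?\nAssistant:',
  temperature: 0.7,
  topP: 0.9,
  topK: 40,
  maxTokens: 256,
  repeatPenalty: 1.1,
  stopSequences: ['User:'], // assumed: stop when the model starts a new turn
);
```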
## Example App

See the example directory for a complete example application.

## Performance Tips

### GPU Acceleration (New in v1.0.0)

GPU acceleration is now fully enabled and automatic!
- **iOS/macOS (Metal):**

  ```dart
  nGpuLayers: -1, // -1 = all layers on Metal GPU
  useGpu: true,   // Metal automatically detected
  ```

  - Performance: 3-10x faster than CPU
  - Devices: iPhone 8+, iPad Pro, MacBook Pro
  - Expected: ~45-50 tok/s on iPhone 14 Pro

- **Android (Vulkan/OpenCL):**

  ```dart
  nGpuLayers: -1, // -1 = all layers on GPU
  useGpu: true,   // Vulkan auto-detected, falls back to OpenCL
  ```

  - Performance: 4-8x faster than CPU (Vulkan), 2-5x (OpenCL)
  - Devices: Android 7.0+ with Vulkan support
  - Expected: ~18-25 tok/s on flagship devices
  - Fallback: Automatically uses OpenCL if Vulkan is unavailable

- **Threading (CPU fallback):** Adjust `nThreads` based on the device:
  - Mobile devices: 4-6 threads
  - High-end devices: 6-8 threads

- **Model Size:** Use braindler quantized models for better performance:
  - `braindler:q2_k` (72MB): Smallest, fastest, good quality
  - `braindler:q4_k_s` (88MB): Recommended - optimal balance
  - `braindler:q5_k_m` (103MB): Higher quality, larger size
  - `braindler:q8_0` (140MB): Best quality, largest size

  Get them from: https://ollama.com/nativemind/braindler

- **Context Size:** Reduce it if memory is limited:
  - Small devices: 1024-2048
  - Medium devices: 2048-4096
  - High-end devices: 4096-8192
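Putting these tips together, a GPU-first configuration for a recent phone could look like the sketch below; the values are illustrative starting points taken from the ranges above, not benchmarks:

```dart
final config = LlamaConfig(
  modelPath: '/path/to/braindler-q4_k_s.gguf',
  nThreads: 6,       // used when generation falls back to the CPU
  nGpuLayers: -1,    // offload all layers when a GPU is detected
  contextSize: 2048, // modest context for memory-constrained devices
  batchSize: 512,
  useGpu: true,      // Metal/Vulkan/OpenCL chosen automatically
);
```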
## Requirements

### iOS

- iOS 13.0 or later
- Xcode 14.0 or later
- Metal support for GPU acceleration

### Android

- Android API level 24 (Android 7.0) or later
- NDK r25 or later
- Vulkan support for GPU acceleration (optional)
## Building

The plugin includes native C++ code for llama.cpp integration.

### iOS

The iOS framework is built automatically when you build your Flutter app.

### Android

The Android native library is built automatically using CMake and the NDK.

## GGUF Models

This plugin supports the GGUF model format. We recommend the braindler model:

### Recommended: Braindler Model
Get the braindler model from Ollama. Available quantizations:

- `braindler:q2_k` - 72MB - Fastest, good quality
- `braindler:q3_k_s` - 77MB - Better quality
- `braindler:q4_k_s` - 88MB - Recommended - optimal balance
- `braindler:q5_k_m` - 103MB - Higher quality
- `braindler:q8_0` - 140MB - Best quality
- `braindler:f16` - 256MB - Maximum quality

How to get braindler models:

1. Install Ollama: https://ollama.com
2. Pull the model:

   ```bash
   ollama pull nativemind/braindler:q4_k_s
   ```

3. Export it to GGUF:

   ```bash
   ollama export nativemind/braindler:q4_k_s -o braindler-q4_k_s.gguf
   ```

4. Copy the GGUF file to your app's assets or documents directory
Other model sources: any model distributed in GGUF format should also work.
## Limitations

- Model file must be accessible on device storage
- Large models require significant RAM
- Generation speed depends on device capabilities
- Some older devices may not support GPU acceleration
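Since the model file must exist on device storage and large models can exhaust RAM, it can be worth failing fast with a clear message before calling into native code. A minimal sketch (the helper name is hypothetical):

```dart
import 'dart:io';

Future<void> loadOrThrow(LlamaConfig config) async {
  if (!await File(config.modelPath).exists()) {
    throw ArgumentError('Model file not found: ${config.modelPath}');
  }
  final ok = await FlutterLlama.instance.loadModel(config);
  if (!ok) {
    // loadModel returning false may indicate insufficient RAM or a
    // corrupt/unsupported GGUF file.
    throw StateError('Failed to load model: ${config.modelPath}');
  }
}
```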
## License
This plugin is released under the NativeMindNONC License. See LICENSE file for details.
llama.cpp is released under the MIT License.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments
- llama.cpp by Georgi Gerganov
- The Flutter team for the excellent framework
- Braindler model from Ollama - recommended model for mobile devices