# flutter_llama

A Flutter plugin for running LLM inference with llama.cpp and GGUF models on Android and iOS.
## Features

- High-performance LLM inference using llama.cpp
- Native support for Android and iOS
- GPU acceleration (Metal on iOS, Vulkan/OpenCL on Android)
- Streaming and blocking text generation
- Full control over generation parameters
- Support for the GGUF model format
- Easy to integrate and use
## Installation

Add this to your package's `pubspec.yaml` file:

```yaml
dependencies:
  flutter_llama:
    path: ../flutter_llama # Adjust path as needed
```

Then run:

```bash
flutter pub get
```
## Usage

### 1. Load a Model
We recommend the braindler model from Ollama, a compact and efficient model well suited to mobile devices.
```dart
import 'package:flutter_llama/flutter_llama.dart';

final llama = FlutterLlama.instance;

// Using braindler Q4_K_S quantization (88MB - optimal balance)
final config = LlamaConfig(
  modelPath: '/path/to/braindler-q4_k_s.gguf', // braindler from ollama.com/nativemind/braindler
  nThreads: 4,
  nGpuLayers: 0, // 0 = CPU only, -1 = all layers on GPU
  contextSize: 2048,
  batchSize: 512,
  useGpu: true,
  verbose: false,
);

final success = await llama.loadModel(config);
if (success) {
  print('Braindler model loaded successfully!');
}
```
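Since `loadModel` returns `false` on failure, it can help to verify the path up front. A minimal sketch (the path is a placeholder):

```dart
import 'dart:io';

final path = '/path/to/braindler-q4_k_s.gguf';
if (!File(path).existsSync()) {
  print('Model file not found at $path');
} else {
  final ok = await llama.loadModel(LlamaConfig(modelPath: path));
  print(ok ? 'Model loaded' : 'loadModel returned false');
}
```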
### 2. Generate Text (Blocking)
```dart
final params = GenerationParams(
  prompt: 'Hello, how are you?',
  temperature: 0.8,
  topP: 0.95,
  topK: 40,
  maxTokens: 512,
  repeatPenalty: 1.1,
);

try {
  final response = await llama.generate(params);
  print('Generated: ${response.text}');
  print('Tokens: ${response.tokensGenerated}');
  print('Speed: ${response.tokensPerSecond.toStringAsFixed(2)} tok/s');
} catch (e) {
  print('Error: $e');
}
```
### 3. Generate Text (Streaming)
```dart
final params = GenerationParams(
  prompt: 'Tell me a story',
  maxTokens: 1000,
);

try {
  await for (final token in llama.generateStream(params)) {
    print(token); // Print each token as it's generated
  }
} catch (e) {
  print('Error: $e');
}
```
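In a real app you will usually accumulate the streamed tokens instead of printing them; a minimal sketch using a plain `StringBuffer` (the UI update is only indicated in a comment):

```dart
final buffer = StringBuffer();
await for (final token in llama.generateStream(params)) {
  buffer.write(token);
  // In a StatefulWidget: setState(() => _output = buffer.toString());
}
print('Full response: $buffer');
```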
### 4. Get Model Info
```dart
final info = await llama.getModelInfo();
if (info != null) {
  print('Model path: ${info['modelPath']}');
  print('Parameters: ${info['nParams']}');
  print('Layers: ${info['nLayers']}');
  print('Context size: ${info['contextSize']}');
}
```
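The `nParams` value is a raw parameter count; a small hypothetical helper to make it readable:

```dart
// Formats a raw parameter count, e.g. 1100000000 -> '1.1B'.
String formatParams(int nParams) => nParams >= 1000000000
    ? '${(nParams / 1000000000).toStringAsFixed(1)}B'
    : '${(nParams / 1000000).toStringAsFixed(0)}M';
```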
### 5. Unload Model
```dart
await llama.unloadModel();
print('Model unloaded');
```
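To guarantee the native memory is released even when generation throws, the whole lifecycle can go in a `try`/`finally`; a minimal sketch using only the calls shown above:

```dart
Future<void> runOnce(LlamaConfig config, GenerationParams params) async {
  final llama = FlutterLlama.instance;
  if (!await llama.loadModel(config)) {
    print('Failed to load model');
    return;
  }
  try {
    final response = await llama.generate(params);
    print(response.text);
  } finally {
    await llama.unloadModel(); // always release the model, even on error
  }
}
```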
## Configuration Options

### LlamaConfig
- `modelPath` (String, required): Path to the GGUF model file
- `nThreads` (int, default: 4): Number of CPU threads to use
- `nGpuLayers` (int, default: 0): Number of layers to offload to GPU (0 = CPU only, -1 = all)
- `contextSize` (int, default: 2048): Context size in tokens
- `batchSize` (int, default: 512): Batch size for processing
- `useGpu` (bool, default: true): Enable GPU acceleration
- `verbose` (bool, default: false): Enable verbose logging
### GenerationParams
- `prompt` (String, required): The prompt for text generation
- `temperature` (double, default: 0.8): Sampling temperature (0.0 - 2.0)
- `topP` (double, default: 0.95): Top-P sampling (0.0 - 1.0)
- `topK` (int, default: 40): Top-K sampling
- `maxTokens` (int, default: 512): Maximum tokens to generate
- `repeatPenalty` (double, default: 1.1): Penalty for repeating tokens
- `stopSequences` (List<String>): Sequences that end generation when produced (see the example below)
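For example, `stopSequences` can cut off generation at a chat-turn boundary; a sketch, where the marker string is just an illustration:

```dart
final params = GenerationParams(
  prompt: 'User: Hello!\nAssistant:',
  maxTokens: 256,
  stopSequences: ['User:'], // stop before the model starts the next user turn
);
```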
## Example App

See the `example` directory for a complete example application.
## Performance Tips
- **GPU Acceleration**: Set `nGpuLayers` to offload layers to GPU (a combined sketch follows this list):
  - iOS (Metal): Set to `-1` for all layers
  - Android (Vulkan): Start with `32` and adjust based on device
- **Threading**: Adjust `nThreads` based on device CPU cores:
  - Mobile devices: 4-6 threads
  - High-end devices: 6-8 threads
- **Model Size**: Use braindler quantized models for better performance:
  - `braindler:q2_k` (72MB): Smallest, fastest, good quality
  - `braindler:q4_k_s` (88MB): Recommended, optimal balance
  - `braindler:q5_k_m` (103MB): Higher quality, larger size
  - `braindler:q8_0` (140MB): Best quality, largest size

  Get from: https://ollama.com/nativemind/braindler
- **Context Size**: Reduce if memory is limited:
  - Small devices: 1024-2048
  - Medium devices: 2048-4096
  - High-end devices: 4096-8192
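Combining these tips, a hypothetical helper that picks settings per platform; the specific numbers come from the suggestions above, not from plugin requirements:

```dart
import 'dart:io' show Platform;

import 'package:flutter_llama/flutter_llama.dart';

/// Builds a LlamaConfig following the tips above (illustrative values).
LlamaConfig buildConfig(String modelPath) {
  return LlamaConfig(
    modelPath: modelPath,
    nThreads: 4, // 4-6 on typical mobile CPUs, 6-8 on high-end devices
    // Metal handles full offload well; on Android Vulkan, start lower and tune.
    nGpuLayers: Platform.isIOS ? -1 : 32,
    contextSize: 2048, // drop toward 1024 on memory-constrained devices
    batchSize: 512,
    useGpu: true,
  );
}
```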
## Requirements

### iOS
- iOS 13.0 or later
- Xcode 14.0 or later
- Metal support for GPU acceleration
### Android
- Android API level 24 (Android 7.0) or later
- NDK r25 or later
- Vulkan support for GPU acceleration (optional)
## Building
The plugin includes native C++ code for llama.cpp integration.
### iOS
The iOS framework will be built automatically when you build your Flutter app.
### Android
The Android native library will be built automatically using CMake/NDK.
## GGUF Models

This plugin supports the GGUF model format. We recommend using the braindler model:
### Recommended: Braindler Model

Get the braindler model from Ollama:
Available quantizations:

- `braindler:q2_k` - 72MB - Fastest, good quality
- `braindler:q3_k_s` - 77MB - Better quality
- `braindler:q4_k_s` - 88MB - Recommended, optimal balance
- `braindler:q5_k_m` - 103MB - Higher quality
- `braindler:q8_0` - 140MB - Best quality
- `braindler:f16` - 256MB - Maximum quality
How to get braindler models:

1. Install Ollama: https://ollama.com
2. Pull the model: `ollama pull nativemind/braindler:q4_k_s`
3. Export to GGUF: `ollama export nativemind/braindler:q4_k_s -o braindler-q4_k_s.gguf`
4. Copy the GGUF file to your app's assets or documents directory
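For the last step, one common pattern is to bundle the GGUF file as a Flutter asset and copy it to the documents directory on first launch. A minimal sketch, assuming the file is declared under `assets:` in pubspec.yaml and the app depends on the `path_provider` package:

```dart
import 'dart:io';

import 'package:flutter/services.dart' show rootBundle;
import 'package:path_provider/path_provider.dart';

/// Copies the bundled model into the documents directory (once) and
/// returns a path suitable for LlamaConfig.modelPath.
Future<String> prepareModel() async {
  final docs = await getApplicationDocumentsDirectory();
  final file = File('${docs.path}/braindler-q4_k_s.gguf');
  if (!await file.exists()) {
    final data = await rootBundle.load('assets/braindler-q4_k_s.gguf');
    await file.writeAsBytes(
      data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes),
      flush: true,
    );
  }
  return file.path;
}
```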
Other model sources: any GGUF model compatible with llama.cpp should also work, for example quantized models published on Hugging Face.
## Limitations
- Model file must be accessible on device storage
- Large models require significant RAM
- Generation speed depends on device capabilities
- Some older devices may not support GPU acceleration
## License
This plugin is released under the NativeMindNONC License. See LICENSE file for details.
llama.cpp is released under the MIT License.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments

- [llama.cpp](https://github.com/ggerganov/llama.cpp) by Georgi Gerganov
- The Flutter team for the excellent framework
- Braindler model from Ollama - recommended model for mobile devices