flutter_llama #
Flutter plugin for running LLM inference with llama.cpp and GGUF models on Android, iOS, and macOS.
Version 1.0.0 - Production Ready with Full GPU Acceleration! #
Features #
- High-performance LLM inference using llama.cpp
- Native support for Android, iOS, and macOS
- Full GPU acceleration:
  - Metal for iOS/macOS (3-10x faster)
  - Vulkan for Android (4-8x faster)
  - OpenCL fallback (2-5x faster)
  - Automatic GPU detection and fallback
- Streaming and blocking text generation
- Full control over generation parameters
- GGUF model format (industry standard)
- Easy to integrate and production-ready
- 71 unit tests, fully tested
Installation #
Add this to your package's pubspec.yaml file:

```yaml
dependencies:
  flutter_llama:
    path: ../flutter_llama  # Adjust path as needed
```

Then run:

```
flutter pub get
```
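The package is also published on pub.dev (version 1.1.2 at the time of writing), so instead of a path dependency you can depend on the hosted release:

```yaml
dependencies:
  flutter_llama: ^1.1.2
```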
Usage #
1. Load a Model #
We recommend using the braindler model from Ollama - a compact and efficient model perfect for mobile devices.
```dart
import 'package:flutter_llama/flutter_llama.dart';

final llama = FlutterLlama.instance;

// Using braindler Q4_K_S quantization (88MB - optimal balance)
final config = LlamaConfig(
  modelPath: '/path/to/braindler-q4_k_s.gguf', // braindler from ollama.com/nativemind/braindler
  nThreads: 4,
  nGpuLayers: 0, // 0 = CPU only, -1 = all layers on GPU
  contextSize: 2048,
  batchSize: 512,
  useGpu: true,
  verbose: false,
);

final success = await llama.loadModel(config);
if (success) {
  print('Braindler model loaded successfully!');
}
```
2. Generate Text (Blocking) #
```dart
final params = GenerationParams(
  prompt: 'Hello, how are you?',
  temperature: 0.8,
  topP: 0.95,
  topK: 40,
  maxTokens: 512,
  repeatPenalty: 1.1,
);

try {
  final response = await llama.generate(params);
  print('Generated: ${response.text}');
  print('Tokens: ${response.tokensGenerated}');
  print('Speed: ${response.tokensPerSecond.toStringAsFixed(2)} tok/s');
} catch (e) {
  print('Error: $e');
}
```
3. Generate Text (Streaming) #
```dart
final params = GenerationParams(
  prompt: 'Tell me a story',
  maxTokens: 1000,
);

try {
  await for (final token in llama.generateStream(params)) {
    print(token); // Print each token as it's generated
  }
} catch (e) {
  print('Error: $e');
}
```
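In a UI you will usually accumulate the streamed tokens instead of printing them. A minimal sketch using only the `generateStream` API and the same `params` shown above (updating a widget with the buffer is left as a comment):

```dart
final buffer = StringBuffer();

await for (final token in llama.generateStream(params)) {
  buffer.write(token);
  // e.g. setState(() => _output = buffer.toString()); inside a StatefulWidget
}

final fullText = buffer.toString();
print('Full response: $fullText');
```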
4. Get Model Info #
```dart
final info = await llama.getModelInfo();
if (info != null) {
  print('Model path: ${info['modelPath']}');
  print('Parameters: ${info['nParams']}');
  print('Layers: ${info['nLayers']}');
  print('Context size: ${info['contextSize']}');
}
```
5. Unload Model #
```dart
await llama.unloadModel();
print('Model unloaded');
```
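A loaded model holds a large amount of native memory, so it is good practice to unload it when the owning widget is disposed. A minimal sketch; `ChatPage` is a hypothetical widget, and only `FlutterLlama.instance` and `unloadModel` come from the plugin:

```dart
import 'package:flutter/material.dart';
import 'package:flutter_llama/flutter_llama.dart';

class ChatPage extends StatefulWidget {
  const ChatPage({super.key});

  @override
  State<ChatPage> createState() => _ChatPageState();
}

class _ChatPageState extends State<ChatPage> {
  final _llama = FlutterLlama.instance;

  @override
  void dispose() {
    // Release the native model memory when this page goes away.
    // unloadModel is async; here we fire and forget.
    _llama.unloadModel();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) => const Placeholder();
}
```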
Configuration Options #
LlamaConfig #
- `modelPath` (String, required): Path to the GGUF model file
- `nThreads` (int, default: 4): Number of CPU threads to use
- `nGpuLayers` (int, default: 0): Number of layers to offload to GPU (0 = CPU only, -1 = all)
- `contextSize` (int, default: 2048): Context size in tokens
- `batchSize` (int, default: 512): Batch size for processing
- `useGpu` (bool, default: true): Enable GPU acceleration
- `verbose` (bool, default: false): Enable verbose logging
GenerationParams #
- `prompt` (String, required): The prompt for text generation
- `temperature` (double, default: 0.8): Sampling temperature (0.0 - 2.0)
- `topP` (double, default: 0.95): Top-P sampling (0.0 - 1.0)
- `topK` (int, default: 40): Top-K sampling
- `maxTokens` (int, default: 512): Maximum tokens to generate
- `repeatPenalty` (double, default: 1.1): Penalty for repeating tokens
- `stopSequences` (`List<String>`, optional): Sequences that end generation when produced
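The earlier examples do not exercise `stopSequences`. A hedged sketch, assuming generation halts as soon as one of the listed strings appears in the output:

```dart
final params = GenerationParams(
  prompt: 'List three fruits:\n1.',
  temperature: 0.7,
  maxTokens: 128,
  // Stop early if the model starts a fourth item or emits a blank line.
  stopSequences: ['4.', '\n\n'],
);

final response = await llama.generate(params);
print(response.text);
```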
Example App #
See the example directory for a complete example application.
Performance Tips #
GPU Acceleration (NEW in v1.0.0!) #
GPU acceleration is now fully enabled and automatic! A complete configuration sketch follows the tips below.
- iOS/macOS (Metal): set `nGpuLayers: -1` (all layers on the Metal GPU) and `useGpu: true` (Metal is detected automatically).
  - Performance: 3-10x faster than CPU
  - Devices: iPhone 8+, iPad Pro, MacBook Pro
  - Expected: ~45-50 tok/s on iPhone 14 Pro
- Android (Vulkan/OpenCL): set `nGpuLayers: -1` (all layers on GPU) and `useGpu: true` (Vulkan is auto-detected, with OpenCL as a fallback).
  - Performance: 4-8x faster than CPU (Vulkan), 2-5x (OpenCL)
  - Devices: Android 7.0+ with Vulkan support
  - Expected: ~18-25 tok/s on flagship devices
  - Fallback: Automatically uses OpenCL if Vulkan is unavailable
- Threading (CPU fallback): Adjust `nThreads` based on the device:
  - Mobile devices: 4-6 threads
  - High-end devices: 6-8 threads
- Model Size: Use braindler quantized models for better performance:
  - braindler:q2_k (72MB): Smallest, fastest, good quality
  - braindler:q4_k_s (88MB): Recommended - optimal balance
  - braindler:q5_k_m (103MB): Higher quality, larger size
  - braindler:q8_0 (140MB): Best quality, largest size
  Get them from: https://ollama.com/nativemind/braindler
- Context Size: Reduce it if memory is limited:
  - Small devices: 1024-2048
  - Medium devices: 2048-4096
  - High-end devices: 4096-8192
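Putting these tips together, here is a hedged configuration sketch for a GPU-capable device. The fields are the `LlamaConfig` options documented above; the thread-count heuristic based on `Platform.numberOfProcessors` is an illustrative assumption, not something the plugin requires:

```dart
import 'dart:io' show Platform;
import 'package:flutter_llama/flutter_llama.dart';

// Keep the CPU fallback in the 4-6 thread range suggested for mobile devices.
final cores = Platform.numberOfProcessors;
final threads = cores > 6 ? 6 : (cores < 4 ? 4 : cores);

final gpuConfig = LlamaConfig(
  modelPath: '/path/to/braindler-q4_k_s.gguf',
  nGpuLayers: -1,     // offload all layers to Metal/Vulkan/OpenCL
  useGpu: true,       // the plugin falls back to CPU if no GPU backend is available
  nThreads: threads,  // only used on the CPU fallback path
  contextSize: 2048,  // raise on high-end devices if you need longer prompts
  batchSize: 512,
);

final loaded = await FlutterLlama.instance.loadModel(gpuConfig);
```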
Requirements #
iOS #
- iOS 13.0 or later
- Xcode 14.0 or later
- Metal support for GPU acceleration
Android #
- Android API level 24 (Android 7.0) or later
- NDK r25 or later
- Vulkan support for GPU acceleration (optional)
Building #
The plugin includes native C++ code for llama.cpp integration.
iOS #
The iOS framework will be built automatically when you build your Flutter app.
Android #
The Android native library will be built automatically using CMake/NDK.
GGUF Models #
This plugin supports the GGUF model format. We recommend using the braindler model:
Recommended: Braindler Model #
Get the braindler model from Ollama:
Available quantizations:
- `braindler:q2_k` - 72MB - Fastest, good quality
- `braindler:q3_k_s` - 77MB - Better quality
- `braindler:q4_k_s` - 88MB - Recommended - Optimal balance
- `braindler:q5_k_m` - 103MB - Higher quality
- `braindler:q8_0` - 140MB - Best quality
- `braindler:f16` - 256MB - Maximum quality
How to get braindler models:
- Install Ollama: https://ollama.com
- Pull the model: `ollama pull nativemind/braindler:q4_k_s`
- Export to GGUF: `ollama export nativemind/braindler:q4_k_s -o braindler-q4_k_s.gguf`
- Copy the GGUF file to your app's assets or documents directory (a sketch follows below)
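If you ship the model as a Flutter asset, it must be copied to a real file path before `LlamaConfig.modelPath` can point at it. A minimal sketch, assuming the GGUF file is declared under `assets/` in pubspec.yaml and that the `path_provider` package has been added as a dependency:

```dart
import 'dart:io';

import 'package:flutter/services.dart' show rootBundle;
import 'package:path_provider/path_provider.dart';

/// Copies the bundled GGUF asset into the documents directory (once)
/// and returns an absolute path usable as LlamaConfig.modelPath.
Future<String> materializeModel() async {
  final docs = await getApplicationDocumentsDirectory();
  final file = File('${docs.path}/braindler-q4_k_s.gguf');

  if (!await file.exists()) {
    final data = await rootBundle.load('assets/braindler-q4_k_s.gguf');
    await file.writeAsBytes(
      data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes),
      flush: true,
    );
  }
  return file.path;
}
```

Note that bundling the model as an asset grows the app download by the model's size; downloading the file on first launch is a common alternative.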
Other model sources: #
Any GGUF model that works with llama.cpp should also work with this plugin; community GGUF builds are widely published on Hugging Face.
Limitations #
- Model file must be accessible on device storage
- Large models require significant RAM
- Generation speed depends on device capabilities
- Some older devices may not support GPU acceleration
License #
This plugin is released under the NativeMindNONC License. See LICENSE file for details.
llama.cpp is released under the MIT License.
Contributing #
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments #
- llama.cpp by Georgi Gerganov
- The Flutter team for the excellent framework
- Braindler model from Ollama - recommended model for mobile devices