flutter_llama

Flutter plugin for running LLM inference with llama.cpp and GGUF models on Android, iOS, and macOS.

🎉 Version 1.0.0 - Production Ready with Full GPU Acceleration!

Features

  • 🚀 High-performance LLM inference using llama.cpp
  • 📱 Native support for Android, iOS, and macOS
  • ⚡ Full GPU acceleration:
    • Metal for iOS/macOS (3-10x faster)
    • Vulkan for Android (4-8x faster)
    • OpenCL fallback (2-5x faster)
    • Automatic GPU detection and fallback
  • 🔄 Streaming and blocking text generation
  • 🎯 Full control over generation parameters
  • 📦 GGUF model format (industry standard)
  • 🛠 Easy to integrate and production-ready
  • ✅ Covered by 71 unit tests

Installation

Add this to your package's pubspec.yaml file:

dependencies:
  flutter_llama:
    path: ../flutter_llama  # Adjust path as needed

Then run:

flutter pub get

Usage

1. Load a Model

We recommend using the braindler model from Ollama - a compact and efficient model perfect for mobile devices.

import 'package:flutter_llama/flutter_llama.dart';

final llama = FlutterLlama.instance;

// Using braindler Q4_K_S quantization (88MB - optimal balance)
final config = LlamaConfig(
  modelPath: '/path/to/braindler-q4_k_s.gguf',  // braindler from ollama.com/nativemind/braindler
  nThreads: 4,
  nGpuLayers: 0,  // 0 = CPU only, -1 = all layers on GPU
  contextSize: 2048,
  batchSize: 512,
  useGpu: true,
  verbose: false,
);

final success = await llama.loadModel(config);
if (success) {
  print('Braindler model loaded successfully!');
}

2. Generate Text (Blocking)

final params = GenerationParams(
  prompt: 'Hello, how are you?',
  temperature: 0.8,
  topP: 0.95,
  topK: 40,
  maxTokens: 512,
  repeatPenalty: 1.1,
);

try {
  final response = await llama.generate(params);
  print('Generated: ${response.text}');
  print('Tokens: ${response.tokensGenerated}');
  print('Speed: ${response.tokensPerSecond.toStringAsFixed(2)} tok/s');
} catch (e) {
  print('Error: $e');
}

3. Generate Text (Streaming)

final params = GenerationParams(
  prompt: 'Tell me a story',
  maxTokens: 1000,
);

try {
  await for (final token in llama.generateStream(params)) {
    print(token); // Print each token as it's generated
  }
} catch (e) {
  print('Error: $e');
}
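
In a real app you would typically accumulate the streamed tokens and update the UI as they arrive. A minimal sketch, reusing the params above (the buffer handling is illustrative, not part of the plugin API):

final buffer = StringBuffer();

await for (final token in llama.generateStream(params)) {
  buffer.write(token);
  // In a StatefulWidget, call setState here to re-render with buffer.toString()
}

print('Full response: ${buffer.toString()}');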

4. Get Model Info

final info = await llama.getModelInfo();
if (info != null) {
  print('Model path: ${info['modelPath']}');
  print('Parameters: ${info['nParams']}');
  print('Layers: ${info['nLayers']}');
  print('Context size: ${info['contextSize']}');
}

5. Unload Model

await llama.unloadModel();
print('Model unloaded');
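
If a widget owns the model session, unloading it when the widget is disposed keeps native memory from leaking. A minimal sketch (the widget structure is illustrative; unawaited comes from dart:async):

@override
void dispose() {
  // Free the native model and context when the owning widget goes away.
  unawaited(FlutterLlama.instance.unloadModel());
  super.dispose();
}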

Configuration Options

LlamaConfig

  • modelPath (String, required): Path to the GGUF model file
  • nThreads (int, default: 4): Number of CPU threads to use
  • nGpuLayers (int, default: 0): Number of layers to offload to GPU (0 = CPU only, -1 = all)
  • contextSize (int, default: 2048): Context size in tokens
  • batchSize (int, default: 512): Batch size for processing
  • useGpu (bool, default: true): Enable GPU acceleration
  • verbose (bool, default: false): Enable verbose logging

GenerationParams

  • prompt (String, required): The prompt for text generation
  • temperature (double, default: 0.8): Sampling temperature (0.0 - 2.0)
  • topP (double, default: 0.95): Top-P sampling (0.0 - 1.0)
  • topK (int, default: 40): Top-K sampling
  • maxTokens (int, default: 512): Maximum tokens to generate
  • repeatPenalty (double, default: 1.1): Penalty for repeating tokens
  • stopSequences (List&lt;String&gt;, optional): Sequences that stop generation when the model produces them
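
For example, a minimal sketch that combines these parameters to end generation at a chat turn boundary (the prompt format and stop string are illustrative assumptions, not required by the plugin):

final params = GenerationParams(
  prompt: 'User: What is the GGUF format?\nAssistant:',
  temperature: 0.7,
  topP: 0.9,
  maxTokens: 256,
  stopSequences: ['\nUser:'],  // stop before the model starts a new user turn
);

final response = await llama.generate(params);
print(response.text);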

Example App

See the example directory for a complete example application.

Performance Tips

🚀 GPU Acceleration (NEW in v1.0.0!)

GPU acceleration is now fully enabled and automatic!

  1. iOS/macOS (Metal):

    nGpuLayers: -1,  // -1 = all layers on Metal GPU
    useGpu: true,    // Metal automatically detected
    
    • Performance: 3-10x faster than CPU
    • Devices: iPhone 8+, iPad Pro, MacBook Pro
    • Expected: ~45-50 tok/s on iPhone 14 Pro
  2. Android (Vulkan/OpenCL):

    nGpuLayers: -1,  // -1 = all layers on GPU
    useGpu: true,    // Vulkan auto-detected, fallback to OpenCL
    
    • Performance: 4-8x faster than CPU (Vulkan), 2-5x (OpenCL)
    • Devices: Android 7.0+ with Vulkan support
    • Expected: ~18-25 tok/s on flagship devices
    • Fallback: Automatically uses OpenCL if Vulkan unavailable
  3. Threading (CPU fallback): Adjust nThreads based on device (see the sketch after this list):

    • Mobile devices: 4-6 threads
    • High-end devices: 6-8 threads
  4. Model Size: Use braindler quantized models for better performance:

    • braindler:q2_k (72MB): Smallest, fastest, good quality
    • braindler:q4_k_s (88MB): ⭐ Recommended - Optimal balance
    • braindler:q5_k_m (103MB): Higher quality, larger size
    • braindler:q8_0 (140MB): Best quality, largest size

    Get from: https://ollama.com/nativemind/braindler

  5. Context Size: Reduce if memory is limited:

    • Small devices: 1024-2048
    • Medium devices: 2048-4096
    • High-end devices: 4096-8192
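
Putting these tips together, here is a minimal sketch of a helper that picks a starting configuration per platform. The thread count and context size are illustrative starting points taken from the guidance above, not benchmarks:

import 'dart:io' show Platform;
import 'package:flutter_llama/flutter_llama.dart';

LlamaConfig buildTunedConfig(String modelPath) {
  // Apple devices use Metal; Android uses Vulkan with an OpenCL fallback.
  final isApple = Platform.isIOS || Platform.isMacOS;
  return LlamaConfig(
    modelPath: modelPath,
    nGpuLayers: -1,                // try to offload all layers; falls back to CPU
    useGpu: true,
    nThreads: isApple ? 6 : 4,     // CPU threads for any layers left on the CPU
    contextSize: 2048,             // raise on high-end devices, lower if memory is tight
    batchSize: 512,
  );
}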

Requirements

iOS

  • iOS 13.0 or later
  • Xcode 14.0 or later
  • Metal support for GPU acceleration

Android

  • Android API level 24 (Android 7.0) or later
  • NDK r25 or later
  • Vulkan support for GPU acceleration (optional)

Building

The plugin includes native C++ code for llama.cpp integration.

iOS

The iOS framework will be built automatically when you build your Flutter app.

Android

The Android native library will be built automatically using CMake/NDK.

GGUF Models

This plugin supports the GGUF model format. We recommend using the braindler model:

Get the braindler model from Ollama: https://ollama.com/nativemind/braindler

Available quantizations:

  • braindler:q2_k - 72MB - Fastest, good quality
  • braindler:q3_k_s - 77MB - Better quality
  • braindler:q4_k_s - 88MB - ⭐ Recommended - Optimal balance
  • braindler:q5_k_m - 103MB - Higher quality
  • braindler:q8_0 - 140MB - Best quality
  • braindler:f16 - 256MB - Maximum quality

How to get braindler models:

  1. Install Ollama: https://ollama.com
  2. Pull the model: ollama pull nativemind/braindler:q4_k_s
  3. Export the weights to a GGUF file: Ollama stores models as GGUF blobs, so locate the blob (e.g. via ollama show nativemind/braindler:q4_k_s --modelfile, whose FROM line points into ~/.ollama/models/blobs) and copy it to braindler-q4_k_s.gguf
  4. Copy the GGUF file to your app's assets or documents directory
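
For step 4, a minimal sketch of copying a model bundled as a Flutter asset into the documents directory so the native code can open it from disk (assumes the file is declared under assets/ in pubspec.yaml and that path_provider is a dependency):

import 'dart:io';

import 'package:flutter/services.dart' show rootBundle;
import 'package:path_provider/path_provider.dart';

Future<String> copyModelFromAssets(String fileName) async {
  final docsDir = await getApplicationDocumentsDirectory();
  final target = File('${docsDir.path}/$fileName');

  // Only copy once; GGUF files are large.
  if (!await target.exists()) {
    final data = await rootBundle.load('assets/$fileName');
    await target.writeAsBytes(
      data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes),
      flush: true,
    );
  }
  return target.path;
}

// Usage: pass the returned path to LlamaConfig.modelPath, e.g.
// final modelPath = await copyModelFromAssets('braindler-q4_k_s.gguf');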

Other model sources: any GGUF model compatible with llama.cpp will also work, including the many quantized GGUF models published on Hugging Face.

Limitations

  • Model file must be accessible on device storage
  • Large models require significant RAM
  • Generation speed depends on device capabilities
  • Some older devices may not support GPU acceleration

License

This plugin is released under the NativeMindNONC License. See LICENSE file for details.

llama.cpp is released under the MIT License.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

  • llama.cpp by Georgi Gerganov
  • The Flutter team for the excellent framework
  • Braindler model from Ollama - recommended model for mobile devices