flutter_llama

Flutter plugin for running LLM inference with llama.cpp and GGUF models on Android and iOS.

Features

  • πŸš€ High-performance LLM inference using llama.cpp
  • πŸ“± Native support for Android and iOS
  • ⚑ GPU acceleration (Metal on iOS, Vulkan/OpenCL on Android)
  • πŸ”„ Streaming and blocking text generation
  • 🎯 Full control over generation parameters
  • πŸ“¦ Support for GGUF model format
  • πŸ›  Easy to integrate and use

Installation

Add this to your package's pubspec.yaml file:

dependencies:
  flutter_llama:
    path: ../flutter_llama  # Adjust path as needed

Then run:

flutter pub get

Usage

1. Load a Model

We recommend the braindler model from Ollama, a compact and efficient model well suited to mobile devices.

import 'package:flutter_llama/flutter_llama.dart';

final llama = FlutterLlama.instance;

// Using braindler q4_k_s quantization (88MB, optimal balance)
final config = LlamaConfig(
  modelPath: '/path/to/braindler-q4_k_s.gguf',  // braindler from ollama.com/nativemind/braindler
  nThreads: 4,
  nGpuLayers: 0,  // 0 = CPU only, -1 = all layers on GPU
  contextSize: 2048,
  batchSize: 512,
  useGpu: true,
  verbose: false,
);

final success = await llama.loadModel(config);
if (success) {
  print('Braindler model loaded successfully!');
}

2. Generate Text (Blocking)

final params = GenerationParams(
  prompt: 'Hello, how are you?',
  temperature: 0.8,
  topP: 0.95,
  topK: 40,
  maxTokens: 512,
  repeatPenalty: 1.1,
);

try {
  final response = await llama.generate(params);
  print('Generated: ${response.text}');
  print('Tokens: ${response.tokensGenerated}');
  print('Speed: ${response.tokensPerSecond.toStringAsFixed(2)} tok/s');
} catch (e) {
  print('Error: $e');
}

3. Generate Text (Streaming)

final params = GenerationParams(
  prompt: 'Tell me a story',
  maxTokens: 1000,
);

try {
  await for (final token in llama.generateStream(params)) {
    print(token); // Print each token as it's generated
  }
} catch (e) {
  print('Error: $e');
}
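
A few extra lines of ordinary Dart stream handling turn the token stream into UI-friendly state; a minimal sketch is shown below. Note that whether cancelling the subscription also stops native generation depends on the plugin's implementation:

final buffer = StringBuffer();
final subscription = llama
    .generateStream(GenerationParams(prompt: 'Tell me a story', maxTokens: 1000))
    .listen(
      (token) {
        buffer.write(token);
        // e.g. setState(() => _output = buffer.toString());
      },
      onError: (e) => print('Error: $e'),
      onDone: () => print('Done, ${buffer.length} characters'),
    );

// Later, e.g. when the user leaves the screen:
await subscription.cancel();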

4. Get Model Info

final info = await llama.getModelInfo();
if (info != null) {
  print('Model path: ${info['modelPath']}');
  print('Parameters: ${info['nParams']}');
  print('Layers: ${info['nLayers']}');
  print('Context size: ${info['contextSize']}');
}

5. Unload Model

await llama.unloadModel();
print('Model unloaded');
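
Putting the calls above together, a minimal lifecycle sketch (config as defined in step 1); the try/finally ensures the model is unloaded even if generation throws:

final llama = FlutterLlama.instance;

if (!await llama.loadModel(config)) return;

try {
  final response = await llama.generate(
    GenerationParams(prompt: 'Hello!', maxTokens: 64),
  );
  print(response.text);
} finally {
  await llama.unloadModel(); // free native memory even on error
}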

Configuration Options

LlamaConfig

  • modelPath (String, required): Path to the GGUF model file
  • nThreads (int, default: 4): Number of CPU threads to use
  • nGpuLayers (int, default: 0): Number of layers to offload to GPU (0 = CPU only, -1 = all)
  • contextSize (int, default: 2048): Context size in tokens
  • batchSize (int, default: 512): Batch size for processing
  • useGpu (bool, default: true): Enable GPU acceleration
  • verbose (bool, default: false): Enable verbose logging

GenerationParams

  • prompt (String, required): The prompt for text generation
  • temperature (double, default: 0.8): Sampling temperature (0.0 - 2.0)
  • topP (double, default: 0.95): Top-P sampling (0.0 - 1.0)
  • topK (int, default: 40): Top-K sampling
  • maxTokens (int, default: 512): Maximum tokens to generate
  • repeatPenalty (double, default: 1.1): Penalty for repeating tokens
  • stopSequences (List&lt;String&gt;, optional): Sequences that stop generation when encountered
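
For example, a short sketch using stop sequences to end generation at a turn boundary, assuming stopSequences accepts a List of strings as described above:

final params = GenerationParams(
  prompt: 'User: What is GGUF?\nAssistant:',
  maxTokens: 256,
  stopSequences: ['\nUser:'],  // stop before the next user turn
);

final response = await llama.generate(params);
print(response.text);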

Example App

See the example directory for a complete example application.

Performance Tips

  1. GPU Acceleration: Set nGpuLayers to offload layers to GPU:

    • iOS (Metal): Set to -1 for all layers
    • Android (Vulkan): Start with 32 and adjust based on device
  2. Threading: Adjust nThreads to the device's CPU core count (see the tuning sketch after this list):

    • Mobile devices: 4-6 threads
    • High-end devices: 6-8 threads
  3. Model Size: Use braindler quantized models for better performance:

    • braindler:q2_k (72MB): Smallest, fastest, good quality
    • braindler:q4_k_s (88MB): ⭐ Recommended - Optimal balance
    • braindler:q5_k_m (103MB): Higher quality, larger size
    • braindler:q8_0 (140MB): Best quality, largest size

    Get from: https://ollama.com/nativemind/braindler

  4. Context Size: Reduce if memory is limited:

    • Small devices: 1024-2048
    • Medium devices: 2048-4096
    • High-end devices: 4096-8192
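
A minimal sketch combining tips 1, 2, and 4, building a LlamaConfig tuned to the current device; the thresholds below are illustrative assumptions, not measured values:

import 'dart:io' show Platform;
import 'dart:math' show max, min;

import 'package:flutter_llama/flutter_llama.dart';

/// Builds a LlamaConfig tuned to the current device.
/// The thread and layer heuristics here are illustrative guesses.
LlamaConfig tunedConfig(String modelPath) {
  // Leave a couple of cores free for the UI thread and the OS.
  final cores = Platform.numberOfProcessors;
  final threads = min(8, max(2, cores - 2));

  // Metal handles full offload well on iOS; on Android, start with
  // a partial offload and adjust per device.
  final gpuLayers = Platform.isIOS ? -1 : 32;

  return LlamaConfig(
    modelPath: modelPath,
    nThreads: threads,
    nGpuLayers: gpuLayers,
    contextSize: 2048, // raise on high-end devices, lower if memory is tight
    batchSize: 512,
    useGpu: true,
  );
}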

Requirements

iOS

  • iOS 13.0 or later
  • Xcode 14.0 or later
  • Metal support for GPU acceleration

Android

  • Android API level 24 (Android 7.0) or later
  • NDK r25 or later
  • Vulkan support for GPU acceleration (optional)

Building

The plugin includes native C++ code for llama.cpp integration.

iOS

The iOS framework will be built automatically when you build your Flutter app.

Android

The Android native library will be built automatically using CMake/NDK.

GGUF Models

This plugin supports the GGUF model format. We recommend the braindler model, available from Ollama in the following quantizations:

  • braindler:q2_k - 72MB - Fastest, good quality
  • braindler:q3_k_s - 77MB - Better quality
  • braindler:q4_k_s - 88MB - ⭐ Recommended - Optimal balance
  • braindler:q5_k_m - 103MB - Higher quality
  • braindler:q8_0 - 140MB - Best quality
  • braindler:f16 - 256MB - Maximum quality

How to get braindler models:

  1. Install Ollama: https://ollama.com
  2. Pull the model: ollama pull nativemind/braindler:q4_k_s
  3. Copy the model's GGUF blob from Ollama's local store (under ~/.ollama/models/blobs/) and rename it to braindler-q4_k_s.gguf
  4. Copy the GGUF file to your app's assets or documents directory
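
A sketch of step 4 in Dart, assuming the model is bundled as a Flutter asset and the path_provider package is in your dependencies:

import 'dart:io';

import 'package:flutter/services.dart' show rootBundle;
import 'package:path_provider/path_provider.dart';

/// Copies the bundled model into the documents directory once and
/// returns its on-device path for use as LlamaConfig.modelPath.
Future<String> copyModelFromAssets() async {
  final docs = await getApplicationDocumentsDirectory();
  final file = File('${docs.path}/braindler-q4_k_s.gguf');
  if (!await file.exists()) {
    final data = await rootBundle.load('assets/braindler-q4_k_s.gguf');
    await file.writeAsBytes(
      data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes),
      flush: true,
    );
  }
  return file.path;
}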

Other model sources:

  • Hugging Face hosts many GGUF models: https://huggingface.co/models?library=gguf

Limitations

  • Model file must be accessible on device storage
  • Large models require significant RAM
  • Generation speed depends on device capabilities
  • Some older devices may not support GPU acceleration

License

This plugin is released under the NativeMindNONC License. See LICENSE file for details.

llama.cpp is released under the MIT License.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

  • llama.cpp by Georgi Gerganov
  • The Flutter team for the excellent framework
  • Braindler model from Ollama - recommended model for mobile devices