flutter_llama 0.1.1


flutter_llama #

Flutter plugin for running LLM inference with llama.cpp and GGUF models on Android and iOS.

Features #

  • πŸš€ High-performance LLM inference using llama.cpp
  • πŸ“± Native support for Android and iOS
  • ⚑ GPU acceleration (Metal on iOS, Vulkan/OpenCL on Android)
  • πŸ”„ Streaming and blocking text generation
  • 🎯 Full control over generation parameters
  • πŸ“¦ Support for GGUF model format
  • πŸ›  Easy to integrate and use

Installation #

Add this to your package's pubspec.yaml file:

dependencies:
  flutter_llama: ^0.1.1

Then run:

flutter pub get

Usage #

1. Load a Model #

We recommend the braindler model from Ollama, a compact and efficient model well suited to mobile devices.

import 'package:flutter_llama/flutter_llama.dart';

final llama = FlutterLlama.instance;

// Using braindler Q4_K_S quantization (88MB, optimal balance)
final config = LlamaConfig(
  modelPath: '/path/to/braindler-q4_k_s.gguf',  // braindler from ollama.com/nativemind/braindler
  nThreads: 4,
  nGpuLayers: 0,  // 0 = CPU only, -1 = all layers on GPU
  contextSize: 2048,
  batchSize: 512,
  useGpu: true,
  verbose: false,
);

final success = await llama.loadModel(config);
if (success) {
  print('Braindler model loaded successfully!');
}

2. Generate Text (Blocking) #

final params = GenerationParams(
  prompt: 'Hello, how are you?',
  temperature: 0.8,
  topP: 0.95,
  topK: 40,
  maxTokens: 512,
  repeatPenalty: 1.1,
);

try {
  final response = await llama.generate(params);
  print('Generated: ${response.text}');
  print('Tokens: ${response.tokensGenerated}');
  print('Speed: ${response.tokensPerSecond.toStringAsFixed(2)} tok/s');
} catch (e) {
  print('Error: $e');
}

3. Generate Text (Streaming) #

final params = GenerationParams(
  prompt: 'Tell me a story',
  maxTokens: 1000,
);

try {
  await for (final token in llama.generateStream(params)) {
    print(token); // Print each token as it's generated
  }
} catch (e) {
  print('Error: $e');
}

4. Get Model Info #

final info = await llama.getModelInfo();
if (info != null) {
  print('Model path: ${info['modelPath']}');
  print('Parameters: ${info['nParams']}');
  print('Layers: ${info['nLayers']}');
  print('Context size: ${info['contextSize']}');
}

5. Unload Model #

await llama.unloadModel();
print('Model unloaded');

Configuration Options #

LlamaConfig #

  • modelPath (String, required): Path to the GGUF model file
  • nThreads (int, default: 4): Number of CPU threads to use
  • nGpuLayers (int, default: 0): Number of layers to offload to GPU (0 = CPU only, -1 = all)
  • contextSize (int, default: 2048): Context size in tokens
  • batchSize (int, default: 512): Batch size for processing
  • useGpu (bool, default: true): Enable GPU acceleration
  • verbose (bool, default: false): Enable verbose logging
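
For example, offloading every layer to the GPU (per the Metal guidance in Performance Tips below) is a one-field change from the CPU-only config in step 1. This is a sketch using only the options listed above:

// Offload all layers to the GPU: Metal on iOS, Vulkan/OpenCL on Android.
final gpuConfig = LlamaConfig(
  modelPath: '/path/to/braindler-q4_k_s.gguf',
  nThreads: 4,
  nGpuLayers: -1,  // -1 = all layers on GPU, 0 = CPU only
  contextSize: 2048,
  batchSize: 512,
  useGpu: true,
);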

GenerationParams #

  • prompt (String, required): The prompt for text generation
  • temperature (double, default: 0.8): Sampling temperature (0.0 - 2.0)
  • topP (double, default: 0.95): Top-P sampling (0.0 - 1.0)
  • topK (int, default: 40): Top-K sampling
  • maxTokens (int, default: 512): Maximum tokens to generate
  • repeatPenalty (double, default: 1.1): Penalty for repeating tokens
  • stopSequences (List<String>, optional): Sequences that stop generation when encountered
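
As a sketch, a fully specified GenerationParams using every option above looks like this (the exact type and matching behavior of stopSequences are assumed from the parameter list, not verified against the API):

// stopSequences is assumed to end generation as soon as any listed
// string appears in the output.
final params = GenerationParams(
  prompt: 'Summarize the following text:',
  temperature: 0.8,
  topP: 0.95,
  topK: 40,
  maxTokens: 512,
  repeatPenalty: 1.1,
  stopSequences: ['\n\n', '</s>'],
);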

Example App #

See the example directory for a complete example application.

Performance Tips #

  1. GPU Acceleration: Set nGpuLayers to offload layers to GPU (see the combined sketch after this list):

    • iOS (Metal): Set to -1 for all layers
    • Android (Vulkan): Start with 32 and adjust based on device
  2. Threading: Adjust nThreads based on device CPU cores:

    • Mobile devices: 4-6 threads
    • High-end devices: 6-8 threads
  3. Model Size: Use braindler quantized models for better performance:

    • braindler:q2_k (72MB): Smallest, fastest, good quality
    • braindler:q4_k_s (88MB): ⭐ Recommended - Optimal balance
    • braindler:q5_k_m (103MB): Higher quality, larger size
    • braindler:q8_0 (140MB): Best quality, largest size

    Get from: https://ollama.com/nativemind/braindler

  4. Context Size: Reduce if memory is limited:

    • Small devices: 1024-2048
    • Medium devices: 2048-4096
    • High-end devices: 4096-8192
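
Combining tips 1, 2, and 4, a device-adaptive configuration might look like this sketch (Platform comes from dart:io; the layer and thread counts are the starting points suggested above, not benchmarked values):

import 'dart:io';
import 'package:flutter_llama/flutter_llama.dart';

LlamaConfig adaptiveConfig(String modelPath) {
  // iOS (Metal): offload all layers; Android (Vulkan): start with 32.
  final gpuLayers = Platform.isIOS ? -1 : 32;
  return LlamaConfig(
    modelPath: modelPath,
    nThreads: 4,        // raise to 6-8 on high-end devices
    nGpuLayers: gpuLayers,
    contextSize: 2048,  // drop to 1024 if memory is tight
    batchSize: 512,
    useGpu: true,
  );
}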

Requirements #

iOS #

  • iOS 13.0 or later
  • Xcode 14.0 or later
  • Metal support for GPU acceleration

Android #

  • Android API level 24 (Android 7.0) or later
  • NDK r25 or later
  • Vulkan support for GPU acceleration (optional)

Building #

The plugin includes native C++ code for llama.cpp integration.

iOS #

The iOS framework will be built automatically when you build your Flutter app.

Android #

The Android native library will be built automatically using CMake/NDK.

GGUF Models #

This plugin supports the GGUF model format. We recommend the braindler model from Ollama, available in the following quantizations:

  • braindler:q2_k - 72MB - Fastest, good quality
  • braindler:q3_k_s - 77MB - Better quality
  • braindler:q4_k_s - 88MB - ⭐ Recommended - Optimal balance
  • braindler:q5_k_m - 103MB - Higher quality
  • braindler:q8_0 - 140MB - Best quality
  • braindler:f16 - 256MB - Maximum quality

How to get braindler models:

  1. Install Ollama: https://ollama.com
  2. Pull the model: ollama pull nativemind/braindler:q4_k_s
  3. Export to GGUF: ollama export nativemind/braindler:q4_k_s -o braindler-q4_k_s.gguf
  4. Copy the GGUF file to your app's assets or documents directory
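
If you copy the GGUF file into the app's documents directory (step 4 above), you can resolve its path at runtime with the path_provider package. This sketch assumes path_provider is added to your pubspec and that omitted LlamaConfig options fall back to the defaults listed under Configuration Options:

import 'package:flutter_llama/flutter_llama.dart';
import 'package:path_provider/path_provider.dart';

Future<void> loadFromDocuments() async {
  // Resolve the platform-specific documents directory at runtime.
  final dir = await getApplicationDocumentsDirectory();
  final config = LlamaConfig(
    modelPath: '${dir.path}/braindler-q4_k_s.gguf',
  );
  final ok = await FlutterLlama.instance.loadModel(config);
  print(ok ? 'Model loaded' : 'Failed to load model');
}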

Other Model Sources #

GGUF models are also widely available on Hugging Face (https://huggingface.co).

Limitations #

  • Model file must be accessible on device storage (a quick check is sketched after this list)
  • Large models require significant RAM
  • Generation speed depends on device capabilities
  • Some older devices may not support GPU acceleration
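
Because the model file must be readable on device storage, an existence check before loadModel gives a clearer error than a native-side failure. A minimal sketch using dart:io:

import 'dart:io';

// Fail fast with a readable error if the model file is missing.
final modelFile = File('/path/to/braindler-q4_k_s.gguf');
if (!modelFile.existsSync()) {
  throw StateError('Model not found at ${modelFile.path}');
}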

License #

This plugin is released under the NativeMindNONC License. See LICENSE file for details.

llama.cpp is released under the MIT License.

Contributing #

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments #

  • llama.cpp by Georgi Gerganov
  • The Flutter team for the excellent framework
  • Braindler model from Ollama - recommended model for mobile devices