flutter_llama #

Flutter plugin for running LLM inference with llama.cpp and GGUF models on Android, iOS, and macOS.

πŸŽ‰ Version 1.0.0 - Production Ready with Full GPU Acceleration! #

Features #

  • πŸš€ High-performance LLM inference using llama.cpp
  • πŸ“± Native support for Android, iOS, and macOS
  • ⚑ Full GPU acceleration:
    • Metal for iOS/macOS (3-10x faster)
    • Vulkan for Android (4-8x faster)
    • OpenCL fallback (2-5x faster)
    • Automatic GPU detection and fallback
  • πŸ”„ Streaming and blocking text generation
  • 🎯 Full control over generation parameters
  • πŸ“¦ GGUF model format (industry standard)
  • πŸ›  Easy to integrate and production-ready
  • βœ… Covered by 71 unit tests

Installation #

Add this to your package's pubspec.yaml file:

dependencies:
  flutter_llama: ^1.1.2

Then run:

flutter pub get

Usage #

1. Load a Model #

We recommend the braindler model from Ollama: a compact, efficient model well suited to mobile devices.

import 'package:flutter_llama/flutter_llama.dart';

final llama = FlutterLlama.instance;

// Using braindler Q4_K_S quantization (88MB - optimal balance)
final config = LlamaConfig(
  modelPath: '/path/to/braindler-q4_k_s.gguf',  // braindler from ollama.com/nativemind/braindler
  nThreads: 4,
  nGpuLayers: 0,  // 0 = CPU only, -1 = all layers on GPU
  contextSize: 2048,
  batchSize: 512,
  useGpu: true,
  verbose: false,
);

final success = await llama.loadModel(config);
if (success) {
  print('Braindler model loaded successfully!');
}
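
The plugin reads the model from a filesystem path, so a model bundled as a Flutter asset must first be copied onto device storage. Below is a minimal sketch using path_provider (one of this plugin's dependencies); the asset path and helper name are hypothetical placeholders:

import 'dart:io';

import 'package:flutter/services.dart' show rootBundle;
import 'package:path_provider/path_provider.dart';

/// Copies a bundled GGUF asset to the documents directory (once)
/// and returns its filesystem path.
Future<String> materializeModel(String assetPath) async {
  final docs = await getApplicationDocumentsDirectory();
  final file = File('${docs.path}/${assetPath.split('/').last}');
  if (!await file.exists()) {
    // Loads the whole asset into memory; fine for compact models like braindler.
    final data = await rootBundle.load(assetPath);
    await file.writeAsBytes(
      data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes),
      flush: true,
    );
  }
  return file.path;
}

// Usage (hypothetical asset declared under flutter/assets in pubspec.yaml):
// final modelPath = await materializeModel('assets/models/braindler-q4_k_s.gguf');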

2. Generate Text (Blocking) #

final params = GenerationParams(
  prompt: 'Hello, how are you?',
  temperature: 0.8,
  topP: 0.95,
  topK: 40,
  maxTokens: 512,
  repeatPenalty: 1.1,
);

try {
  final response = await llama.generate(params);
  print('Generated: ${response.text}');
  print('Tokens: ${response.tokensGenerated}');
  print('Speed: ${response.tokensPerSecond.toStringAsFixed(2)} tok/s');
} catch (e) {
  print('Error: $e');
}

3. Generate Text (Streaming) #

final params = GenerationParams(
  prompt: 'Tell me a story',
  maxTokens: 1000,
);

try {
  await for (final token in llama.generateStream(params)) {
    print(token); // Print each token as it's generated
  }
} catch (e) {
  print('Error: $e');
}
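
Streamed tokens are usually accumulated for incremental display. A minimal sketch; the commented setState call assumes a StatefulWidget and is illustrative only:

final buffer = StringBuffer();
await for (final token in llama.generateStream(params)) {
  buffer.write(token); // append each token as it arrives
  // setState(() => _output = buffer.toString()); // update the UI incrementally
}
print('Full response: $buffer');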

4. Get Model Info #

final info = await llama.getModelInfo();
if (info != null) {
  print('Model path: ${info['modelPath']}');
  print('Parameters: ${info['nParams']}');
  print('Layers: ${info['nLayers']}');
  print('Context size: ${info['contextSize']}');
}

5. Unload Model #

await llama.unloadModel();
print('Model unloaded');
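
A loaded model holds native memory, so unloading in a finally block keeps it from leaking when generation throws. A minimal sketch, assuming unloadModel is safe to call after a failed generate:

final loaded = await llama.loadModel(config);
if (!loaded) return;
try {
  final response = await llama.generate(params);
  print(response.text);
} finally {
  await llama.unloadModel(); // release native memory even if generate throws
}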

Configuration Options #

LlamaConfig #

  • modelPath (String, required): Path to the GGUF model file
  • nThreads (int, default: 4): Number of CPU threads to use
  • nGpuLayers (int, default: 0): Number of layers to offload to GPU (0 = CPU only, -1 = all)
  • contextSize (int, default: 2048): Context size in tokens
  • batchSize (int, default: 512): Batch size for processing
  • useGpu (bool, default: true): Enable GPU acceleration
  • verbose (bool, default: false): Enable verbose logging

GenerationParams #

  • prompt (String, required): The prompt for text generation
  • temperature (double, default: 0.8): Sampling temperature (0.0 - 2.0)
  • topP (double, default: 0.95): Top-P sampling (0.0 - 1.0)
  • topK (int, default: 40): Top-K sampling
  • maxTokens (int, default: 512): Maximum tokens to generate
  • repeatPenalty (double, default: 1.1): Penalty for repeating tokens
  • stopSequences (List&lt;String&gt;, optional): Sequences that stop generation when encountered
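
For example, a minimal sketch of stopping at the first blank line, assuming stop sequences are matched against the generated text verbatim:

final params = GenerationParams(
  prompt: 'List three fruits:',
  maxTokens: 64,
  stopSequences: ['\n\n'], // stop at the first blank line
);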

Example App #

See the example directory for a complete example application.

Performance Tips #

πŸš€ GPU Acceleration (since v1.0.0) #

GPU acceleration is fully enabled and automatic. An explicit fallback-loading sketch follows the tips below.

  1. iOS/macOS (Metal):

    nGpuLayers: -1,  // -1 = all layers on Metal GPU
    useGpu: true,    // Metal automatically detected
    
    • Performance: 3-10x faster than CPU
    • Devices: iPhone 8+, iPad Pro, MacBook Pro
    • Expected: ~45-50 tok/s on iPhone 14 Pro
  2. Android (Vulkan/OpenCL):

    nGpuLayers: -1,  // -1 = all layers on GPU
    useGpu: true,    // Vulkan auto-detected, fallback to OpenCL
    
    • Performance: 4-8x faster than CPU (Vulkan), 2-5x (OpenCL)
    • Devices: Android 7.0+ with Vulkan support
    • Expected: ~18-25 tok/s on flagship devices
    • Fallback: Automatically uses OpenCL if Vulkan unavailable
  3. Threading (CPU fallback): Adjust nThreads based on device:

    • Mobile devices: 4-6 threads
    • High-end devices: 6-8 threads
  4. Model Size: Use braindler quantized models for better performance:

    • braindler:q2_k (72MB): Smallest, fastest, good quality
    • braindler:q4_k_s (88MB): ⭐ Recommended - Optimal balance
    • braindler:q5_k_m (103MB): Higher quality, larger size
    • braindler:q8_0 (140MB): Best quality, largest size

    Get from: https://ollama.com/nativemind/braindler

  5. Context Size: Reduce if memory is limited:

    • Small devices: 1024-2048
    • Medium devices: 2048-4096
    • High-end devices: 4096-8192
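
The plugin already falls back from Vulkan to OpenCL to CPU internally, but a load failure can also be handled explicitly, since loadModel returns a success flag. A minimal sketch; modelPath is assumed to hold a valid GGUF path:

final gpuConfig = LlamaConfig(
  modelPath: modelPath,
  nGpuLayers: -1, // try to offload every layer
  useGpu: true,
);

var loaded = await llama.loadModel(gpuConfig);
if (!loaded) {
  // Explicit CPU-only retry for devices without Metal/Vulkan/OpenCL support
  loaded = await llama.loadModel(LlamaConfig(
    modelPath: modelPath,
    nGpuLayers: 0,
    useGpu: false,
    nThreads: 6,
  ));
}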

Requirements #

iOS #

  • iOS 13.0 or later
  • Xcode 14.0 or later
  • Metal support for GPU acceleration

Android #

  • Android API level 24 (Android 7.0) or later
  • NDK r25 or later
  • Vulkan support for GPU acceleration (optional)

Building #

The plugin includes native C++ code for llama.cpp integration.

iOS #

The iOS framework will be built automatically when you build your Flutter app.

Android #

The Android native library will be built automatically using CMake/NDK.

GGUF Models #

This plugin supports the GGUF model format. We recommend the braindler model:

Get the braindler model from Ollama: https://ollama.com/nativemind/braindler

Available quantizations:

  • braindler:q2_k - 72MB - Fastest, good quality
  • braindler:q3_k_s - 77MB - Better quality
  • braindler:q4_k_s - 88MB - ⭐ Recommended - Optimal balance
  • braindler:q5_k_m - 103MB - Higher quality
  • braindler:q8_0 - 140MB - Best quality
  • braindler:f16 - 256MB - Maximum quality

How to get braindler models:

  1. Install Ollama: https://ollama.com
  2. Pull the model: ollama pull nativemind/braindler:q4_k_s
  3. Export to GGUF: ollama export nativemind/braindler:q4_k_s -o braindler-q4_k_s.gguf
  4. Copy the GGUF file to your app's assets or documents directory (a runtime-download sketch follows this list)
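
Models can also be fetched at runtime rather than bundled. A minimal sketch using the http and path_provider packages (both plugin dependencies); the helper name is hypothetical and the URL you pass is your own hosting:

import 'dart:io';

import 'package:http/http.dart' as http;
import 'package:path_provider/path_provider.dart';

/// Downloads a GGUF model to the documents directory if not already present.
Future<String> downloadModel(Uri url, String fileName) async {
  final docs = await getApplicationDocumentsDirectory();
  final file = File('${docs.path}/$fileName');
  if (!await file.exists()) {
    // Buffers the whole body in memory; use a streamed request for large models.
    final response = await http.get(url);
    if (response.statusCode != 200) {
      throw HttpException('Download failed: ${response.statusCode}');
    }
    await file.writeAsBytes(response.bodyBytes, flush: true);
  }
  return file.path;
}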

Other model sources #

Any GGUF-format model should work with this plugin; quantized GGUF builds of many open models are published on Hugging Face (https://huggingface.co).

Limitations #

  • Model file must be accessible on device storage
  • Large models require significant RAM
  • Generation speed depends on device capabilities
  • Some older devices may not support GPU acceleration

License #

This plugin is released under the NativeMindNONC License. See LICENSE file for details.

llama.cpp is released under the MIT License.

Contributing #

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments #

  • llama.cpp by Georgi Gerganov
  • The Flutter team for the excellent framework
  • Braindler model from Ollama - recommended model for mobile devices