llm_toolkit 0.0.4

A comprehensive Flutter SDK for running Large Language Models (LLMs) locally on mobile and desktop devices. Supports multiple inference engines including Gemma (TFLite) and Llama (GGUF) with integrated model discovery, download, and chat capabilities.

LLM Toolkit SDK Changelog #

0.0.4 July 4, 2025 #

🎤 New Features - Speech Recognition (ASR) #

  • Complete ASR Integration - Full TensorFlow Lite ASR engine with Whisper model support
  • Real-time Speech Recognition - Live streaming transcription with continuous audio processing
  • Multi-format Audio Support - WAV, PCM16, PCM32, Float32 audio format compatibility
  • Advanced Audio Preprocessing - Whisper-compatible mel-spectrogram generation with FFT processing
  • Voice Activity Detection (VAD) - Smart silence detection and automatic recording termination
  • Chunked Audio Processing - Efficient handling of long audio files with overlap management
  • ASR Service Layer - High-level service wrapper for easy speech recognition integration
  • Microphone Integration - Native microphone access with permission handling

🔊 ASR Technical Features #

  • Enhanced Audio Pipeline - Optimized mel-filter bank computation with isolate processing
  • Adaptive Processing - Dynamic audio length optimization based on content characteristics
  • Silence Detection - Advanced RMS-based silence detection with configurable thresholds (sketched after this list)
  • Audio Resampling - High-quality interpolation-based audio resampling for format compatibility
  • Model Type Detection - Automatic ASR model type detection (Whisper, Wav2Vec2, DeepSpeech)
  • Streaming Optimization - Low-latency streaming with configurable chunk sizes and overlap
  • Memory Efficient - Optimized for mobile devices with minimal memory footprint
  • Error Recovery - Robust error handling with graceful fallbacks for audio processing
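
A minimal sketch of the RMS-based silence check mentioned above, assuming normalized floating-point samples; the windowing and threshold values llm_toolkit's VAD actually uses are not documented in this changelog and may differ:

```dart
import 'dart:math' as math;

/// Returns true when the RMS energy of [samples] (floats in [-1.0, 1.0])
/// falls below [threshold]. Illustrative default threshold only.
bool isSilent(List<double> samples, {double threshold = 0.01}) {
  if (samples.isEmpty) return true;
  var sumSquares = 0.0;
  for (final s in samples) {
    sumSquares += s * s;
  }
  return math.sqrt(sumSquares / samples.length) < threshold;
}
```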

🎯 ASR Configuration Options #

  • ASRConfig Class - Comprehensive configuration for sample rates, bit rates, and streaming parameters (example after this list)
  • Quality Presets - Pre-built configurations for high-quality, low-latency, mobile, and streaming scenarios
  • VAD Settings - Configurable voice activity detection with silence timeouts and thresholds
  • Performance Tuning - Adjustable confidence thresholds and token limits for optimal performance
  • Streaming Parameters - Customizable streaming intervals and chunk processing settings
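
ASRConfig is named in this release, but its exact fields are not shown in this changelog; the snippet below is a hypothetical illustration of the knobs listed above, not the verified API:

```dart
import 'package:llm_toolkit/llm_toolkit.dart'; // assumed import path

// Hypothetical field names; consult the API reference for the real
// ASRConfig surface.
final config = ASRConfig(
  sampleRate: 16000,                    // Whisper models expect 16 kHz input
  enableVAD: true,                      // voice activity detection
  silenceTimeout: Duration(seconds: 2), // stop after sustained silence
  confidenceThreshold: 0.5,             // drop low-confidence output
);
```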

🔧 ASR Service Methods #

  • File Transcription - transcribeFile() for processing audio files with enhanced chunking
  • Byte Transcription - transcribeBytes() for direct audio data processing
  • Live Recording - startRecording() / stopRecording() for real-time audio capture
  • Streaming Recognition - startStreamingTranscription() for continuous live transcription
  • VAD Recording - recordWithVAD() for intelligent voice-activated recording
  • Timed Recording - recordAndTranscribe() for fixed-duration audio capture
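
The method names above come from this changelog, but their parameter lists, return types, and the AsrService type in the sketch below are assumptions rather than the verified API:

```dart
import 'package:llm_toolkit/llm_toolkit.dart'; // assumed import path

Future<void> asrDemo(AsrService asr) async {
  // One-shot transcription of an audio file, with chunking for long input.
  final text = await asr.transcribeFile('/path/to/audio.wav');
  print(text);

  // Voice-activated capture: recording ends automatically after silence.
  final spoken = await asr.recordWithVAD();
  print(spoken);

  // Continuous live transcription delivered as a stream of partial results.
  await for (final partial in asr.startStreamingTranscription()) {
    print('partial: $partial');
  }
}
```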

🎤 ASR Audio Processing #

  • WAV Header Parsing - Complete WAV file format validation and metadata extraction
  • Multi-channel Support - Automatic stereo-to-mono conversion for ASR compatibility (sketched after this list)
  • Audio Validation - Comprehensive audio file validation with detailed error reporting
  • Format Detection - Automatic audio format detection and appropriate decoder selection
  • Quality Metrics - Audio quality assessment and processing recommendations
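
The stereo-to-mono conversion mentioned above reduces to averaging the two interleaved channels of each frame; a minimal sketch follows (the package's own converter may handle additional channel layouts):

```dart
/// Downmixes interleaved stereo samples to mono by averaging the left and
/// right channel of each frame.
List<double> stereoToMono(List<double> interleaved) {
  final mono = List<double>.filled(interleaved.length ~/ 2, 0.0);
  for (var i = 0; i < mono.length; i++) {
    mono[i] = (interleaved[2 * i] + interleaved[2 * i + 1]) / 2;
  }
  return mono;
}
```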

⚠️ Current Limitations #

  • ASR Status: ASR functionality is implemented but currently experiencing stability issues and may not work reliably in all scenarios
  • Multimodal Status: Multimodal image+text generation is implemented but not thoroughly tested - use with caution
  • Platform Compatibility: ASR features may have varying performance across different Android devices
  • Model Compatibility: Some ASR models may require specific preprocessing that hasn't been fully optimized

🔧 Improvements from Previous Version #

  • Enhanced Multimodal Support - Improved Gemma multimodal integration with proper image handling
  • Better Engine Detection - More accurate ASR model type detection and configuration
  • Optimized Memory Usage - Reduced memory footprint for ASR processing on mobile devices
  • Improved Error Messages - More descriptive error messages for ASR-related issues
  • Debug Enhancements - Extended debug console with ASR-specific logging and metrics

🐛 Bug Fixes #

  • Fixed audio preprocessing pipeline for Whisper model compatibility
  • Resolved memory leaks in continuous ASR processing
  • Corrected mel-spectrogram generation for various audio durations
  • Fixed streaming audio buffer management and overlap handling
  • Resolved TensorFlow Lite model loading issues for ASR engines
  • Fixed microphone permission handling across different Android versions

📚 Documentation Updates #

  • ASR Integration Guide - Complete guide for implementing speech recognition features
  • Audio Processing Tutorial - Technical documentation for audio preprocessing and format handling
  • Streaming Best Practices - Guidelines for optimal real-time speech recognition performance
  • Troubleshooting Guide - Common ASR issues and their solutions
  • Model Compatibility Matrix - Supported ASR models and their specific requirements

🔮 Upcoming in Next Release #

  • ASR Stability Improvements - Focus on resolving current ASR reliability issues
  • Multimodal Testing - Comprehensive testing and validation of image+text generation
  • Additional ASR Models - Support for more ASR model architectures and formats
  • Performance Optimizations - Further memory and speed optimizations for mobile deployment

0.0.3 July 1, 2025 #

🚀 New Features #

  • RAG (Retrieval-Augmented Generation) Architecture - Complete modular RAG system with engine-agnostic design
  • Multi-Engine RAG Support - Extensible architecture supporting GGUF, GGML, TFLite, and future inference engines
  • Document Management System - Add, remove, and manage documents with automatic chunking and embedding generation
  • Intelligent Document Chunking - Multiple chunking strategies (sentence, paragraph, fixed-size) with configurable overlap
  • Semantic Search - Cosine similarity-based document retrieval with confidence scoring
  • Context-Aware Generation - Generate answers using relevant document chunks as context
  • Llama RAG Engine - First implementation using llama_cpp_dart for both embeddings and text generation (usage sketched after this list)
  • Flexible Configuration - Customizable similarity thresholds, chunk sizes, and generation parameters
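
LlamaRagEngine is named in this release, but its constructor and method signatures are not shown in this changelog; the sketch below illustrates the intended add-then-query flow under those assumptions:

```dart
import 'package:llm_toolkit/llm_toolkit.dart'; // assumed import path

Future<void> ragDemo() async {
  // Hypothetical constructor arguments; the real API may differ.
  final engine = LlamaRagEngine(modelPath: '/models/model.gguf');

  // Documents are chunked and embedded on add.
  await engine.addDocument('doc-1', 'LLMs can run fully on-device...');

  // Retrieval + generation: the top-scoring chunks become context.
  final answer = await engine.query('Can LLMs run on phones?');
  print(answer);
}
```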

🔧 RAG System Components #

  • RagEngine Interface - Abstract base class for implementing different RAG backends
  • DocumentChunk Model - Structured document representation with metadata and relevance scoring
  • TextChunker Utility - Advanced text splitting with sentence-aware chunking and overlap management (strategy sketched after this list)
  • SimilarityCalculator - Multiple similarity metrics (cosine, euclidean) for document matching
  • RagService Integration - Seamless integration with existing LLM service architecture
  • Batch Processing - Efficient batch embedding generation for large document collections
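
A minimal fixed-size chunker with overlap, illustrating the strategy behind the TextChunker utility (the real utility also offers sentence- and paragraph-aware splitting):

```dart
import 'dart:math' as math;

/// Splits [text] into chunks of at most [chunkSize] characters, with
/// [overlap] characters shared between consecutive chunks so context is
/// preserved across chunk boundaries.
List<String> chunkText(String text, {int chunkSize = 512, int overlap = 64}) {
  assert(overlap < chunkSize);
  final chunks = <String>[];
  var start = 0;
  while (start < text.length) {
    final end = math.min(start + chunkSize, text.length);
    chunks.add(text.substring(start, end));
    if (end == text.length) break;
    start = end - overlap; // step back to create the overlap window
  }
  return chunks;
}
```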

🧠 RAG Features #

  • Document Metadata Support - Rich metadata storage and filtering capabilities
  • Multi-Document Support - Handle multiple documents with unique identification
  • Relevance Scoring - Confidence-based ranking of retrieved document chunks
  • Dynamic Context Building - Intelligent context assembly from multiple relevant chunks
  • Memory-Efficient Processing - Optimized for mobile devices with limited memory
  • Streaming Integration - Compatible with existing streaming text generation

🔧 Technical Improvements #

  • Modular Architecture - Clean separation between RAG core, engines, and utilities
  • Engine Abstraction - Easy addition of new inference engines (TFLite, ONNX, etc.)
  • Error Handling - Comprehensive error management for document processing and retrieval
  • Performance Optimization - Efficient embedding caching and similarity computation
  • Memory Management - Smart memory usage for large document collections
  • Async Operations - Non-blocking document processing and query handling

📚 RAG Configuration Options #

  • Chunking Configuration - Customizable chunk size, overlap, and splitting strategies
  • Retrieval Parameters - Adjustable similarity thresholds and maximum relevant chunks
  • Generation Settings - Temperature, max tokens, and system prompt customization
  • Engine-Specific Settings - Tailored configurations for different inference engines
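
Taken together, a configuration covering the options above might look like the hypothetical snippet below; the type and parameter names are illustrative, not the verified API:

```dart
// Hypothetical names throughout; consult the API reference for the real
// configuration surface.
final ragConfig = RagConfig(
  chunkSize: 512,            // characters per chunk
  chunkOverlap: 64,          // shared context between adjacent chunks
  similarityThreshold: 0.7,  // minimum similarity for retrieval
  maxRelevantChunks: 4,      // context budget per query
  temperature: 0.7,
  maxTokens: 512,
  systemPrompt: 'Answer using only the provided context.',
);
```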

🛠️ Implementation Details #

  • LlamaRagEngine - Complete implementation using llama_cpp_dart for embeddings and generation
  • Document Processing Pipeline - Automated chunking, embedding, and indexing workflow
  • Query Processing - Efficient query embedding and similarity search
  • Context Assembly - Smart context building with relevance-based chunk selection (sketched after this list)
  • Response Generation - Context-aware answer generation with source attribution
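
A simplified sketch of the relevance-based context assembly step; retrieved chunks are modeled as plain Dart records here rather than the package's DocumentChunk class, and the thresholds are illustrative:

```dart
/// Joins the highest-scoring retrieved chunks into a single context string
/// for the generation prompt.
String buildContext(
  List<({String text, double score})> retrieved, {
  double minScore = 0.3,
  int maxChunks = 4,
}) {
  final selected = retrieved.where((c) => c.score >= minScore).toList()
    ..sort((a, b) => b.score.compareTo(a.score));
  return selected.take(maxChunks).map((c) => c.text).join('\n\n');
}
```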

🔧 Improvements from Previous Version #

  • Enhanced Model Browser with tabbed interface (Search, Recommended, Downloaded)
  • File Browser Integration - Load models directly from device storage
  • Device-Specific Recommendations - Memory-aware model suggestions based on device capabilities
  • Advanced Debug Console - Real-time logging with color-coded categories and filtering
  • Improved Model Cards - Better file organization, download status, and compatibility indicators
  • Native Library Validation - Automatic detection of llama.cpp compatibility and health checks
  • Progressive Model Loading - Fallback configurations for memory-constrained devices
  • Enhanced Error Handling - Detailed error messages with troubleshooting tips and recovery suggestions

🐛 Bug Fixes #

  • Fixed SIGSEGV crashes with problematic quantizations on Android ARM64 devices
  • Resolved memory leaks during model loading/unloading cycles
  • Fixed download interruption handling and resume functionality
  • Corrected model file validation for corrupted GGUF files
  • Fixed context size calculation for low-memory devices
  • Resolved engine detection issues for certain model formats
  • Fixed UI state management during concurrent operations

📚 Documentation Updates #

  • RAG Architecture Guide - Comprehensive documentation for implementing RAG systems
  • Engine Integration Tutorial - Step-by-step guide for adding new inference engines
  • Document Processing Best Practices - Guidelines for optimal chunking and embedding strategies
  • Performance Optimization - RAG-specific performance tuning recommendations
  • API Reference - Complete API documentation for RAG components and utilities

🎯 Future RAG Roadmap #

  • Vector Database Integration - Support for external vector databases (Pinecone, Weaviate)
  • Advanced Retrieval Strategies - Hybrid search, re-ranking, and query expansion
  • Multi-Modal RAG - Support for images, PDFs, and other document formats
  • Distributed RAG - Cloud-based document processing and retrieval
  • RAG Analytics - Performance metrics and retrieval quality analysis

0.0.2 June 29, 2025 #

🚀 New Features #

  • Enhanced Model Browser with tabbed interface (Search, Recommended, Downloaded)
  • File Browser Integration - Load models directly from device storage
  • Device-Specific Recommendations - Memory-aware model suggestions based on device capabilities
  • Advanced Debug Console - Real-time logging with color-coded categories and filtering
  • Improved Model Cards - Better file organization, download status, and compatibility indicators
  • Native Library Validation - Automatic detection of llama.cpp compatibility and health checks
  • Progressive Model Loading - Fallback configurations for memory-constrained devices
  • Enhanced Error Handling - Detailed error messages with troubleshooting tips and recovery suggestions

🔧 Improvements #

  • Quantization Safety Checks - Automatic detection and warnings for unstable quantizations (Q2_K, IQ1_S, IQ1_M)
  • Memory Management - Dynamic context size adjustment based on available device memory (illustrated after this list)
  • Download Progress - Real-time progress indicators with speed metrics and ETA
  • Model Detection - Improved engine detection for GGUF and TFLite files with better accuracy
  • UI/UX Enhancements - Material Design 3 with gradient themes, better accessibility, and responsive design
  • Performance Monitoring - Token generation speed tracking, memory usage reporting, and performance analytics
  • Search Optimization - Better model discovery with multiple search strategies and relevance ranking
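
As a rough illustration of the memory-aware adjustment in the Memory Management item above, a context-size heuristic might look like the following; the actual tiers llm_toolkit applies are not documented in this changelog:

```dart
/// Picks a conservative context length from the device's available memory.
/// Purely illustrative thresholds, not the package's real values.
int contextSizeFor(int availableMemoryMb) {
  if (availableMemoryMb >= 6144) return 4096;
  if (availableMemoryMb >= 4096) return 2048;
  if (availableMemoryMb >= 2048) return 1024;
  return 512; // low-memory fallback
}
```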

🐛 Bug Fixes #

  • Fixed SIGSEGV crashes with problematic quantizations on Android ARM64 devices
  • Resolved memory leaks during model loading/unloading cycles
  • Fixed download interruption handling and resume functionality
  • Corrected model file validation for corrupted GGUF files
  • Fixed context size calculation for low-memory devices
  • Resolved engine detection issues for certain model formats
  • Fixed UI state management during concurrent operations

📚 Documentation #

  • Added comprehensive model compatibility guide with device-specific recommendations
  • Enhanced troubleshooting documentation for Android devices with common solutions
  • Updated API documentation with new configuration options and examples
  • Added performance optimization guidelines for mobile deployment

🛠️ Technical Improvements #

  • Isolate-based Model Loading - Better crash protection and memory isolation (pattern sketched after this list)
  • Enhanced Logging System - Categorized debug logs with filtering and export capabilities
  • Improved Error Recovery - Graceful fallback mechanisms for failed operations
  • Better Resource Management - Optimized memory usage and cleanup procedures
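
A minimal sketch of the isolate-based loading pattern using Dart's Isolate.run; the load callback is a placeholder for whatever blocking work the engine actually performs, and error handling here is illustrative:

```dart
import 'dart:isolate';

/// Runs a blocking load routine in a separate isolate so long computations
/// and Dart-level failures do not stall the UI isolate.
Future<T?> loadInIsolate<T>(T Function() load) async {
  try {
    return await Isolate.run(load);
  } catch (e) {
    print('Model loading failed: $e');
    return null; // graceful fallback instead of an unhandled error
  }
}
```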

0.0.1 June 27, 2025 #

🎉 Initial Release #

  • Dual Engine Support - Llama (GGUF) and Gemma (TFLite) inference engines
  • HuggingFace Integration - Direct model search and download from HuggingFace Hub
  • Model Management - Automatic model detection and configuration
  • Chat Interface - Real-time streaming text generation with conversation history
  • Cross-Platform - Support for Android and iOS devices

🔧 Core Features #

  • Model Search - Search and filter models by format, size, and compatibility
  • Automatic Downloads - Background model downloading with progress tracking
  • Engine Detection - Automatic selection of appropriate inference engine based on model format
  • Memory Optimization - Dynamic configuration based on device capabilities
  • Error Handling - Comprehensive exception handling and user feedback

📱 Supported Formats #

  • GGUF - Quantized models for efficient inference (Q4_K_M, Q4_0, Q5_K_M, Q8_0)
  • TFLite - TensorFlow Lite models optimized for mobile devices
  • Model Types - Support for Gemma, Llama, Phi, DeepSeek, and custom models

🛠️ Technical Stack #

  • Flutter Framework - Cross-platform mobile development
  • llama_cpp_dart - Native GGUF model inference with ARM64 optimization
  • flutter_gemma - TFLite model inference with GPU acceleration
  • Dio HTTP Client - Efficient model downloading and API communication
  • Path Provider - Cross-platform file system access and management

🎯 Initial Capabilities #

  • Text Generation - Streaming text generation with customizable parameters (sketched after this list)
  • Model Discovery - Browse and search thousands of compatible models
  • Local Storage - Efficient model caching and management
  • Debug Tools - Basic logging and error reporting
  • Configuration Management - Flexible inference configuration options
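
A sketch of the streaming generation flow from this release; the service type and method name below are placeholders, since the changelog does not show the exact 0.0.1 API:

```dart
import 'dart:io';

import 'package:llm_toolkit/llm_toolkit.dart'; // assumed import path

Future<void> chatDemo(LlmService llm) async {
  // Tokens arrive incrementally, so the UI can render partial output
  // while generation is still running.
  await for (final token in llm.generateStream('Tell me about llamas')) {
    stdout.write(token);
  }
}
```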