# LLM Toolkit SDK Changelog

## 0.0.4 - July 4, 2025

### 🎤 New Features - Speech Recognition (ASR)
- Complete ASR Integration - Full TensorFlow Lite ASR engine with Whisper model support
- Real-time Speech Recognition - Live streaming transcription with continuous audio processing
- Multi-format Audio Support - WAV, PCM16, PCM32, Float32 audio format compatibility
- Advanced Audio Preprocessing - Whisper-compatible mel-spectrogram generation with FFT processing
- Voice Activity Detection (VAD) - Smart silence detection and automatic recording termination
- Chunked Audio Processing - Efficient handling of long audio files with overlap management
- ASR Service Layer - High-level service wrapper for easy speech recognition integration
- Microphone Integration - Native microphone access with permission handling
### 🔊 ASR Technical Features
- Enhanced Audio Pipeline - Optimized mel-filter bank computation with isolate processing
- Adaptive Processing - Dynamic audio length optimization based on content characteristics
- Silence Detection - Advanced RMS-based silence detection with configurable thresholds (see the sketch after this list)
- Audio Resampling - High-quality interpolation-based audio resampling for format compatibility
- Model Type Detection - Automatic ASR model type detection (Whisper, Wav2Vec2, DeepSpeech)
- Streaming Optimization - Low-latency streaming with configurable chunk sizes and overlap
- Memory Efficient - Optimized for mobile devices with minimal memory footprint
- Error Recovery - Robust error handling with graceful fallbacks for audio processing
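For intuition, the RMS gate at the heart of the silence detector is tiny. A minimal sketch; the function names and the default threshold are illustrative, not the toolkit's actual API:

```dart
import 'dart:math' as math;

/// Root-mean-square energy of a PCM frame (samples normalized to [-1, 1]).
double rms(List<double> frame) {
  if (frame.isEmpty) return 0.0;
  var sumSquares = 0.0;
  for (final sample in frame) {
    sumSquares += sample * sample;
  }
  return math.sqrt(sumSquares / frame.length);
}

/// A frame counts as silence when its energy stays below the threshold.
bool isSilent(List<double> frame, {double threshold = 0.01}) =>
    rms(frame) < threshold;
```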
### 🎯 ASR Configuration Options
- ASRConfig Class - Comprehensive configuration for sample rates, bit rates, and streaming parameters
- Quality Presets - Pre-built configurations for high-quality, low-latency, mobile, and streaming scenarios
- VAD Settings - Configurable voice activity detection with silence timeouts and thresholds
- Performance Tuning - Adjustable confidence thresholds and token limits for optimal performance
- Streaming Parameters - Customizable streaming intervals and chunk processing settings
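Putting the options above together, a configuration might look roughly like this. `ASRConfig` is the SDK class named above, but the field names and values below are assumptions for illustration:

```dart
// Field names are illustrative assumptions; consult the ASRConfig API
// docs for the exact parameters.
import 'package:llm_toolkit/llm_toolkit.dart'; // assumed default import

final config = ASRConfig(
  sampleRate: 16000,                    // Whisper expects 16 kHz mono input
  silenceTimeout: Duration(seconds: 2), // VAD: stop after sustained silence
  silenceThreshold: 0.01,               // RMS level treated as silence
  confidenceThreshold: 0.5,             // discard low-confidence tokens
  maxTokens: 224,                       // cap on decoded output length
  streamingInterval: Duration(milliseconds: 500), // partial-result cadence
);
```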
### 🔧 ASR Service Methods
- File Transcription - `transcribeFile()` for processing audio files with enhanced chunking
- Byte Transcription - `transcribeBytes()` for direct audio data processing
- Live Recording - `startRecording()` / `stopRecording()` for real-time audio capture
- Streaming Recognition - `startStreamingTranscription()` for continuous live transcription
- VAD Recording - `recordWithVAD()` for intelligent voice-activated recording
- Timed Recording - `recordAndTranscribe()` for fixed-duration audio capture
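Taken together, the service surface can be used along these lines. The service class name (`AsrService` here) and the exact signatures are assumptions based on the method list above:

```dart
// Hypothetical usage; `AsrService` and the signatures are illustrative.
Future<void> demo(AsrService asr) async {
  // One-shot transcription of an audio file on disk.
  final text = await asr.transcribeFile('/path/to/audio.wav');
  print('Transcript: $text');

  // Continuous live transcription: partial results arrive on a stream.
  final sub = asr.startStreamingTranscription().listen(print);
  await Future<void>.delayed(const Duration(seconds: 10));
  await sub.cancel();

  // Voice-activated capture: recording ends automatically on silence.
  print('VAD transcript: ${await asr.recordWithVAD()}');
}
```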
### 🎤 ASR Audio Processing
- WAV Header Parsing - Complete WAV file format validation and metadata extraction (parser sketch after this list)
- Multi-channel Support - Automatic stereo-to-mono conversion for ASR compatibility
- Audio Validation - Comprehensive audio file validation with detailed error reporting
- Format Detection - Automatic audio format detection and appropriate decoder selection
- Quality Metrics - Audio quality assessment and processing recommendations
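As a reference point for the header parsing above, here is a minimal canonical-WAV parser in Dart. The toolkit's validator is more thorough; in particular, real files may carry extra chunks before `fmt `, which a full parser handles by walking the chunk list:

```dart
import 'dart:typed_data';

class WavInfo {
  final int format, channels, sampleRate, bitsPerSample;
  WavInfo(this.format, this.channels, this.sampleRate, this.bitsPerSample);
}

/// Parses a canonical 44-byte WAV header (RIFF header, then 'fmt ' first).
WavInfo parseWavHeader(Uint8List bytes) {
  if (bytes.length < 44 ||
      String.fromCharCodes(bytes.sublist(0, 4)) != 'RIFF' ||
      String.fromCharCodes(bytes.sublist(8, 12)) != 'WAVE') {
    throw const FormatException('Not a RIFF/WAVE file');
  }
  final data = ByteData.sublistView(bytes);
  return WavInfo(
    data.getUint16(20, Endian.little), // 1 = PCM, 3 = IEEE float
    data.getUint16(22, Endian.little), // channel count
    data.getUint32(24, Endian.little), // sample rate in Hz
    data.getUint16(34, Endian.little), // bits per sample
  );
}
```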
### ⚠️ Current Limitations
- ASR Status: ASR functionality is implemented but currently experiencing stability issues and may not work reliably in all scenarios
- Multimodal Status: Multimodal image+text generation is implemented but not thoroughly tested - use with caution
- Platform Compatibility: ASR features may have varying performance across different Android devices
- Model Compatibility: Some ASR models may require specific preprocessing that hasn't been fully optimized
### 🔧 Improvements from Previous Version
- Enhanced Multimodal Support - Improved Gemma multimodal integration with proper image handling
- Better Engine Detection - More accurate ASR model type detection and configuration
- Optimized Memory Usage - Reduced memory footprint for ASR processing on mobile devices
- Improved Error Messages - More descriptive error messages for ASR-related issues
- Debug Enhancements - Extended debug console with ASR-specific logging and metrics
### 🐛 Bug Fixes
- Fixed audio preprocessing pipeline for Whisper model compatibility
- Resolved memory leaks in continuous ASR processing
- Corrected mel-spectrogram generation for various audio durations
- Fixed streaming audio buffer management and overlap handling
- Resolved TensorFlow Lite model loading issues for ASR engines
- Fixed microphone permission handling across different Android versions
### 📚 Documentation Updates
- ASR Integration Guide - Complete guide for implementing speech recognition features
- Audio Processing Tutorial - Technical documentation for audio preprocessing and format handling
- Streaming Best Practices - Guidelines for optimal real-time speech recognition performance
- Troubleshooting Guide - Common ASR issues and their solutions
- Model Compatibility Matrix - Supported ASR models and their specific requirements
### 🔮 Upcoming in Next Release
- ASR Stability Improvements - Focus on resolving current ASR reliability issues
- Multimodal Testing - Comprehensive testing and validation of image+text generation
- Additional ASR Models - Support for more ASR model architectures and formats
- Performance Optimizations - Further memory and speed optimizations for mobile deployment
## 0.0.3 - July 1, 2025

### 🚀 New Features
- RAG (Retrieval-Augmented Generation) Architecture - Complete modular RAG system with engine-agnostic design
- Multi-Engine RAG Support - Extensible architecture supporting GGUF, GGML, TFLite, and future inference engines
- Document Management System - Add, remove, and manage documents with automatic chunking and embedding generation
- Intelligent Document Chunking - Multiple chunking strategies (sentence, paragraph, fixed-size) with configurable overlap
- Semantic Search - Cosine similarity-based document retrieval with confidence scoring
- Context-Aware Generation - Generate answers using relevant document chunks as context
- Llama RAG Engine - First implementation using llama_cpp_dart for both embeddings and text generation
- Flexible Configuration - Customizable similarity thresholds, chunk sizes, and generation parameters
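A sketch of the intended end-to-end flow, using the `LlamaRagEngine` named above; the constructor arguments and method names here are illustrative assumptions, not the exact published API:

```dart
// Hypothetical flow; names beyond `LlamaRagEngine` are illustrative.
Future<void> ragDemo() async {
  final engine = LlamaRagEngine(modelPath: '/models/llama.gguf');

  // Ingestion: the document is chunked and embedded automatically.
  await engine.addDocument(
    id: 'device-manual',
    text: 'The device supports GGUF and TFLite model formats. ...',
  );

  // Query: the most relevant chunks are retrieved and used as context.
  final answer = await engine.query('Which model formats are supported?');
  print(answer);
}
```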
### 🔧 RAG System Components
- RagEngine Interface - Abstract base class for implementing different RAG backends
- DocumentChunk Model - Structured document representation with metadata and relevance scoring
- TextChunker Utility - Advanced text splitting with sentence-aware chunking and overlap management
- SimilarityCalculator - Multiple similarity metrics (cosine, euclidean) for document matching (cosine sketch after this list)
- RagService Integration - Seamless integration with existing LLM service architecture
- Batch Processing - Efficient batch embedding generation for large document collections
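For reference, the cosine metric used for document matching reduces to a normalized dot product. A self-contained sketch; the `SimilarityCalculator` API itself may differ:

```dart
import 'dart:math' as math;

/// Cosine similarity between two embedding vectors:
/// dot(a, b) / (|a| * |b|), in [-1, 1], where 1 means identical direction.
double cosineSimilarity(List<double> a, List<double> b) {
  assert(a.length == b.length, 'Embeddings must have equal dimensions');
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}
```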
### 🧠 RAG Features
- Document Metadata Support - Rich metadata storage and filtering capabilities
- Multi-Document Support - Handle multiple documents with unique identification
- Relevance Scoring - Confidence-based ranking of retrieved document chunks
- Dynamic Context Building - Intelligent context assembly from multiple relevant chunks (assembly sketch after this list)
- Memory-Efficient Processing - Optimized for mobile devices with limited memory
- Streaming Integration - Compatible with existing streaming text generation
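The context-building step amounts to ranking chunks by relevance, keeping the top k, and splicing them into the prompt. A simplified sketch with hypothetical types (the real `DocumentChunk` model carries metadata as well):

```dart
// Simplified stand-in for the scored chunks produced by retrieval.
class ScoredChunk {
  final String text;
  final double score;
  ScoredChunk(this.text, this.score);
}

String buildPrompt(String question, List<ScoredChunk> chunks, {int topK = 3}) {
  // Highest-scoring chunks first, then keep only the top k.
  final ranked = [...chunks]..sort((a, b) => b.score.compareTo(a.score));
  final context = ranked.take(topK).map((c) => c.text).join('\n---\n');
  return 'Answer using only the context below.\n\n'
      'Context:\n$context\n\nQuestion: $question';
}
```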
### 🔧 Technical Improvements
- Modular Architecture - Clean separation between RAG core, engines, and utilities
- Engine Abstraction - Easy addition of new inference engines (TFLite, ONNX, etc.)
- Error Handling - Comprehensive error management for document processing and retrieval
- Performance Optimization - Efficient embedding caching and similarity computation
- Memory Management - Smart memory usage for large document collections
- Async Operations - Non-blocking document processing and query handling
### 📚 RAG Configuration Options
- Chunking Configuration - Customizable chunk size, overlap, and splitting strategies
- Retrieval Parameters - Adjustable similarity thresholds and maximum relevant chunks
- Generation Settings - Temperature, max tokens, and system prompt customization
- Engine-Specific Settings - Tailored configurations for different inference engines
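Combining the knobs above, a configuration could look like this. Both the `RagConfig` name and its fields are hypothetical illustrations of the listed options, not the published API:

```dart
// Hypothetical class and field names, assembled from the options above.
final ragConfig = RagConfig(
  chunkSize: 512,           // characters per chunk
  chunkOverlap: 64,         // shared tail between consecutive chunks
  similarityThreshold: 0.7, // drop chunks scoring below this
  maxRelevantChunks: 3,     // cap on chunks used as context
  temperature: 0.7,
  maxTokens: 256,
  systemPrompt: 'Answer strictly from the provided context.',
);
```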
### 🛠️ Implementation Details
- LlamaRagEngine - Complete implementation using llama_cpp_dart for embeddings and generation
- Document Processing Pipeline - Automated chunking, embedding, and indexing workflow (chunking sketch after this list)
- Query Processing - Efficient query embedding and similarity search
- Context Assembly - Smart context building with relevance-based chunk selection
- Response Generation - Context-aware answer generation with source attribution
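The fixed-size strategy in the pipeline is the easiest to picture; a minimal sketch (the real `TextChunker` layers sentence awareness on top of this idea):

```dart
import 'dart:math' as math;

/// Fixed-size chunking with overlap, the simplest of the strategies above.
List<String> chunkText(String text, {int chunkSize = 512, int overlap = 64}) {
  assert(overlap < chunkSize, 'overlap must be smaller than chunkSize');
  final chunks = <String>[];
  var start = 0;
  while (start < text.length) {
    final end = math.min(start + chunkSize, text.length);
    chunks.add(text.substring(start, end));
    if (end == text.length) break;
    start = end - overlap; // step back so context spans chunk boundaries
  }
  return chunks;
}
```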
### 📚 Documentation Updates
- RAG Architecture Guide - Comprehensive documentation for implementing RAG systems
- Engine Integration Tutorial - Step-by-step guide for adding new inference engines
- Document Processing Best Practices - Guidelines for optimal chunking and embedding strategies
- Performance Optimization - RAG-specific performance tuning recommendations
- API Reference - Complete API documentation for RAG components and utilities
### 🎯 Future RAG Roadmap
- Vector Database Integration - Support for external vector databases (Pinecone, Weaviate)
- Advanced Retrieval Strategies - Hybrid search, re-ranking, and query expansion
- Multi-Modal RAG - Support for images, PDFs, and other document formats
- Distributed RAG - Cloud-based document processing and retrieval
- RAG Analytics - Performance metrics and retrieval quality analysis
## 0.0.2 - June 29, 2025

### 🚀 New Features
- Enhanced Model Browser with tabbed interface (Search, Recommended, Downloaded)
- File Browser Integration - Load models directly from device storage
- Device-Specific Recommendations - Memory-aware model suggestions based on device capabilities
- Advanced Debug Console - Real-time logging with color-coded categories and filtering
- Improved Model Cards - Better file organization, download status, and compatibility indicators
- Native Library Validation - Automatic detection of llama.cpp compatibility and health checks
- Progressive Model Loading - Fallback configurations for memory-constrained devices
- Enhanced Error Handling - Detailed error messages with troubleshooting tips and recovery suggestions
### 🔧 Improvements
- Quantization Safety Checks - Automatic detection and warnings for unstable quantizations (Q2_K, IQ1_S, IQ1_M); see the sketch after this list
- Memory Management - Dynamic context size adjustment based on available device memory
- Download Progress - Real-time progress indicators with speed metrics and ETA
- Model Detection - Improved engine detection for GGUF and TFLite files with better accuracy
- UI/UX Enhancements - Material Design 3 with gradient themes, better accessibility, and responsive design
- Performance Monitoring - Token generation speed tracking, memory usage reporting, and performance analytics
- Search Optimization - Better model discovery with multiple search strategies and relevance ranking
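A filename-based version of the safety check is easy to sketch; the toolkit's real detection may inspect GGUF metadata rather than file names:

```dart
/// Quantization variants flagged as unstable on some Android ARM64 builds.
const unstableQuants = {'Q2_K', 'IQ1_S', 'IQ1_M'};

/// Returns a warning for risky quantizations found in a GGUF filename,
/// e.g. 'llama-3-8b.Q2_K.gguf'. Substring matching is illustrative only.
String? quantizationWarning(String fileName) {
  final upper = fileName.toUpperCase();
  for (final quant in unstableQuants) {
    if (upper.contains(quant)) {
      return '$quant quantization may crash on this device; '
          'prefer Q4_K_M or Q5_K_M.';
    }
  }
  return null;
}
```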
### 🐛 Bug Fixes
- Fixed SIGSEGV crashes with problematic quantizations on Android ARM64 devices
- Resolved memory leaks during model loading/unloading cycles
- Fixed download interruption handling and resume functionality
- Corrected model file validation for corrupted GGUF files
- Fixed context size calculation for low-memory devices
- Resolved engine detection issues for certain model formats
- Fixed UI state management during concurrent operations
### 📚 Documentation
- Added comprehensive model compatibility guide with device-specific recommendations
- Enhanced troubleshooting documentation for Android devices with common solutions
- Updated API documentation with new configuration options and examples
- Added performance optimization guidelines for mobile deployment
### 🛠️ Technical Improvements
- Isolate-based Model Loading - Better crash protection and memory isolation (see the sketch after this list)
- Enhanced Logging System - Categorized debug logs with filtering and export capabilities
- Improved Error Recovery - Graceful fallback mechanisms for failed operations
- Better Resource Management - Optimized memory usage and cleanup procedures
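A minimal sketch of the isolate pattern, using a GGUF magic-bytes check as a stand-in for the toolkit's fuller validation. Note that isolates contain Dart-side failures and keep heavy work off the UI isolate; a native SIGSEGV still takes down the whole process:

```dart
import 'dart:io';
import 'dart:isolate';

/// Runs file validation in a short-lived isolate so the UI never blocks.
/// Checking the leading ASCII 'GGUF' magic is a minimal health check.
Future<bool> validateGgufOffMain(String path) {
  return Isolate.run(() {
    final raf = File(path).openSync();
    try {
      final magic = raf.readSync(4); // GGUF files begin with 'GGUF'
      return String.fromCharCodes(magic) == 'GGUF';
    } finally {
      raf.closeSync();
    }
  });
}
```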
## 0.0.1 - June 27, 2025

### 🎉 Initial Release
- Dual Engine Support - Llama (GGUF) and Gemma (TFLite) inference engines
- HuggingFace Integration - Direct model search and download from HuggingFace Hub
- Model Management - Automatic model detection and configuration
- Chat Interface - Real-time streaming text generation with conversation history
- Cross-Platform - Support for Android and iOS devices
### 🔧 Core Features
- Model Search - Search and filter models by format, size, and compatibility
- Automatic Downloads - Background model downloading with progress tracking
- Engine Detection - Automatic selection of appropriate inference engine based on model format (selection sketch after this list)
- Memory Optimization - Dynamic configuration based on device capabilities
- Error Handling - Comprehensive exception handling and user feedback
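Format-based engine selection can be pictured as a simple dispatch on the file extension; a sketch (the toolkit's detection may also inspect file contents):

```dart
enum InferenceEngine { llama, gemma }

/// Maps a model file to an engine, mirroring the automatic detection
/// described above. Extension matching is a simplification.
InferenceEngine detectEngine(String modelPath) {
  final lower = modelPath.toLowerCase();
  if (lower.endsWith('.gguf')) return InferenceEngine.llama;   // llama_cpp_dart
  if (lower.endsWith('.tflite')) return InferenceEngine.gemma; // flutter_gemma
  throw UnsupportedError('Unrecognized model format: $modelPath');
}
```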
### 📱 Supported Formats
- GGUF - Quantized models for efficient inference (Q4_K_M, Q4_0, Q5_K_M, Q8_0)
- TFLite - TensorFlow Lite models optimized for mobile devices
- Model Types - Support for Gemma, Llama, Phi, DeepSeek, and custom models
### 🛠️ Technical Stack
- Flutter Framework - Cross-platform mobile development
- llama_cpp_dart - Native GGUF model inference with ARM64 optimization
- flutter_gemma - TFLite model inference with GPU acceleration
- Dio HTTP Client - Efficient model downloading and API communication
- Path Provider - Cross-platform file system access and management
### 🎯 Initial Capabilities
- Text Generation - Streaming text generation with customizable parameters
- Model Discovery - Browse and search thousands of compatible models
- Local Storage - Efficient model caching and management
- Debug Tools - Basic logging and error reporting
- Configuration Management - Flexible inference configuration options