llm_toolkit 0.0.3
llm_toolkit: ^0.0.3
A comprehensive Flutter SDK for running Large Language Models (LLMs) locally on mobile and desktop devices. Supports multiple inference engines, including Gemma (TFLite) and Llama (GGUF).
LLM Toolkit SDK Changelog
0.0.3 - July 1, 2025
New Features
- RAG (Retrieval-Augmented Generation) Architecture - Complete modular RAG system with engine-agnostic design (see the usage sketch after this list)
- Multi-Engine RAG Support - Extensible architecture supporting GGUF, GGML, TFLite, and future inference engines
- Document Management System - Add, remove, and manage documents with automatic chunking and embedding generation
- Intelligent Document Chunking - Multiple chunking strategies (sentence, paragraph, fixed-size) with configurable overlap
- Semantic Search - Cosine similarity-based document retrieval with confidence scoring
- Context-Aware Generation - Generate answers using relevant document chunks as context
- Llama RAG Engine - First implementation using llama_cpp_dart for both embeddings and text generation
- Flexible Configuration - Customizable similarity thresholds, chunk sizes, and generation parameters
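As a rough end-to-end sketch, usage could look like the following. `RagService` and `LlamaRagEngine` are named in this changelog, but the constructor arguments and the `addDocument`/`query` methods are assumptions for illustration, not the published API.

```dart
import 'package:llm_toolkit/llm_toolkit.dart';

Future<void> main() async {
  // LlamaRagEngine provides both embeddings and generation; the
  // constructor signature here is assumed.
  final engine = LlamaRagEngine(modelPath: '/models/llama-3-8b.Q4_K_M.gguf');
  final rag = RagService(engine: engine);

  // Ingestion chunks and embeds the document automatically.
  await rag.addDocument(
    id: 'handbook',
    text: 'Employees accrue 1.5 vacation days per month of service.',
  );

  // Retrieval plus context-aware generation in one call.
  final answer = await rag.query('How many vacation days do I accrue?');
  print(answer);
}
```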
RAG System Components
- RagEngine Interface - Abstract base class for implementing different RAG backends (sketched after this list)
- DocumentChunk Model - Structured document representation with metadata and relevance scoring
- TextChunker Utility - Advanced text splitting with sentence-aware chunking and overlap management
- SimilarityCalculator - Multiple similarity metrics (cosine, euclidean) for document matching
- RagService Integration - Seamless integration with existing LLM service architecture
- Batch Processing - Efficient batch embedding generation for large document collections
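A minimal sketch of the two central abstractions; only the class names come from this changelog, and the fields and method signatures are illustrative guesses.

```dart
/// Abstract base for RAG backends: embed text into a vector, and
/// generate an answer from an assembled prompt. (Assumed signatures.)
abstract class RagEngine {
  Future<List<double>> embed(String text);
  Future<String> generate(String prompt);
}

/// Structured chunk with metadata and a relevance score, as described
/// above. (Illustrative fields.)
class DocumentChunk {
  final String documentId;
  final String text;
  final Map<String, dynamic> metadata;
  final List<double>? embedding;
  final double relevanceScore;

  const DocumentChunk({
    required this.documentId,
    required this.text,
    this.metadata = const {},
    this.embedding,
    this.relevanceScore = 0.0,
  });
}
```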
RAG Features
- Document Metadata Support - Rich metadata storage and filtering capabilities
- Multi-Document Support - Handle multiple documents with unique identification
- Relevance Scoring - Confidence-based ranking of retrieved document chunks
- Dynamic Context Building - Intelligent context assembly from multiple relevant chunks
- Memory-Efficient Processing - Optimized for mobile devices with limited memory
- Streaming Integration - Compatible with existing streaming text generation
Technical Improvements
- Modular Architecture - Clean separation between RAG core, engines, and utilities
- Engine Abstraction - Easy addition of new inference engines (TFLite, ONNX, etc.)
- Error Handling - Comprehensive error management for document processing and retrieval
- Performance Optimization - Efficient embedding caching and similarity computation
- Memory Management - Smart memory usage for large document collections
- Async Operations - Non-blocking document processing and query handling
RAG Configuration Options
- Chunking Configuration - Customizable chunk size, overlap, and splitting strategies (see the chunking sketch after this list)
- Retrieval Parameters - Adjustable similarity thresholds and maximum relevant chunks
- Generation Settings - Temperature, max tokens, and system prompt customization
- Engine-Specific Settings - Tailored configurations for different inference engines
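To make the chunk-size and overlap options concrete, here is a self-contained fixed-size chunker with overlap. It is a simplified stand-in for the SDK's `TextChunker`, not its actual implementation.

```dart
/// Fixed-size chunking with overlap: each chunk repeats the last
/// [overlap] characters of its predecessor, so text cut at a boundary
/// still appears intact in at least one chunk.
List<String> chunkFixedSize(String text,
    {int chunkSize = 512, int overlap = 64}) {
  assert(overlap < chunkSize, 'overlap must be smaller than chunkSize');
  final chunks = <String>[];
  var start = 0;
  while (start < text.length) {
    final end =
        (start + chunkSize < text.length) ? start + chunkSize : text.length;
    chunks.add(text.substring(start, end));
    if (end == text.length) break;
    start = end - overlap; // step back to create the overlapping span
  }
  return chunks;
}

void main() {
  final chunks = chunkFixedSize('a' * 1200);
  print(chunks.map((c) => c.length).toList()); // [512, 512, 304]
}
```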
Implementation Details
- LlamaRagEngine - Complete implementation using llama_cpp_dart for embeddings and generation
- Document Processing Pipeline - Automated chunking, embedding, and indexing workflow
- Query Processing - Efficient query embedding and similarity search (see the similarity sketch after this list)
- Context Assembly - Smart context building with relevance-based chunk selection
- Response Generation - Context-aware answer generation with source attribution
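The retrieval step reduces to a similarity ranking. The following self-contained sketch shows the underlying math: cosine similarity between a query embedding and chunk embeddings, then threshold filtering and top-k selection. It illustrates the technique, not the SDK's actual `SimilarityCalculator`.

```dart
import 'dart:math';

double cosineSimilarity(List<double> a, List<double> b) {
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (sqrt(normA) * sqrt(normB));
}

/// Returns the indices of the top-[k] chunks whose similarity to the
/// query meets [threshold], highest first.
List<int> topChunks(List<double> query, List<List<double>> chunkEmbeddings,
    {int k = 4, double threshold = 0.75}) {
  final scored = <MapEntry<int, double>>[];
  for (var i = 0; i < chunkEmbeddings.length; i++) {
    final s = cosineSimilarity(query, chunkEmbeddings[i]);
    if (s >= threshold) scored.add(MapEntry(i, s));
  }
  scored.sort((a, b) => b.value.compareTo(a.value));
  return scored.take(k).map((e) => e.key).toList();
}
```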
Documentation Updates
- RAG Architecture Guide - Comprehensive documentation for implementing RAG systems
- Engine Integration Tutorial - Step-by-step guide for adding new inference engines
- Document Processing Best Practices - Guidelines for optimal chunking and embedding strategies
- Performance Optimization - RAG-specific performance tuning recommendations
- API Reference - Complete API documentation for RAG components and utilities
Future RAG Roadmap
- Vector Database Integration - Support for external vector databases (Pinecone, Weaviate)
- Advanced Retrieval Strategies - Hybrid search, re-ranking, and query expansion
- Multi-Modal RAG - Support for images, PDFs, and other document formats
- Distributed RAG - Cloud-based document processing and retrieval
- RAG Analytics - Performance metrics and retrieval quality analysis
0.0.2 - June 29, 2025
New Features
- Enhanced Model Browser with tabbed interface (Search, Recommended, Downloaded)
- File Browser Integration - Load models directly from device storage
- Device-Specific Recommendations - Memory-aware model suggestions based on device capabilities
- Advanced Debug Console - Real-time logging with color-coded categories and filtering
- Improved Model Cards - Better file organization, download status, and compatibility indicators
- Native Library Validation - Automatic detection of llama.cpp compatibility and health checks
- Progressive Model Loading - Fallback configurations for memory-constrained devices
- Enhanced Error Handling - Detailed error messages with troubleshooting tips and recovery suggestions
Improvements
- Quantization Safety Checks - Automatic detection and warnings for unstable quantizations (Q2_K, IQ1_S, IQ1_M); a detection sketch follows this list
- Memory Management - Dynamic context size adjustment based on available device memory
- Download Progress - Real-time progress indicators with speed metrics and ETA
- Model Detection - More accurate engine detection for GGUF and TFLite files
- UI/UX Enhancements - Material Design 3 with gradient themes, better accessibility, and responsive design
- Performance Monitoring - Token generation speed tracking, memory usage reporting, and performance analytics
- Search Optimization - Better model discovery with multiple search strategies and relevance ranking
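As an illustration of the quantization safety check, the sketch below flags the quantization tags this changelog lists as unstable by matching them in a GGUF filename. The filename heuristic is an assumption; the SDK may inspect model metadata instead.

```dart
// Quantization tags flagged as unstable in this changelog.
const unstableQuants = {'Q2_K', 'IQ1_S', 'IQ1_M'};

/// Returns a warning message if [fileName] carries a known-unstable
/// quantization tag, or null if no issue is detected.
String? quantizationWarning(String fileName) {
  final upper = fileName.toUpperCase();
  for (final quant in unstableQuants) {
    if (upper.contains(quant)) {
      return '$quant may be unstable on this device; '
          'prefer Q4_K_M or Q4_0 instead.';
    }
  }
  return null;
}

void main() {
  print(quantizationWarning('gemma-2b.IQ1_S.gguf'));    // warning
  print(quantizationWarning('llama-3-8b.Q4_K_M.gguf')); // null
}
```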
Bug Fixes
- Fixed SIGSEGV crashes with problematic quantizations on Android ARM64 devices
- Resolved memory leaks during model loading/unloading cycles
- Fixed download interruption handling and resume functionality
- Corrected model file validation for corrupted GGUF files
- Fixed context size calculation for low-memory devices
- Resolved engine detection issues for certain model formats
- Fixed UI state management during concurrent operations
Documentation
- Added comprehensive model compatibility guide with device-specific recommendations
- Enhanced troubleshooting documentation for Android devices with common solutions
- Updated API documentation with new configuration options and examples
- Added performance optimization guidelines for mobile deployment
Technical Improvements
- Isolate-based Model Loading - Better crash protection and memory isolation (sketched after this list)
- Enhanced Logging System - Categorized debug logs with filtering and export capabilities
- Improved Error Recovery - Graceful fallback mechanisms for failed operations
- Better Resource Management - Optimized memory usage and cleanup procedures
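A minimal sketch of the isolate-based loading idea, assuming `Isolate.run` as the mechanism (the SDK may use a different isolate API): the load runs off the UI isolate, so a Dart-level failure completes the future with an error instead of freezing the UI. `loadModel` is a hypothetical stand-in for the real native call.

```dart
import 'dart:isolate';

/// Hypothetical stand-in for the native llama.cpp initialization.
Future<String> loadModel(String path) async {
  // ... native model loading would happen here ...
  return 'loaded:$path';
}

/// Runs the load in a fresh isolate. Isolate.run spawns the isolate,
/// executes the closure, returns its result, and surfaces uncaught
/// errors as a failed future on the caller's side.
Future<String> loadModelIsolated(String path) {
  return Isolate.run(() => loadModel(path));
}

Future<void> main() async {
  print(await loadModelIsolated('/models/gemma-2b.Q4_0.gguf'));
}
```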
0.0.1 - June 27, 2025
Initial Release
- Dual Engine Support - Llama (GGUF) and Gemma (TFLite) inference engines
- HuggingFace Integration - Direct model search and download from HuggingFace Hub
- Model Management - Automatic model detection and configuration
- Chat Interface - Real-time streaming text generation with conversation history
- Cross-Platform - Support for Android and iOS devices
Core Features
- Model Search - Search and filter models by format, size, and compatibility
- Automatic Downloads - Background model downloading with progress tracking
- Engine Detection - Automatic selection of the appropriate inference engine based on model format (see the sketch after this list)
- Memory Optimization - Dynamic configuration based on device capabilities
- Error Handling - Comprehensive exception handling and user feedback
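Format-based engine detection can be as simple as routing on the file extension, which matches the behavior described above (GGUF to the Llama engine, TFLite to the Gemma engine). The enum and function names below are illustrative.

```dart
enum InferenceEngine { llama, gemma }

/// Routes a model file to an inference engine by its extension.
InferenceEngine? detectEngine(String path) {
  final lower = path.toLowerCase();
  if (lower.endsWith('.gguf')) return InferenceEngine.llama;   // llama_cpp_dart
  if (lower.endsWith('.tflite')) return InferenceEngine.gemma; // flutter_gemma
  return null; // unsupported format
}

void main() {
  print(detectEngine('phi-3-mini.Q4_K_M.gguf')); // InferenceEngine.llama
  print(detectEngine('gemma-2b-it-gpu.tflite')); // InferenceEngine.gemma
}
```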
Supported Formats
- GGUF - Quantized models for efficient inference (Q4_K_M, Q4_0, Q5_K_M, Q8_0)
- TFLite - TensorFlow Lite models optimized for mobile devices
- Model Types - Support for Gemma, Llama, Phi, DeepSeek, and custom models
Technical Stack
- Flutter Framework - Cross-platform mobile development
- llama_cpp_dart - Native GGUF model inference with ARM64 optimization
- flutter_gemma - TFLite model inference with GPU acceleration
- Dio HTTP Client - Efficient model downloading and API communication
- Path Provider - Cross-platform file system access and management
Initial Capabilities
- Text Generation - Streaming text generation with customizable parameters (see the sketch at the end of this section)
- Model Discovery - Browse and search thousands of compatible models
- Local Storage - Efficient model caching and management
- Debug Tools - Basic logging and error reporting
- Configuration Management - Flexible inference configuration options
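Finally, a hedged sketch of what consuming the streaming generation API might look like. The `LlmToolkit` entry point, `generateStream` method, and parameter names are assumptions; only the Stream-based, token-by-token shape follows from the streaming capability described above.

```dart
import 'dart:io';

import 'package:llm_toolkit/llm_toolkit.dart';

Future<void> main() async {
  final toolkit = LlmToolkit(); // assumed entry point

  // Assumed method name; customizable parameters per the list above.
  final stream = toolkit.generateStream(
    prompt: 'Explain GGUF quantization in one short paragraph.',
    temperature: 0.7,
    maxTokens: 256,
  );

  await for (final token in stream) {
    stdout.write(token); // tokens arrive incrementally; append to the UI
  }
}
```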