llm_toolkit 0.0.3

llm_toolkit: ^0.0.3

A comprehensive Flutter SDK for running Large Language Models (LLMs) locally on mobile and desktop devices. Supports multiple inference engines including Gemma (TFLite) and Llama (GGUF) with integrated model discovery, download, and chat capabilities.

LLM Toolkit SDK Changelog

0.0.3 (July 1, 2025)

🚀 New Features

  • RAG (Retrieval-Augmented Generation) Architecture - Complete modular RAG system with engine-agnostic design
  • Multi-Engine RAG Support - Extensible architecture supporting GGUF, GGML, TFLite, and future inference engines
  • Document Management System - Add, remove, and manage documents with automatic chunking and embedding generation
  • Intelligent Document Chunking - Multiple chunking strategies (sentence, paragraph, fixed-size) with configurable overlap
  • Semantic Search - Cosine similarity-based document retrieval with confidence scoring
  • Context-Aware Generation - Generate answers using relevant document chunks as context
  • Llama RAG Engine - First implementation using llama_cpp_dart for both embeddings and text generation (see the usage sketch after this list)
  • Flexible Configuration - Customizable similarity thresholds, chunk sizes, and generation parameters
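
The snippet below sketches how these pieces could fit together end to end. The class and method names (LlamaRagEngine, addDocument, query) follow the components named in this changelog, but every constructor parameter and signature shown is an assumption; consult the API reference for the real surface.

```dart
// Hypothetical end-to-end RAG flow. The names follow this changelog,
// but every constructor parameter and signature shown is an assumption.
import 'package:llm_toolkit/llm_toolkit.dart';

Future<void> ragExample() async {
  // LlamaRagEngine uses llama_cpp_dart for both embeddings and generation.
  final engine = LlamaRagEngine(
    modelPath: '/models/llama.Q4_K_M.gguf',      // assumed parameter
    embeddingModelPath: '/models/embedder.gguf', // assumed parameter
  );

  // Documents are chunked and embedded automatically when added.
  await engine.addDocument(
    id: 'doc-1',
    text: 'Flutter is a UI toolkit for building natively compiled apps.',
  );

  // Retrieval and context-aware generation in one call.
  final answer = await engine.query('What is Flutter?');
  print(answer);
}
```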

🔧 RAG System Components

  • RagEngine Interface - Abstract base class for implementing different RAG backends
  • DocumentChunk Model - Structured document representation with metadata and relevance scoring
  • TextChunker Utility - Advanced text splitting with sentence-aware chunking and overlap management (a chunking sketch follows this list)
  • SimilarityCalculator - Multiple similarity metrics (cosine, euclidean) for document matching
  • RagService Integration - Seamless integration with existing LLM service architecture
  • Batch Processing - Efficient batch embedding generation for large document collections
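
As a concrete illustration of the chunking strategies above, here is a self-contained fixed-size chunker with overlap in plain Dart. It is a sketch of the technique, not the shipped TextChunker code.

```dart
import 'dart:math' as math;

/// Illustrative fixed-size chunker with character overlap. The shipped
/// TextChunker also offers sentence- and paragraph-aware strategies.
List<String> chunkFixedSize(String text,
    {int chunkSize = 512, int overlap = 64}) {
  assert(overlap < chunkSize, 'overlap must be smaller than chunkSize');
  final chunks = <String>[];
  var start = 0;
  while (start < text.length) {
    final end = math.min(start + chunkSize, text.length);
    chunks.add(text.substring(start, end));
    if (end == text.length) break;
    start = end - overlap; // step back so consecutive chunks share context
  }
  return chunks;
}
```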

🧠 RAG Features

  • Document Metadata Support - Rich metadata storage and filtering capabilities
  • Multi-Document Support - Handle multiple documents with unique identification
  • Relevance Scoring - Confidence-based ranking of retrieved document chunks (the similarity sketch after this list shows the underlying score)
  • Dynamic Context Building - Intelligent context assembly from multiple relevant chunks
  • Memory-Efficient Processing - Optimized for mobile devices with limited memory
  • Streaming Integration - Compatible with existing streaming text generation
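
Relevance scoring rests on cosine similarity between the query embedding and each chunk embedding, cos(a, b) = a·b / (‖a‖ ‖b‖). The plain-Dart function below computes that score; it is independent of the actual SimilarityCalculator API.

```dart
import 'dart:math' as math;

/// Cosine similarity between two equal-length embedding vectors.
/// Returns a score in [-1, 1]; higher means more semantically similar.
double cosineSimilarity(List<double> a, List<double> b) {
  assert(a.length == b.length, 'embeddings must have the same dimension');
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0; // guard against zero vectors
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}
```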

🔧 Technical Improvements

  • Modular Architecture - Clean separation between RAG core, engines, and utilities
  • Engine Abstraction - Easy addition of new inference engines (TFLite, ONNX, etc.)
  • Error Handling - Comprehensive error management for document processing and retrieval
  • Performance Optimization - Efficient embedding caching and similarity computation (a caching sketch follows this list)
  • Memory Management - Smart memory usage for large document collections
  • Async Operations - Non-blocking document processing and query handling
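
The changelog does not spell out the caching mechanism; one plausible shape is an in-memory cache keyed by a content hash, so identical text never triggers a second embedding call. A sketch using package:crypto, which is already in this package's dependency list:

```dart
import 'dart:convert' show utf8;
import 'package:crypto/crypto.dart' show sha256;

/// Illustrative in-memory embedding cache keyed by a SHA-256 content hash,
/// so re-embedding identical text never calls the model twice. How
/// llm_toolkit actually caches embeddings is not specified here.
class EmbeddingCache {
  final _cache = <String, List<double>>{};

  Future<List<double>> getOrCompute(
    String text,
    Future<List<double>> Function(String) embed,
  ) async {
    final key = sha256.convert(utf8.encode(text)).toString();
    return _cache[key] ??= await embed(text);
  }
}
```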

📚 RAG Configuration Options

  • Chunking Configuration - Customizable chunk size, overlap, and splitting strategies (see the configuration sketch after this list)
  • Retrieval Parameters - Adjustable similarity thresholds and maximum relevant chunks
  • Generation Settings - Temperature, max tokens, and system prompt customization
  • Engine-Specific Settings - Tailored configurations for different inference engines
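
Grouped together, those options suggest a configuration object roughly like the sketch below. All field names and defaults here are illustrative assumptions, not the package's actual classes.

```dart
/// Hypothetical shape for the options above; all field names and
/// defaults are illustrative assumptions, not the package's real API.
class RagConfig {
  // Chunking configuration
  final int chunkSize;    // characters (or tokens) per chunk
  final int chunkOverlap; // span shared by consecutive chunks
  // Retrieval parameters
  final double similarityThreshold; // minimum cosine score to keep a chunk
  final int maxRelevantChunks;      // top-k chunks passed as context
  // Generation settings
  final double temperature;
  final int maxTokens;
  final String systemPrompt;

  const RagConfig({
    this.chunkSize = 512,
    this.chunkOverlap = 64,
    this.similarityThreshold = 0.6,
    this.maxRelevantChunks = 4,
    this.temperature = 0.7,
    this.maxTokens = 512,
    this.systemPrompt = 'Answer using only the provided context.',
  });
}
```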

πŸ› οΈ Implementation Details #

  • LlamaRagEngine - Complete implementation using llama_cpp_dart for embeddings and generation
  • Document Processing Pipeline - Automated chunking, embedding, and indexing workflow
  • Query Processing - Efficient query embedding and similarity search
  • Context Assembly - Smart context building with relevance-based chunk selection (sketched after this list)
  • Response Generation - Context-aware answer generation with source attribution
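
Context assembly here means ranking chunks by similarity, keeping the top-k above a threshold, and concatenating them with source attribution. A minimal sketch under those assumptions (the chunk record is simplified relative to DocumentChunk):

```dart
/// Simplified chunk record for this sketch; the real DocumentChunk
/// model carries richer metadata.
class ScoredChunk {
  final String text;
  final String sourceId;
  final double score;
  const ScoredChunk(this.text, this.sourceId, this.score);
}

/// Rank chunks by similarity, keep the top-k above the threshold, and
/// join them with source attribution, ready to prepend to the question.
String buildContext(
  List<ScoredChunk> chunks, {
  double threshold = 0.6,
  int maxChunks = 4,
}) {
  final relevant = chunks.where((c) => c.score >= threshold).toList()
    ..sort((a, b) => b.score.compareTo(a.score));
  return relevant
      .take(maxChunks)
      .map((c) => '[source: ${c.sourceId}]\n${c.text}')
      .join('\n\n');
}
```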

📚 Documentation Updates

  • RAG Architecture Guide - Comprehensive documentation for implementing RAG systems
  • Engine Integration Tutorial - Step-by-step guide for adding new inference engines
  • Document Processing Best Practices - Guidelines for optimal chunking and embedding strategies
  • Performance Optimization - RAG-specific performance tuning recommendations
  • API Reference - Complete API documentation for RAG components and utilities

🎯 Future RAG Roadmap

  • Vector Database Integration - Support for external vector databases (Pinecone, Weaviate)
  • Advanced Retrieval Strategies - Hybrid search, re-ranking, and query expansion
  • Multi-Modal RAG - Support for images, PDFs, and other document formats
  • Distributed RAG - Cloud-based document processing and retrieval
  • RAG Analytics - Performance metrics and retrieval quality analysis

0.0.2 (June 29, 2025)

🚀 New Features

  • Enhanced Model Browser with tabbed interface (Search, Recommended, Downloaded)
  • File Browser Integration - Load models directly from device storage
  • Device-Specific Recommendations - Memory-aware model suggestions based on device capabilities
  • Advanced Debug Console - Real-time logging with color-coded categories and filtering
  • Improved Model Cards - Better file organization, download status, and compatibility indicators
  • Native Library Validation - Automatic detection of llama.cpp compatibility and health checks
  • Progressive Model Loading - Fallback configurations for memory-constrained devices (see the fallback sketch after this list)
  • Enhanced Error Handling - Detailed error messages with troubleshooting tips and recovery suggestions
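
Progressive loading with fallbacks can be pictured as retrying the load with progressively smaller context sizes until one succeeds. The callback and sizes below are illustrative, not the package's internals.

```dart
/// Illustrative fallback loop: retry the load with progressively smaller
/// context windows until one fits in memory. `loadModel` stands in for
/// whatever engine-specific call actually performs the load.
Future<T> loadWithFallback<T>(
  Future<T> Function(int contextSize) loadModel, {
  List<int> contextSizes = const [4096, 2048, 1024, 512],
}) async {
  Object? lastError;
  for (final size in contextSizes) {
    try {
      return await loadModel(size);
    } catch (e) {
      lastError = e; // allocation failed: try a smaller configuration
    }
  }
  throw StateError('All fallback configurations failed: $lastError');
}
```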

🔧 Improvements

  • Quantization Safety Checks - Automatic detection and warnings for unstable quantizations (Q2_K, IQ1_S, IQ1_M); a sketch of such a check follows this list
  • Memory Management - Dynamic context size adjustment based on available device memory
  • Download Progress - Real-time progress indicators with speed metrics and ETA
  • Model Detection - More accurate engine detection for GGUF and TFLite files
  • UI/UX Enhancements - Material Design 3 with gradient themes, better accessibility, and responsive design
  • Performance Monitoring - Token generation speed tracking, memory usage reporting, and performance analytics
  • Search Optimization - Better model discovery with multiple search strategies and relevance ranking
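
A quantization safety check can be as simple as matching a model file name against the unstable tags listed above. The helper below is a rough, name-based sketch; the real detector may inspect GGUF metadata instead.

```dart
/// Quantization tags this changelog flags as unstable on Android ARM64.
const unstableQuantizations = ['Q2_K', 'IQ1_S', 'IQ1_M'];

/// Rough, name-based heuristic (e.g. 'llama.Q2_K.gguf' is rejected);
/// the real detector may inspect GGUF metadata rather than the name.
bool isQuantizationSafe(String modelFileName) {
  final upper = modelFileName.toUpperCase();
  return !unstableQuantizations.any(upper.contains);
}
```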

πŸ› Bug Fixes #

  • Fixed SIGSEGV crashes with problematic quantizations on Android ARM64 devices
  • Resolved memory leaks during model loading/unloading cycles
  • Fixed download interruption handling and resume functionality
  • Corrected model file validation for corrupted GGUF files
  • Fixed context size calculation for low-memory devices
  • Resolved engine detection issues for certain model formats
  • Fixed UI state management during concurrent operations

📚 Documentation

  • Added comprehensive model compatibility guide with device-specific recommendations
  • Enhanced troubleshooting documentation for Android devices with common solutions
  • Updated API documentation with new configuration options and examples
  • Added performance optimization guidelines for mobile deployment

πŸ› οΈ Technical Improvements #

  • Isolate-based Model Loading - Better crash protection and memory isolation (sketched after this list)
  • Enhanced Logging System - Categorized debug logs with filtering and export capabilities
  • Improved Error Recovery - Graceful fallback mechanisms for failed operations
  • Better Resource Management - Optimized memory usage and cleanup procedures
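
Isolate-based loading keeps the heavy, blocking work off the main isolate so the UI stays responsive and the load gets its own heap. In modern Dart that is close to a one-liner with Isolate.run; parseAndLoadModel below is a placeholder, not a real API.

```dart
import 'dart:isolate';

/// Placeholder for the blocking native work a real model load performs.
String parseAndLoadModel(String path) => 'loaded:$path';

/// Run the load on a short-lived background isolate so the UI isolate
/// stays responsive and the load gets its own heap.
Future<String> loadModelInIsolate(String path) =>
    Isolate.run(() => parseAndLoadModel(path));
```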

0.0.1 (June 27, 2025)

🎉 Initial Release

  • Dual Engine Support - Llama (GGUF) and Gemma (TFLite) inference engines
  • HuggingFace Integration - Direct model search and download from HuggingFace Hub
  • Model Management - Automatic model detection and configuration
  • Chat Interface - Real-time streaming text generation with conversation history
  • Cross-Platform - Support for Android and iOS devices

🔧 Core Features

  • Model Search - Search and filter models by format, size, and compatibility
  • Automatic Downloads - Background model downloading with progress tracking
  • Engine Detection - Automatic selection of the appropriate inference engine based on model format (see the sketch after this list)
  • Memory Optimization - Dynamic configuration based on device capabilities
  • Error Handling - Comprehensive exception handling and user feedback
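
At its simplest, engine detection keys off the model file's format: .gguf routes to the Llama engine, .tflite to the Gemma engine. A sketch (the enum is illustrative, and the real detection validates file contents rather than trusting the extension):

```dart
/// Illustrative engine selector; the enum name is assumed, and the real
/// detection validates file contents rather than trusting the extension.
enum InferenceEngine { llama, gemma }

InferenceEngine? detectEngine(String modelPath) {
  final lower = modelPath.toLowerCase();
  if (lower.endsWith('.gguf')) return InferenceEngine.llama; // llama_cpp_dart
  if (lower.endsWith('.tflite')) return InferenceEngine.gemma; // flutter_gemma
  return null; // unknown format: let the caller surface an error
}
```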

📱 Supported Formats

  • GGUF - Quantized models for efficient inference (Q4_K_M, Q4_0, Q5_K_M, Q8_0)
  • TFLite - TensorFlow Lite models optimized for mobile devices
  • Model Types - Support for Gemma, Llama, Phi, DeepSeek, and custom models

πŸ› οΈ Technical Stack #

  • Flutter Framework - Cross-platform mobile development
  • llama_cpp_dart - Native GGUF model inference with ARM64 optimization
  • flutter_gemma - TFLite model inference with GPU acceleration
  • Dio HTTP Client - Efficient model downloading and API communication
  • Path Provider - Cross-platform file system access and management

🎯 Initial Capabilities

  • Text Generation - Streaming text generation with customizable parameters (a streaming sketch follows this list)
  • Model Discovery - Browse and search thousands of compatible models
  • Local Storage - Efficient model caching and management
  • Debug Tools - Basic logging and error reporting
  • Configuration Management - Flexible inference configuration options
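
Streaming generation surfaces tokens as a Dart Stream so the UI can render text as it arrives. The generateStream call below is an assumed API shape, not a documented signature.

```dart
// Hypothetical streaming consumption; `generateStream` is an assumed
// API shape, not a documented signature.
Future<void> chat(dynamic engine, String prompt) async {
  final buffer = StringBuffer();
  await for (final String token in engine.generateStream(prompt)) {
    buffer.write(token); // append each token as it arrives
    print(token);        // a real app would update a widget instead
  }
  print('Full response: $buffer');
}
```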

License

MIT

Dependencies

collection, crypto, dio, flutter, flutter_gemma, http, json_annotation, llama_cpp_dart, objectbox, objectbox_flutter_libs, path, path_provider, tflite_flutter
