# ollama_embedder
A CLI tool written in Dart for generating text embeddings from files and folders using a local Ollama server.
## Features
✅ Generate embeddings for files and directories – recursively walks directories and processes multiple files in a single run.
✅ Work with a local Ollama server – checks installation, server availability and model presence before processing.
✅ Two text‑preprocessing modes – technical (keeps code) and textual (focuses on pure text with [CODE] markers).
✅ Advanced text cleaning – removes HTML noise, cookie banners, navigation, footers, emojis and decorative frames.
✅ Smart chunking – splits long documents into overlapping chunks by paragraphs, sentences and word boundaries.
✅ Robust embedding requests – retries on transient Ollama errors with helpful logging and hints.
✅ Configurable behavior – tune server URL, model, timeouts, max file size, input/output paths and processing mode.
✅ Structured JSON output – emits EmbeddingChunk arrays ready for ingestion into vector databases and RAG systems.
✅ Test‑covered core – chunking, preprocessing and processing pipeline are covered by unit tests.
## Installation
- Install the Dart SDK with a version compatible with `pubspec.yaml` (currently `>=3.1.0 <4.0.0`).
- Install Ollama (desktop or server):
  - download it from https://ollama.ai and install;
  - start the server: `ollama serve`.
- Install the CLI globally from pub.flutter-io.cn:
  `dart pub global activate ollama_embedder`
## Quick start
### Prerequisites
- Dart SDK installed (compatible with the constraint in `pubspec.yaml`, currently `>=3.1.0 <4.0.0`).
- Ollama installed and running:
  - download it from https://ollama.ai and install;
  - start the server: `ollama serve`;
  - pull the embedding model you plan to use, for example: `ollama pull nomic-embed-text`.
Key CLI options (see also `--help`):
- `-i, --input` (required): file or directory to process.
- `-o, --output`: directory where `.embedding.json` files will be written (by default a subdirectory like `embedding_gen` is used).
- `-u, --url`: Ollama server URL (default `http://localhost:11434`).
- `-m, --model`: embedding model name (default `nomic-embed-text`).
- `--timeout`: request timeout in milliseconds (default `60000`).
- `-v, --verbose`: verbose logging (recommended for production to see retries and hints).
- `--mode`: text-processing mode, either `technical` (keeps code) or `textual` (collapses code into `[CODE]` markers).
Examples:
```
ollama_embedder --input source
ollama_embedder -i source -u http://localhost:11434 -m nomic-embed-text
ollama_embedder -i source --verbose --mode textual
```
## How it works
The pipeline at a high level:
- Text preprocessing (`TextPreprocessor`):
  - normalize line breaks and whitespace, remove invisible characters;
  - strip HTML, cookie banners, footers, navigation;
  - replace URLs, e-mails, paths and IDs with markers;
  - carefully handle Markdown headings, lists and code;
  - either preserve or collapse code depending on `TextProcessingMode`.
- Chunking (`ChunkingService`):
  - if the text length is ≤ `maxSingleChunk` (default 3000 chars), keep it as a single chunk;
  - otherwise split by paragraphs/sentences/words with overlaps (`overlapChars`).
- Embedding generation via Ollama (`EmbeddingService`), sketched below:
  - `POST /api/embeddings` with `model` and `prompt`;
  - retries with exponential backoff on transient and server-stability issues;
  - special handling when the model is missing.
- Saving results (`EmbeddingProcessor`):
  - build an array of `EmbeddingChunk`;
  - generate `doc_id` from the file's relative path;
  - write to `<original_path>.embedding.json` in the output directory.
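The HTTP call in the third step is a plain JSON POST. A minimal sketch in Dart, assuming a local server and the default model; the real `EmbeddingService` wraps this with retries, backoff and model-missing hints:

```dart
import 'dart:convert';
import 'dart:io';

// Minimal sketch of the Ollama embeddings call using only dart:io.
Future<List<double>> embed(String prompt,
    {String model = 'nomic-embed-text'}) async {
  final client = HttpClient();
  try {
    final request = await client
        .postUrl(Uri.parse('http://localhost:11434/api/embeddings'));
    request.headers.contentType = ContentType.json;
    request.write(jsonEncode({'model': model, 'prompt': prompt}));
    final response = await request.close();
    final body = await response.transform(utf8.decoder).join();
    // Ollama responds with {"embedding": [0.123, 0.456, ...]}.
    return (jsonDecode(body)['embedding'] as List)
        .map((v) => (v as num).toDouble())
        .toList();
  } finally {
    client.close(force: true);
  }
}
```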
## Output format
Each processed source file gets a corresponding JSON file with an array of chunks:
```json
[
  {
    "doc_id": "source/test.md",
    "chunk_id": 0,
    "clean_content": "Cleaned single-line chunk text without line breaks...",
    "vector": [0.123, 0.456, "..."],
    "metadata": {
      "source": "source/test.md",
      "section": "full_doc",
      "type": "text",
      "created_at": "2025-01-01T12:00:00.000Z"
    }
  }
]
```
- `doc_id`: relative file path (normalized with `/` separators).
- `chunk_id`: sequential chunk number within a document.
- `clean_content`: cleaned text with all line breaks replaced by spaces.
- `vector`: embedding vector (size depends on the chosen model).
- `metadata`: arbitrary metadata map with basic technical information.
The `EmbeddingChunk` model is defined in `lib/models/embedding_chunk.dart`, and the `Document` model in `lib/models/document.dart` is convenient for integration with vector databases.
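If you consume these files from another Dart program, deserialization is straightforward. A minimal reader sketch; the `Chunk` class below is illustrative and follows the JSON fields documented above, not the package's own `EmbeddingChunk` API:

```dart
import 'dart:convert';
import 'dart:io';

// Illustrative record for one chunk; field names follow the JSON format.
class Chunk {
  Chunk(this.docId, this.chunkId, this.text, this.vector);
  final String docId;
  final int chunkId;
  final String text;
  final List<double> vector;
}

List<Chunk> readEmbeddings(String path) {
  final raw = jsonDecode(File(path).readAsStringSync()) as List;
  return [
    for (final e in raw.cast<Map<String, dynamic>>())
      Chunk(
        e['doc_id'] as String,
        e['chunk_id'] as int,
        e['clean_content'] as String,
        (e['vector'] as List).map((v) => (v as num).toDouble()).toList(),
      ),
  ];
}
```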
## Default Configuration
- `ollamaUrl`: `http://localhost:11434`
- `ollamaModel`: `nomic-embed-text`
- `ollamaTimeoutMs`: `60000`
- `embeddingExtension`: `.embedding.json`
- `maxFileSize`: 10 MB (larger files are skipped)
- `defaultOutputSubdir`: `embedding_gen`
- `defaultTextProcessingMode`: `technical`
The `CliConfig` class in `lib/config/cli_config.dart` combines these values and allows overriding them via CLI arguments.
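For example, a single run that overrides several of these defaults through the documented flags (the input and output paths are placeholders):

```
ollama_embedder -i docs -o embeddings_out --timeout 120000 --mode textual
```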
## Text preprocessing logic
The core logic is implemented in `lib/services/text_preprocessor.dart`:
- `TextProcessingMode.technical`:
  - preserves code blocks and inline code as much as possible;
  - whitespace and line-break normalization do not break code structure;
  - useful for code-centric use cases (code search, code-RAG, hybrid search).
- `TextProcessingMode.textual`:
  - collapses code into `[CODE]` markers;
  - focuses on natural-language content (documentation, articles, descriptions).

Additionally:

- HTML tags, comments, `<script>`/`<style>` blocks and entities are removed or decoded;
- cookie banners, footers, navigation blocks and pseudographics are stripped;
- URLs, e-mails, paths and long IDs are replaced with `[URL]`, `[EMAIL]`, `[PATH]`, `[ID]`;
- punctuation noise such as `!!!??` is normalized.
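As a rough before/after illustration of `textual` mode (the input line is invented and the exact output depends on the cleaning rules; only the documented marker substitutions are shown):

```
Input:  See https://example.com/docs or mail dev@example.com!!!
Output: See [URL] or mail [EMAIL]!
```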
## Chunking
Chunking is handled by `lib/services/chunking_service.dart`:
- `maxChars`: maximum chunk size (default `1500` characters).
- `overlapChars`: overlap size between chunks (default `200` characters).
- `maxSingleChunk`: maximum length that is allowed to remain a single chunk (default `3000` characters).
- Chunks are labeled with sections such as `intro`, `urls`, `lists`, `code`, `auto`.
This makes embeddings more robust when searching over text fragments and reduces context loss.
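For intuition, a simplified sliding-window version of the splitting step; the real `ChunkingService` additionally respects paragraph, sentence and word boundaries and assigns the section labels listed above:

```dart
import 'dart:math';

// Simplified overlapping chunker illustrating maxChars, overlapChars and
// maxSingleChunk; not the boundary-aware logic of ChunkingService.
List<String> chunkText(String text,
    {int maxChars = 1500, int overlapChars = 200, int maxSingleChunk = 3000}) {
  if (text.length <= maxSingleChunk) return [text];
  final chunks = <String>[];
  var start = 0;
  while (start < text.length) {
    final end = min(start + maxChars, text.length);
    chunks.add(text.substring(start, end));
    if (end == text.length) break;
    start = end - overlapChars; // consecutive chunks share overlapChars chars
  }
  return chunks;
}
```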
## Skipped files
`EmbeddingProcessor` intentionally skips (see the sketch after this list):
- hidden files (starting with `.`);
- service files (`LICENSE`, `README.md`);
- already generated `.embedding.json` files;
- binary files (`.png`, `.jpg`, `.pdf`, `.zip`, `.exe`, `.dll`, etc.).
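A compact restatement of these rules as a Dart predicate; illustrative only, the actual logic lives in `lib/services/embedding_processor.dart`:

```dart
// Illustrative skip rules mirroring the list above.
const binaryExtensions = {'.png', '.jpg', '.pdf', '.zip', '.exe', '.dll'};

bool shouldSkip(String fileName) {
  if (fileName.startsWith('.')) return true; // hidden files
  if (fileName == 'LICENSE' || fileName == 'README.md') return true;
  if (fileName.endsWith('.embedding.json')) return true; // already generated
  final dot = fileName.lastIndexOf('.');
  return dot >= 0 && binaryExtensions.contains(fileName.substring(dot));
}
```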
## Test coverage
The project has a solid automated test suite with overall line coverage around 78% across all core components:
| File | Coverage |
|---|---|
| `text_preprocessor.dart` | ≈85% – both technical and textual modes, cleaning rules and markers |
| `chunking_service.dart` | ≈83% – chunk boundaries, overlaps and section labelling |
| `embedding_processor.dart` | ≈75% – file traversal, skipping logic and output structure |
| `embedding_chunk.dart` | 100% – model structure and JSON (de)serialization |
### Test categories
✅ Text preprocessing: normalization, HTML/boilerplate removal, URL/EMAIL/PATH/ID markers, two processing modes.
✅ Chunking logic: single vs multi‑chunk documents, overlaps, section tags (intro, urls, lists, code, auto).
✅ Embedding pipeline: correct skipping of files, doc_id calculation, output file naming and locations.
✅ Models & serialization: EmbeddingChunk and related models, JSON input/output stability.
✅ Edge cases: very small and very large documents, empty/near‑empty content, service and binary files.
It is recommended to run the test suite after any change to the preprocessing, chunking, or output-format logic.
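From the package root, the standard Dart tooling runs the suite:

```
dart test
```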
## Limitations and recommendations
- Make sure Ollama is running and the model is pulled: `ollama pull <model>`.
- For large corpora, monitor Ollama server load (use `--verbose` to see retries and hints).
- Avoid changing the `.embedding.json` format if external systems (vector DB, RAG service, etc.) already depend on it.
## Libraries
- `config/cli_config`
- `config/default_config`
- `l10n/messages`
- `models/document`
- `models/embedding_chunk`
- `services/chunking_service` – service for splitting text into chunks for embedding
- `services/embedding_processor`
- `services/embedding_service`
- `services/ollama_checker`
- `services/text_preprocessor` – service for cleaning and preparing text for vectorization