
ollama_embedder

A CLI tool written in Dart for generating text embeddings from files and folders using a local Ollama server.

Features

Generate embeddings for files and directories – recursively walks directories and processes multiple files in a single run.
Work with a local Ollama server – checks installation, server availability and model presence before processing.
Two text‑preprocessing modes – technical (keeps code) and textual (focuses on pure text with [CODE] markers).
Advanced text cleaning – removes HTML noise, cookie banners, navigation, footers, emojis and decorative frames.
Smart chunking – splits long documents into overlapping chunks by paragraphs, sentences and word boundaries.
Robust embedding requests – retries on transient Ollama errors with helpful logging and hints.
Configurable behavior – tune server URL, model, timeouts, max file size, input/output paths and processing mode.
Structured JSON output – emits EmbeddingChunk arrays ready for ingestion into vector databases and RAG systems.
Test‑covered core – chunking, preprocessing and processing pipeline are covered by unit tests.

Installation

  1. Install Dart SDK with a version compatible with pubspec.yaml (currently >=3.1.0 <4.0.0).
  2. Install Ollama (desktop or server):
    • Download it from https://ollama.ai and install.
    • Start the server:
      ollama serve
      
  3. Install the CLI globally from pub.flutter-io.cn:
    dart pub global activate ollama_embedder
    

Quick start

Prerequisites

  • Dart SDK installed (compatible with the version in pubspec.yaml, currently >=3.1.0 <4.0.0).
  • Ollama installed and running:
    • download from https://ollama.ai and install;
    • start the server:
      ollama serve
      
    • pull the embedding model you plan to use (for example):
      ollama pull nomic-embed-text
      

Key CLI options (see also --help):

  • -i, --input (required): file or directory to process.
  • -o, --output: directory where .embedding.json files will be written (by default an embedding_gen subdirectory is used).
  • -u, --url: Ollama server URL (default http://localhost:11434).
  • -m, --model: embedding model name (default nomic-embed-text).
  • --timeout: request timeout in milliseconds (default 60000).
  • -v, --verbose: verbose logging (recommended for large runs to see retries and hints).
  • --mode: text‑processing mode – technical (keeps code) or textual (collapses code into [CODE] markers).

Examples:

ollama_embedder --input source
ollama_embedder -i source -u http://localhost:11434 -m nomic-embed-text
ollama_embedder -i source --verbose --mode textual

How it works

The pipeline at a high level:

  1. Text preprocessing (TextPreprocessor):
    • normalize line breaks and whitespace, remove invisible characters;
    • strip HTML, cookie banners, footers, navigation;
    • replace URL/EMAIL/PATH/ID with markers;
    • carefully handle Markdown headings, lists and code;
    • either preserve or collapse code depending on TextProcessingMode.
  2. Chunking (ChunkingService):
    • if text length ≤ maxSingleChunk (default 3000 chars) – a single chunk;
    • otherwise split by paragraphs/sentences/words with overlaps (overlapChars).
  3. Embedding generation via Ollama (EmbeddingService):
    • POST /api/embeddings with model and prompt (see the sketch after this list);
    • retries with exponential backoff for transient and server‑stability issues;
    • special handling when the model is missing.
  4. Saving results (EmbeddingProcessor):
    • build an array of EmbeddingChunk;
    • generate doc_id from the file’s relative path;
    • write to <original_path>.embedding.json in the output directory.
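
For illustration, step 3 boils down to one HTTP call per chunk. Below is a minimal stand‑alone sketch in Dart using package:http – not the actual EmbeddingService code; the function name, error handling and timeout wiring are illustrative:

import 'dart:convert';

import 'package:http/http.dart' as http;

// Minimal embedding request against a local Ollama server.
// Returns the raw embedding vector for a single chunk of text.
Future<List<double>> embed(String baseUrl, String model, String prompt) async {
  final response = await http
      .post(
        Uri.parse('$baseUrl/api/embeddings'),
        headers: {'Content-Type': 'application/json'},
        body: jsonEncode({'model': model, 'prompt': prompt}),
      )
      .timeout(const Duration(milliseconds: 60000));
  if (response.statusCode != 200) {
    // The real EmbeddingService retries transient errors with backoff here.
    throw http.ClientException(
        'Ollama error ${response.statusCode}: ${response.body}');
  }
  final json = jsonDecode(response.body) as Map<String, dynamic>;
  return (json['embedding'] as List)
      .cast<num>()
      .map((v) => v.toDouble())
      .toList();
}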

Output format

Each processed source file gets a corresponding JSON file with an array of chunks:

[
  {
    "doc_id": "source/test.md",
    "chunk_id": 0,
    "clean_content": "Cleaned single-line chunk text without line breaks...",
    "vector": [0.123, 0.456, "..."],
    "metadata": {
      "source": "source/test.md",
      "section": "full_doc",
      "type": "text",
      "created_at": "2025-01-01T12:00:00.000Z"
    }
  }
]
  • doc_id: relative file path (normalized with / separators).
  • chunk_id: sequential chunk number within a document.
  • clean_content: cleaned text with all line breaks replaced by spaces.
  • vector: embedding vector (size depends on the chosen model).
  • metadata: arbitrary metadata map with basic technical information.

The EmbeddingChunk model is defined in lib/models/embedding_chunk.dart, and the Document model in lib/models/document.dart is convenient for integration with vector databases.
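
For orientation, the on‑disk format maps naturally onto a class along these lines (a simplified sketch based on the JSON above; the real class may carry extra fields and helpers):

// Simplified model mirroring the output format shown above.
class EmbeddingChunk {
  final String docId;
  final int chunkId;
  final String cleanContent;
  final List<double> vector;
  final Map<String, dynamic> metadata;

  const EmbeddingChunk({
    required this.docId,
    required this.chunkId,
    required this.cleanContent,
    required this.vector,
    required this.metadata,
  });

  Map<String, dynamic> toJson() => {
        'doc_id': docId,
        'chunk_id': chunkId,
        'clean_content': cleanContent,
        'vector': vector,
        'metadata': metadata,
      };
}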

Default configuration

  • ollamaUrl: http://localhost:11434;
  • ollamaModel: nomic-embed-text;
  • ollamaTimeoutMs: 60000;
  • embeddingExtension: .embedding.json;
  • maxFileSize: 10 MB (larger files are skipped);
  • defaultOutputSubdir: embedding_gen;
  • defaultTextProcessingMode: technical.

The CliConfig class in lib/config/cli_config.dart combines these values and allows overriding them via CLI arguments.
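
As a rough sketch, the defaults can be pictured like this (field names here are illustrative; consult lib/config/cli_config.dart for the actual definition):

// Defaults mirror the list above; CLI arguments override them.
class CliConfig {
  final String ollamaUrl;
  final String ollamaModel;
  final int ollamaTimeoutMs;
  final String embeddingExtension;
  final int maxFileSizeBytes;
  final String defaultOutputSubdir;

  const CliConfig({
    this.ollamaUrl = 'http://localhost:11434',
    this.ollamaModel = 'nomic-embed-text',
    this.ollamaTimeoutMs = 60000,
    this.embeddingExtension = '.embedding.json',
    this.maxFileSizeBytes = 10 * 1024 * 1024, // 10 MB
    this.defaultOutputSubdir = 'embedding_gen',
  });
}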

Text preprocessing logic

The core logic is implemented in lib/services/text_preprocessor.dart:

  • TextProcessingMode.technical:
    • preserves code blocks and inline code as much as possible;
    • whitespace and line‑break normalization do not break code structure;
    • useful for code‑centric use cases (code search, code‑RAG, hybrid search).
  • TextProcessingMode.textual:
    • collapses code into [CODE] markers;
    • focuses on natural‑language content (documentation, articles, descriptions).

Additionally:

  • HTML tags, comments and <script>/<style> blocks are removed, entities are decoded;
  • cookie banners, footers, navigation blocks and pseudographics are stripped;
  • URLs, e‑mails, paths and long IDs are replaced with [URL], [EMAIL], [PATH], [ID];
  • punctuation noise such as !!!?? is normalized.
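
The marker replacement can be pictured as a chain of regular‑expression substitutions. The patterns below are deliberately simplified and are not the exact ones used by TextPreprocessor:

// Illustrative noise‑marker substitutions (simplified patterns).
String replaceNoiseMarkers(String text) => text
    .replaceAll(RegExp(r'https?://\S+'), '[URL]') // URLs first, before paths
    .replaceAll(RegExp(r'[\w.+-]+@[\w-]+\.[\w.-]+'), '[EMAIL]')
    .replaceAll(RegExp(r'(?:/[\w.-]+){2,}'), '[PATH]')
    .replaceAll(RegExp(r'\b[0-9a-fA-F]{16,}\b'), '[ID]')
    .replaceAll(RegExp(r'[!?]{2,}'), '!'); // collapse noise like !!!??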

Chunking

Chunking is handled by lib/services/chunking_service.dart:

  • maxChars: maximum chunk size (default 1500 characters).
  • overlapChars: overlap size between chunks (default 200 characters).
  • maxSingleChunk: maximum length that is allowed to remain a single chunk (default 3000 characters).
  • Chunks are labeled with sections such as intro, urls, lists, code, auto.

This makes embeddings more robust when searching over text fragments and reduces context loss.
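
A naive character‑level version of the overlap mechanics looks like this (the real ChunkingService additionally respects paragraph, sentence and word boundaries):

import 'dart:math' as math;

// Naive character-based chunking with overlap; the boundary-aware
// splitting in ChunkingService is more careful than this.
List<String> chunkWithOverlap(
  String text, {
  int maxChars = 1500,
  int overlapChars = 200,
  int maxSingleChunk = 3000,
}) {
  if (text.length <= maxSingleChunk) return [text]; // short docs stay whole
  final chunks = <String>[];
  var start = 0;
  while (start < text.length) {
    final end = math.min(start + maxChars, text.length);
    chunks.add(text.substring(start, end));
    if (end == text.length) break;
    start = end - overlapChars; // step back so consecutive chunks overlap
  }
  return chunks;
}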

Skipped files

EmbeddingProcessor intentionally skips:

  • hidden files (starting with .);
  • service files (LICENSE, README.md);
  • already generated .embedding.json files;
  • binary files (.png, .jpg, .pdf, .zip, .exe, .dll, etc.).
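
The skip check amounts to something like the following (names and the exact extension set are illustrative, taken from the list above):

const binaryExtensions = {'.png', '.jpg', '.pdf', '.zip', '.exe', '.dll'};

// Illustrative skip predicate based on the rules listed above.
bool shouldSkip(String relativePath) {
  final name = relativePath.split('/').last;
  if (name.startsWith('.')) return true; // hidden files
  if (name == 'LICENSE' || name == 'README.md') return true; // service files
  if (name.endsWith('.embedding.json')) return true; // already generated output
  final dot = name.lastIndexOf('.');
  final ext = dot == -1 ? '' : name.substring(dot).toLowerCase();
  return binaryExtensions.contains(ext);
}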

Test coverage

The project has a solid automated test suite with overall line coverage around 78% across all core components:

  • text_preprocessor.dart (≈85%) – both technical and textual modes, cleaning rules and markers.
  • chunking_service.dart (≈83%) – chunk boundaries, overlaps and section labelling.
  • embedding_processor.dart (≈75%) – file traversal, skipping logic and output structure.
  • embedding_chunk.dart (100%) – model structure and JSON (de)serialization.

Test categories

  • Text preprocessing: normalization, HTML/boilerplate removal, URL/EMAIL/PATH/ID markers, two processing modes.
  • Chunking logic: single vs multi‑chunk documents, overlaps, section tags (intro, urls, lists, code, auto).
  • Embedding pipeline: correct skipping of files, doc_id calculation, output file naming and locations.
  • Models & serialization: EmbeddingChunk and related models, JSON input/output stability.
  • Edge cases: very small and very large documents, empty/near‑empty content, service and binary files.

It is recommended to run the test suite after any changes to preprocessing, chunking logic or configuration, or the output format.

Limitations and recommendations

  • Make sure Ollama is running and the model is pulled: ollama pull <model>.
  • For large corpora, monitor Ollama server load (use --verbose to see retries and hints).
  • Avoid changing the .embedding.json format if external systems (vector DB, RAG service, etc.) already depend on it.