pdf_text_extraction 2.1.0

pdf_text_extraction #

Bindings and convenience wrappers around a fork of xpdf for extracting text and metadata from PDF files in Dart. The native libraries are available for Linux and Windows only.

ℹ️ The project depends on a fork of xpdf maintained at https://github.com/insinfo/xpdf.

Platform requirements #

  • Windows: ship the compiled pdftotext.dll and TextExtraction.dll alongside your executable.
  • Linux: ensure the GNU C++ runtime (libstdc++6) is installed before using the package. On Debian or Ubuntu:
sudo apt-get install libstdc++6
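
The requirement that the native library sits next to your program can be checked before anything is loaded. The following is a minimal sketch, not part of the package: nativeLibraryName and findNativeLibrary are hypothetical helpers, and the file names pdftotext.dll and pdftotext.so are the ones used in the low-level example of this README.

```dart
import 'dart:io';

/// Hypothetical helper: returns the library file name used in this README's
/// low-level example for the given platform.
String nativeLibraryName({required bool isWindows}) =>
    isWindows ? 'pdftotext.dll' : 'pdftotext.so';

/// Hypothetical helper: looks for the library in [directory] and returns its
/// full path, or null when the file is missing.
String? findNativeLibrary(Directory directory, {required bool isWindows}) {
  final separator = Platform.pathSeparator;
  final candidate = File(
      '${directory.path}$separator${nativeLibraryName(isWindows: isWindows)}');
  return candidate.existsSync() ? candidate.path : null;
}

void main() {
  final found =
      findNativeLibrary(Directory.current, isWindows: Platform.isWindows);
  if (found == null) {
    stderr.writeln('Native library not found in ${Directory.current.path}');
  } else {
    print('Native library found at $found');
  }
}
```

Failing early with a clear message like this is usually easier to diagnose than the error raised when DynamicLibrary.open cannot find the file.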

Getting started #

Add the package as a dependency and ensure the native libraries are available on the execution path or in the working directory. Two APIs are exposed:

  1. Low-level bindings generated by package:ffigen, mirroring the C API.
  2. High-level wrappers that take care of memory management and validation.

Low-level usage #

import 'dart:io' show Platform, Directory;
import 'package:ffi/ffi.dart';
import 'dart:ffi';
import 'package:path/path.dart' as path;
import 'package:pdf_text_extraction/pdf_text_extraction.dart';
import 'package:pdf_text_extraction/src/pdf_to_text_bindings.dart';

void logCallback(Pointer<Int8> msg) {
  print(nativeInt8ToString(msg));
}

void main() {
  var libraryPath = path.join(Directory.current.path, 'pdftotext.dll');
  if (Platform.isLinux) {
    libraryPath = path.join(Directory.current.path, 'pdftotext.so');
  }

  final dylib = DynamicLibrary.open(libraryPath);
  var pdfLib = PDFToTextBindings(dylib);
  // Input PDF file.
  var uriPointer = stringToNativeInt8('pdf_file.pdf', allocator: calloc);
  // Character encoding of the extracted text.
  var textOutEnc = stringToNativeInt8('UTF-8', allocator: calloc);
  var layout = stringToNativeInt8('rawOrder', allocator: calloc);
  // Native callback used to print log messages.
  var lgf = Pointer.fromFunction<Void Function(Pointer<Int8>)>(logCallback);

  Pointer<Pointer<Int8>> textOut = calloc();

  var result = pdfLib.extractText(
      uriPointer, 1, 1, textOutEnc, layout, textOut, lgf, nullptr, nullptr);

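  // Convert the native buffer to a Dart string only when extraction succeeded.
  final textResult = result == 0 ? nativeInt8ToString(textOut.value) : null;

  // Release every buffer allocated above, including the layout string.
  calloc.free(uriPointer);
  calloc.free(textOutEnc);
  calloc.free(layout);
  calloc.free(textOut);

  if (result == 0) {
    print('result ok: $textResult');
  } else {
    print('error during text extraction');
  }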
}

High-level usage #

import 'package:pdf_text_extraction/pdf_text_extraction.dart';

void main() {
  final wrapper = PDFToTextWrapping();
  final text = wrapper.extractText(
    'pdf_file.pdf',
    startPage: 1,
    endPage: 1,
  );
  print('result: $text');
}

PDFToTextWrapping also exposes getPagesCount and reports any native errors through the static lastError property.

Managing isolate contention #

When you need to invoke PDFToTextWrapping from multiple isolates, always go through PDFToTextWrappingService. The service coordinates access using a filesystem mutex and prevents the native library from being torn down while another isolate is still working with it.

Running concurrent isolates without the service typically leads to native crashes similar to:

PS C:\MyDartProjects\pdf_text_extraction> dart .\example\pdf_to_text_isolate_example.dart

===== CRASH =====
ExceptionCode=-1073741819, ExceptionFlags=0, ExceptionAddress=00007FFEFED58D5D
...
pc 0x000001780444b055 fp 0x000000426c8feb78 sp 0x000000426c8feab0 [Unoptimized] PDFToTextWrapping.getPagesCount

Using the service restores stability:

PS C:\MyDartProjects\pdf_text_extraction> dart .\example\pdf_to_text_isolate_example.dart
Isolate 0 extracted 1842 characters from the PDF.
Isolate 3 extracted 1842 characters from the PDF.
Isolate 1 extracted 1842 characters from the PDF.
Isolate 2 extracted 1842 characters from the PDF.
All isolates finished without fatal contention.

In real code, wire it up like this:

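import 'package:pdf_text_extraction/pdf_text_extraction.dart';

// Assumes the service is exported from the package's main library.
Future<void> main() async {
  final service = PDFToTextWrappingService();
  await service.run((wrapper) {
    final pages = wrapper.getPagesCount('document.pdf');
    final text = wrapper.extractText('document.pdf', endPage: pages > 0 ? 1 : 0);
    print(text);
  });
}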

See example/pdf_to_text_isolate_example.dart for a complete runnable sample.
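
To illustrate the general idea of a filesystem mutex, here is a minimal sketch. It is not the implementation used by PDFToTextWrappingService: withLockFile, its parameters, and the lock file name are hypothetical. It acquires the lock by creating a file exclusively, which assumes a Dart SDK recent enough to support the exclusive flag on File.create.

```dart
import 'dart:io';

/// Hypothetical helper: runs [action] while holding a lock file at [lockPath].
/// The lock is acquired by creating the file exclusively, which fails when
/// another isolate or process has already created it.
Future<T> withLockFile<T>(
  String lockPath,
  Future<T> Function() action, {
  Duration retryDelay = const Duration(milliseconds: 20),
  int maxAttempts = 500,
}) async {
  final lockFile = File(lockPath);
  var acquired = false;
  for (var attempt = 0; attempt < maxAttempts && !acquired; attempt++) {
    try {
      // Assumption: the SDK supports the `exclusive` flag on File.create.
      await lockFile.create(exclusive: true);
      acquired = true;
    } on FileSystemException {
      // Most likely another holder owns the lock; wait and retry.
      await Future<void>.delayed(retryDelay);
    }
  }
  if (!acquired) {
    throw StateError('Could not acquire $lockPath');
  }
  try {
    return await action();
  } finally {
    // Release the lock so the next waiter can proceed.
    await lockFile.delete();
  }
}

Future<void> main() async {
  final dir = await Directory.systemTemp.createTemp('pdf_lock_demo');
  final lockPath = '${dir.path}${Platform.pathSeparator}native.lock';
  final length = await withLockFile(lockPath, () async {
    // The guarded native call would go here; a placeholder stands in for it.
    return 'placeholder text'.length;
  });
  print('Guarded action returned $length');
  await dir.delete(recursive: true);
}
```

A lock file created this way stays behind if the process crashes while holding it, so a production version would also need stale-lock detection. The sketch avoids RandomAccessFile.lock because, on Linux and macOS, those advisory locks are held per process and would not exclude isolates running in the same process.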

Testing #

The repository ships with unit and integration tests. To run the integration tests you need a fixture PDF (for example 1417.pdf) and the native libraries in the project root.

dart test

Regenerating bindings #

If you need to regenerate the FFI bindings after updating the native headers, run:

dart run ffigen --config ffigen.yaml

License

Apache-2.0 (license)

Dependencies

ffi, path
