pdf_text_extraction 2.1.0
pdf_text_extraction #
Bindings and convenience wrappers around a fork of xpdf that enable extracting text and metadata from PDF files using Dart. The native bits are available for Linux and Windows only.
ℹ️ The project depends on a fork of xpdf maintained at https://github.com/insinfo/xpdf.
Platform requirements #
- Windows: ship the compiled pdftotext.dll and TextExtraction.dll alongside your executable.
- Linux: ensure the GNU C++ runtime (libstdc++6) is available before using the package.
sudo apt-get install libstdc++6
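A missing native library otherwise only surfaces as a load failure at runtime, so it can help to check for it up front. The following is a minimal sketch, not part of the package: `nativeLibraryName` and `requireNativeLibrary` are hypothetical helpers, and the file names are assumed to match the ones used in the low-level example in this README (pdftotext.dll on Windows, pdftotext.so on Linux).

```dart
import 'dart:io';

/// Hypothetical helper (not part of pdf_text_extraction): maps an operating
/// system name, as reported by Platform.operatingSystem, to the native
/// library file name assumed from this README's low-level example.
String nativeLibraryName(String os) {
  switch (os) {
    case 'windows':
      return 'pdftotext.dll';
    case 'linux':
      return 'pdftotext.so';
    default:
      throw UnsupportedError('No native library is provided for $os');
  }
}

/// Hypothetical helper: returns the path of the native library inside
/// [directory], or throws a StateError if the file is not there.
String requireNativeLibrary(String directory, {String? os}) {
  final name = nativeLibraryName(os ?? Platform.operatingSystem);
  final file = File('$directory${Platform.pathSeparator}$name');
  if (!file.existsSync()) {
    throw StateError('Native library not found: ${file.path}');
  }
  return file.path;
}
```

Calling `requireNativeLibrary(Directory.current.path)` before opening the library would turn a missing file into a readable error message. It does not verify that libstdc++6 is installed on Linux.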
Getting started #
Add the package as a dependency and ensure the native libraries are available on the execution path or in the working directory. Two APIs are exposed:
- Low-level bindings generated by package:ffigen, mirroring the C API.
- High-level wrappers that take care of memory management and validation.
Low-level usage #
import 'dart:ffi';
import 'dart:io' show Platform, Directory;

import 'package:ffi/ffi.dart';
import 'package:path/path.dart' as path;
import 'package:pdf_text_extraction/pdf_text_extraction.dart';
import 'package:pdf_text_extraction/src/pdf_to_text_bindings.dart';

// Receives log messages emitted by the native library.
void logCallback(Pointer<Int8> msg) {
  print(nativeInt8ToString(msg));
}

void main() {
  // Pick the native library for the current platform.
  var libraryPath = path.join(Directory.current.path, 'pdftotext.dll');
  if (Platform.isLinux) {
    libraryPath = path.join(Directory.current.path, 'pdftotext.so');
  }
  final dylib = DynamicLibrary.open(libraryPath);
  final pdfLib = PDFToTextBindings(dylib);

  // Input PDF file.
  final uriPointer = stringToNativeInt8('pdf_file.pdf', allocator: calloc);
  // Character encoding of the output text.
  final textOutEnc = stringToNativeInt8('UTF-8', allocator: calloc);
  // Text layout mode.
  final layout = stringToNativeInt8('rawOrder', allocator: calloc);
  // Native callback used to print log messages.
  final lgf = Pointer.fromFunction<Void Function(Pointer<Int8>)>(logCallback);
  // Receives a pointer to the extracted text.
  final Pointer<Pointer<Int8>> textOut = calloc();

  // Extract page 1 only.
  final result = pdfLib.extractText(
      uriPointer, 1, 1, textOutEnc, layout, textOut, lgf, nullptr, nullptr);

  if (result == 0) {
    // Read the output only after a successful call.
    print('result ok: ${nativeInt8ToString(textOut.value)}');
  } else {
    print('error during text extraction');
  }

  // Release every buffer allocated above, including the layout string.
  calloc.free(uriPointer);
  calloc.free(textOutEnc);
  calloc.free(layout);
  calloc.free(textOut);
}
High-level usage #
import 'package:pdf_text_extraction/pdf_text_extraction.dart';

void main() {
  final wrapper = PDFToTextWrapping();
  // Extract the text of page 1 only.
  final text = wrapper.extractText(
    'pdf_file.pdf',
    startPage: 1,
    endPage: 1,
  );
  print('result: $text');
}
PDFToTextWrapping also exposes getPagesCount and reports any native errors
through the static lastError property.
Managing isolate contention #
When you need to invoke PDFToTextWrapping from multiple isolates, always go
through PDFToTextWrappingService. The service coordinates access using a
filesystem mutex and prevents the native library from being torn down while
another isolate is still working with it.
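To illustrate the idea of a filesystem mutex, here is a minimal sketch that serializes work through a lock file. It is not the implementation used by PDFToTextWrappingService: `withLockFile` is a hypothetical helper, and the sketch assumes a Dart SDK in which `File.createSync` accepts the `exclusive` parameter.

```dart
import 'dart:io';

/// Hypothetical helper (not part of pdf_text_extraction): runs [action]
/// while holding a lock file at [lockPath], retrying until the lock is free.
Future<T> withLockFile<T>(
  String lockPath,
  Future<T> Function() action, {
  Duration retryDelay = const Duration(milliseconds: 20),
  int maxAttempts = 500,
}) async {
  final lockFile = File(lockPath);
  var acquired = false;
  for (var attempt = 0; attempt < maxAttempts && !acquired; attempt++) {
    try {
      // Exclusive creation fails if the file already exists, so only one
      // caller at a time gets past this line.
      lockFile.createSync(exclusive: true);
      acquired = true;
    } on FileSystemException {
      // Most likely another caller holds the lock; wait and try again.
      await Future<void>.delayed(retryDelay);
    }
  }
  if (!acquired) {
    throw StateError('Could not acquire lock file $lockPath');
  }
  try {
    return await action();
  } finally {
    // Release the lock so the next waiter can proceed.
    lockFile.deleteSync();
  }
}
```

Because the lock lives on the filesystem, it is visible to every isolate and process that uses the same path. A crashed holder leaves a stale lock file behind, which this sketch does not handle.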
Running concurrent isolates without the service typically leads to native crashes similar to:
PS C:\MyDartProjects\pdf_text_extraction> dart .\example\pdf_to_text_isolate_example.dart
===== CRASH =====
ExceptionCode=-1073741819, ExceptionFlags=0, ExceptionAddress=00007FFEFED58D5D
...
pc 0x000001780444b055 fp 0x000000426c8feb78 sp 0x000000426c8feab0 [Unoptimized] PDFToTextWrapping.getPagesCount
Using the service restores stability:
PS C:\MyDartProjects\pdf_text_extraction> dart .\example\pdf_to_text_isolate_example.dart
Isolate 0 extracted 1842 characters from the PDF.
Isolate 3 extracted 1842 characters from the PDF.
Isolate 1 extracted 1842 characters from the PDF.
Isolate 2 extracted 1842 characters from the PDF.
All isolates finished without fatal contention.
In real code, wire it up like this:
final service = PDFToTextWrappingService();
await service.run((wrapper) {
  final pages = wrapper.getPagesCount('document.pdf');
  final text = wrapper.extractText('document.pdf', endPage: pages > 0 ? 1 : 0);
  print(text);
});
See example/pdf_to_text_isolate_example.dart for a complete runnable sample.
Testing #
The repository ships with unit and integration tests. To run the integration
tests, you must have a fixture PDF (for example 1417.pdf) and the native
libraries in the root of the project.
dart test
Regenerating bindings #
If you need to regenerate the FFI bindings after updating the native headers, run:
dart run ffigen --config ffigen.yaml