Post · bonfire.cafe

@Greenheart@fosstodon.org · 6 months ago

Have you ever needed to extract text from images embedded in a #PDF? I can highly recommend the open-source #CLI tool #OCRmyPDF, which is easy to automate, for example in a #DataPipeline.

It uses #Tesseract #OCR under the hood and has many options to experiment with to get the best possible accuracy for your language and PDF content.

You can get started with just a few commands:

https://samuelplumppu.se/blog/automated-text-extraction-from-pdf-images-with-ocrmypdf
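As a rough idea of what automating this in a pipeline could look like, here is a minimal Python sketch that shells out to the `ocrmypdf` CLI for every PDF in a folder. It is not taken from the linked blog post. The helper names (`build_ocrmypdf_command`, `ocr_folder`) and the folder layout are made up for illustration, and it assumes `ocrmypdf` is installed and on your `PATH`. The `--language` and `--sidecar` options are the ones I would reach for first: the sidecar file receives the recognized text as plain text next to the output PDF.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_command(input_pdf, output_pdf, language="eng", sidecar=None):
    """Build the argument list for one ocrmypdf run (nothing is executed here).

    Hypothetical helper: only the `ocrmypdf` program name and the
    --language / --sidecar options come from the tool itself.
    """
    cmd = ["ocrmypdf", "--language", language]
    if sidecar is not None:
        # Ask ocrmypdf to also write the recognized text to a plain-text file.
        cmd += ["--sidecar", str(sidecar)]
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def ocr_folder(in_dir, out_dir, language="eng"):
    """Run ocrmypdf on every *.pdf in in_dir, writing results to out_dir.

    Assumes ocrmypdf is installed; raises CalledProcessError if a run fails.
    Returns the list of sidecar text files that were requested.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sidecars = []
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        sidecar = out_dir / (pdf.stem + ".txt")
        cmd = build_ocrmypdf_command(pdf, out_dir / pdf.name, language, sidecar)
        subprocess.run(cmd, check=True)
        sidecars.append(sidecar)
    return sidecars
```

Calling `ocr_folder("scans", "ocr-output", language="swe")` would then process a (hypothetical) `scans` folder with the Swedish Tesseract language pack, provided that pack is installed. Keeping the command construction in its own function makes it easy to add more options while experimenting with accuracy.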
