Discussion
Loading...

Post

  • About
  • Code of conduct
  • Privacy
  • Users
  • Instances
  • About Bonfire
Samuel Plumppu
@Greenheart@fosstodon.org  ·  activity timestamp last month

Have you ever needed to extract text from images embedded in a #PDF? I can highly recommend the open source #CLI tool #OCRmyPDF which is easy to automate in for example a #DataPipeline.

It uses #Tesseract#OCR under the hood and has many options to experiment with to get the best possible accuracy for your language and PDF content.

You can get started with just a few commands:

https://samuelplumppu.se/blog/automated-text-extraction-from-pdf-images-with-ocrmypdf

  • Copy link
  • Flag this post
  • Block
Log in

bonfire.cafe

A space for Bonfire maintainers and contributors to communicate

bonfire.cafe: About · Code of conduct · Privacy · Users · Instances
Bonfire social · 1.0.0-rc.3.1 no JS en
Automatic federation enabled
  • Explore
  • About
  • Members
  • Code of Conduct
Home
Login