Discussion
Karsten Schmidt
@toxi@mastodon.thi.ng  ·  activity timestamp 2 days ago

I've been studying the PDF file format spec in depth for the past few days... it's mind-boggling that this format, with its capacity for monstrous complexity, managed to become the de facto standard for modern documents. So many questionable decisions & undue flexibility in there (in the wrong places), making even a simple task like reliable metadata extraction (for example) incredibly hard... I understand and value flexibility, but not at this cost! Guess I will have to keep on using one of the mega monolithic all-you-can-eat PDF frameworks after all (which I've been trying to avoid...)

cms
@cms@fed.beatworm.co.uk replied  ·  activity timestamp 13 hours ago

@toxi Hi! Yeah PDF is maddeningly baroque. I still really like it though.
The PDF file format is really interesting software archaeology! It troubles me a bit how much history rot has happened: information about this stuff is moderately difficult to search for because of semantic overload. And there are lots of telephone-game style, over-repeated factoids like 'PDF is a simplified form of PostScript', which are only half-true and give the wrong kind of emphasis because they're conflating a few things.
The most important thing to understand is that PDF is really old! The roots of PDF come from John Warnock's 'Project Camelot' white paper, which is from the dawn of the 90s. So the design goals are realised within the constraints of 80s computer architecture, around the 16-bit/32-bit transition. Computers are very slow compared to what we have now.
Addressable RAM is measured in MB. Multi-tasking is optional. File formats are generally tightly coupled to a single application. Multi-component documents / multimedia are not really a thing.
Document sharing is close to impossible, digitally. You can't share detailed document layouts as bitmaps, because the resolution required is way above the ability of computer displays (one megapixel is pretty fancy for a VDU), and colour representation is pretty unevenly supported. If you want to send a document to someone else and have them view it on screen as close as possible to how you made it, they basically have to have the same software as you, on the same platform, with the same drivers, and all the same fonts and plugins. Warnock/Adobe wanted to solve this problem once, and for everyone.
This really was an astounding reach. The format had to be completely self-contained, self-hosting fonts, images, and other media. It needed to be futureproof enough to extend forwards to media types and resolutions and data formats that do not yet exist.
And it needed to be able to render fast enough to use as a file viewer, whilst depending on resources that were likely to be significantly larger than the working memory of the viewing platform, and that were stored on slow magnetic storage.
This is the model for the problem you have to solve. And looking at it like that, you end up designing a contained filesystem / object store, which is really the right way to think about PDF files. It's a database system for modeling a tree of interrelated media. The main technical constraint to solve is _rapid_ dynamic assembly and rendering of a sub-view of that tree from slow secondary storage. The storage system uses indirect object references, resolved through a cross-reference table of byte offsets, and internal compression for efficiency, and it also supports incremental updating, so that PDFs can be amended without having to rebuild the entire database. An example feature that this architecture allows is efficient browsing of large documents via thumbnails / scaled minimaps on a screen too small to show more than 30% of any single page. Almost completely irrelevant to today's hardware, but a deal-breaking feature at the time.
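To make the "object store" view a bit more concrete, here is a rough Python sketch of the random-access lookup it implies: read the `startxref` pointer near the end of the file, parse the cross-reference table it points to, then jump straight to a single object without scanning the rest. This is only an illustration under loud assumptions: it handles a classic plain-text `xref` table only (not the cross-reference streams of newer PDFs), it ignores the chain of earlier tables left behind by incremental updates, and the function names and the tiny demo file are made up for the example.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    idx = data.rfind(b"startxref")
    if idx < 0:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[idx:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


def parse_xref_table(data: bytes, offset: int) -> dict[int, int]:
    """Map object number -> byte offset for the in-use entries.

    Assumption: a classic plain-text table at `offset`; anything else
    is rejected rather than guessed at.
    """
    lines = data[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("expected a classic 'xref' table at this offset")
    offsets: dict[int, int] = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        # Subsection header: first object number, then entry count.
        first, count = (int(n) for n in lines[i].split())
        for k in range(count):
            pos, _generation, kind = lines[i + 1 + k].split()
            if kind == b"n":  # 'n' = in use, 'f' = free
                offsets[first + k] = int(pos)
        i += 1 + count
    return offsets


def read_object(data: bytes, offsets: dict[int, int], num: int) -> bytes:
    """Slice one object out of the file using only its recorded offset."""
    start = offsets[num]
    end = data.index(b"endobj", start) + len(b"endobj")
    return data[start:end]


def build_demo_pdf() -> bytes:
    """Hypothetical two-object file, just enough structure for the demo."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    positions = []
    for num, body in enumerate(bodies, start=1):
        positions.append(len(out))  # remember where each object starts
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_at = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for pos in positions:
        out += b"%010d 00000 n \n" % pos
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)


if __name__ == "__main__":
    pdf = build_demo_pdf()
    table = parse_xref_table(pdf, find_startxref(pdf))
    print(read_object(pdf, table, 1).decode("ascii"))
```

The point of the sketch is the access pattern: the reader touches the tail of the file, one table, and one object. An incremental update appends new objects plus a new table after the old `%%EOF`, which is why a real reader has to follow the older tables as well, and part of why "just grab the metadata" gets complicated.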
I think PDF is an astounding technical achievement. I also think it's a huge, odd, complex and weird thing that only makes sense in a historical context. It's full of hard engineering optimisations for things that are barely problems now. HTML / CSS / SVG has just about caught up with PDF capabilities at this point (many of its capabilities are much lower priority in 2025, and maybe don't need to be supported, and we need to support a much wider array of viewports and contexts now), but we're talking about a software system that's well over 30 years old and still pretty valid.
