Project Gutenberg
@gutenberg_org@mastodon.social  ·  activity timestamp 4 hours ago

Over 32,000 medieval manuscripts transcribed in four months using AI

Medievalists can now access automated transcriptions of 32,763 digitised medieval manuscripts, produced in just four months as part of a project called CoMMA—a large-scale corpus designed to make manuscript texts searchable and analysable at a scale that would be impossible to tackle by hand.

https://www.medievalists.net/2026/01/32000-medieval-manuscripts-transcribed-using-ai/

Original paper:
https://inria.hal.science/hal-05299220

The CoMMA website:
https://comma.inria.fr/homepage

#books #old_manuscripts

First page of Chronicon Pictum, the "Illuminated Chronicle" from the court of King Louis the Great of Hungary from 1358.

Main illumination shows:

A multi-paneled scene within an elaborate architectural frame
Central figure: A crowned king (likely representing Hungarian royalty) seated on a throne
Left panel: Armed warriors or knights
Right panel: Groups of courtiers or nobles
Rich colors: deep blues, reds, golds, and ochres
Gothic architectural elements including towers and arches

Latin text in two columns
Red lettering (rubrics) for important headings and initial words
Black text for the main chronicle
Beautiful calligraphy in Gothic script

https://en.wikipedia.org/wiki/Illuminated_manuscript#/media/File:K%C3%A9pes_kr%C3%B3nika_els%C5%91_lapja.jpg

CoMMA

CoMMA, a Large-scale Corpus of Multilingual Medieval Archives

We present CoMMA, a large-scale corpus of medieval manuscripts produced through automatic text recognition. The corpus contains around 3.3 billion tokens drawn from more than 32,700 digitized manuscripts in Latin and Old French, harvested via IIIF. Unlike other resources, it is made of raw, non-normalized text enriched with layout analysis in various formats. We describe the pipeline used for large-scale acquisition and processing, and report quantitative and qualitative evaluations (average CER 9.7%). The resulting resource supports multiple use cases, from pretraining language models to corpus linguistics on historical languages and digital humanities applications.
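
For readers unfamiliar with the metric, CER (character error rate) is conventionally the character-level edit distance between a reference transcription and the recognised text, divided by the length of the reference. The sketch below shows that standard definition only; it is not taken from the CoMMA paper, whose exact evaluation protocol is not described in this thread. The function names and the sample strings are hypothetical, chosen for illustration.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h (or keep a match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Invented example: one substituted character in a 24-character line.
reference = "in principio erat verbum"
hypothesis = "in principio erat uerbum"
print(f"CER = {cer(reference, hypothesis):.1%}")
```

Under this definition, an average CER of 9.7% corresponds to roughly one character-level edit for every ten reference characters.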
Medievalists.net

Over 32,000 medieval manuscripts transcribed in four months using AI - Medievalists.net

A new AI-powered tool has transcribed over 32,000 medieval manuscripts in four months, giving researchers a vast, searchable corpus and new ways to examine historical sources.
65dBnoise
@65dBnoise@mastodon.social replied  ·  activity timestamp 3 hours ago

@gutenberg_org
Yes, only they did not use #AI.
The paper itself doesn't even once mention the words "AI" or "Artificial Intelligence". Not once. And that's done purposefully by the researchers, who respect their own work and their colleagues/readers reading their paper. Their model used ModernBERT, not GPT or any other LLM.

The use of vague, over-hyped marketing buzzwords like "AI" obscures the work of the researchers, who went to great lengths in their paper to describe how they did it.

#hype
