Arndt, Tracy; Arndt, Natanael: How to describe the past Web? A data model for web archiving. SWIB25 - Semantic Web in Libraries, ZBW - Leibniz-Informationszentrum Wirtschaft et al., 2025. https://doi.org/10.5446/72405
Journalists don’t just report from the web anymore—they report on it.
Learn the 9 Ways Web Archives Are Used In Digital Investigations in a new guest post by researchers from King’s College London who analyze 8,600 news articles to identify how journalists use the #WaybackMachine in digital investigations.🕵️♀️
Read Follow the Changes 👉 https://blog.archive.org/2026/02/02/follow-the-changes/
Tomorrow we will give a short presentation on #ArtDocArchive, a prototype for #webarchiving artists' self-documentation on websites and social media (essentially preserving websites and feeds, extracting information, and visualizing it), at this event at nGbK in Berlin:
https://ngbk.de/en/programm/termine/eastunbloc-in-medias-rest
Project website: https://art-doc-archive.net/ There you can find the software and blog posts from the four-month project.
📢 Hello Mastodon! 👋
We’re CiVers: Citation of Versioned Web Pages by Persistent Identifier
Web pages change. Links rot. Academic references break. We’re fixing that. 🛠️
💻 CiVers develops software and methodologies to make web content reliably citable with PIDs and versioning
🔗 DFG-funded @dfg_public project at the DAI Berlin @dai_weltweit with Heidelberg University Library @uniheidelberg, GBV @vzg_gbv and DataCite @datacite
#WebArchiving #OpenScience #DigitalHumanities #PID #DataCite #CiVers
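The core idea, citing a specific version of a web page via a PID, can be sketched as a record that pins an identifier to one capture. This is an illustrative sketch only, not the actual CiVers data model; all field names are assumptions:

```python
from dataclasses import dataclass
import hashlib

@dataclass
class VersionedCitation:
    """Illustrative record pinning a PID to one captured version of a page.

    Field names are hypothetical, not the real CiVers schema.
    """
    pid: str            # persistent identifier, e.g. a DOI
    url: str            # original web address
    capture_time: str   # ISO 8601 timestamp of the archived version
    content_hash: str   # fingerprint of the captured content

def cite(pid: str, url: str, capture_time: str, content: bytes) -> VersionedCitation:
    # The hash lets a reader later verify that the cited version is unchanged.
    return VersionedCitation(pid, url, capture_time,
                             hashlib.sha256(content).hexdigest())

citation = cite("10.1234/example", "https://example.org/page",
                "2025-10-14T12:00:00Z", b"<html>...</html>")
```

The point of combining a PID with a timestamp and hash is that the citation survives later changes to the live page.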
I heard it through the grapevine that the Library of Congress is accepting bids to become their #WebArchiving vendor. The documents offer a small window into some of the details of how they currently do web archiving (transferring BagIt packages from S3) and the reports they generate to monitor it.
https://sam.gov/workspace/contract/opp/a2c5551af2b74c3d84c775032c83a55e/view
"The core technical challenge was clear: How do we allow investigators to move fast while maintaining a forensic chain of custody? A simple screenshot is insufficient for legal or historical proof. We needed a system where an investigator could claim, “at this time and date, I browsed this unique URL which contained precisely this content,“ and be able to back it up with cryptographic proof."
https://dispatch.starlinglab.org/p/pilot-project-on-making-web-preservation
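The claim quoted above ("at this time and date, I browsed this unique URL which contained precisely this content") can be sketched minimally with a content hash bound to a URL and timestamp. This is a bare-bones illustration, not Starling Lab's actual system, which would also sign the record and anchor it with trusted timestamping:

```python
import hashlib
from datetime import datetime, timezone

def capture_proof(url: str, content: bytes) -> dict:
    """Minimal sketch of a verifiable capture record.

    A real forensic system would additionally sign this record and anchor
    it externally; a bare hash proves the content, not who captured it
    or that the stated time is honest.
    """
    return {
        "url": url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
    }

def verify(record: dict, content: bytes) -> bool:
    # Anyone holding the content can recompute the hash and compare.
    return record["sha256"] == hashlib.sha256(content).hexdigest()

rec = capture_proof("https://example.org", b"<html>hello</html>")
```

Any later alteration of the content makes verification fail, which is what turns a capture into evidence rather than a screenshot.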
duckdb-web-archive-cdx 👀
DuckDB extension to query web archive CDX APIs directly from SQL.
https://github.com/midwork-finds-jobs/duckdb-web-archive
#webarchiving
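The extension wraps CDX APIs in SQL; for comparison, here is what a plain-Python query against the Wayback Machine's CDX API looks like. The query construction and the JSON row shape (header row first, then one row per capture) follow the public CDX API; the sample response below is illustrative, parsed offline to avoid a network call:

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(url: str, limit: int = 10) -> str:
    # output=json returns a list of rows, with the column names as row 0.
    return CDX_ENDPOINT + "?" + urlencode(
        {"url": url, "output": "json", "limit": limit})

def parse_cdx(payload: str) -> list[dict]:
    rows = json.loads(payload)
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]

# Offline sample shaped like a real CDX response (values are illustrative):
sample = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["org,example)/", "20240101000000", "https://example.org/", "text/html", "200", "ABC123", "1234"],
])
captures = parse_cdx(sample)
```

A SQL front end like the DuckDB extension saves exactly this boilerplate and lets you join capture lists against other tables.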
wow, this is huge: World's largest Internet Domain Database https://ip.thc.org/
can be relevant for #webarchiving activities.
bulk data available, in parquet format: https://ip.thc.org/docs/bulk-data-access
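Parquet bulk dumps pair naturally with DuckDB's `read_parquet`. A sketch of building such a query follows; the file glob and the `domain` column name are assumptions about the dataset's layout, so check the bulk-data docs for the real schema before running it:

```python
# Sketch: query the bulk parquet dumps with DuckDB SQL.
# 'domains/*.parquet' and the 'domain' column are assumed, not confirmed.

def domains_by_tld(parquet_glob: str, tld: str) -> str:
    """Build a DuckDB SQL query selecting domains under one TLD."""
    return (
        "SELECT domain "
        f"FROM read_parquet('{parquet_glob}') "
        f"WHERE domain LIKE '%.{tld}'"
    )

sql = domains_by_tld("domains/*.parquet", "org")
# To execute:  import duckdb; duckdb.sql(sql).df()
```

For #webarchiving triage this kind of query could, for example, shortlist domains under a country-code TLD for crawling.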
Wikidata and decentral archives:
At the workshop Subverting Archival Practices (https://perfomap.de/newspictures/workshop_berlin_nov.pdf), I presented my thoughts on the role of #Wikidata for the now very widespread online archives of projects as websites (e.g., https://talkingobjectsarchive.org/ https://dekoloniale.de/ https://archive.transmediale.de/ https://archiv.ngbk.de/). Linking persons, events, places, objects on these websites to Wikidata would make them more visible, but also allow for research across these different archives.
A new aspect here is Wikidata's possible role in #webarchiving, i.e. when endangered websites are stored in containers, Wikidata can provide references to individual pieces of information inside the container once the original website is no longer available.
This was also tested with students of @magdalena at Aarhus University in October in the workshop "Data Modelling and Web Content at Risk", where they referenced information from the https://dekoloniale.de/ website on Wikidata: https://cc.au.dk/en/c4cdp/news/show/artikel/data-modelling-and-web-content-at-risk-students-of-the-curating-data-course-explore-digital-preservation-through-wikidata-and-linked-open-data
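Such a Wikidata reference to an archived copy can be sketched roughly as follows. This is a simplified illustration, not real Wikibase JSON, using the properties P854 (reference URL), P1065 (archive URL), and P2960 (archive date); the URLs are hypothetical:

```python
# Sketch of a Wikidata reference pointing at an archived copy of a page,
# so the claim stays verifiable after the original site disappears.
# Structure is heavily simplified compared to the real Wikibase JSON model.

def archived_reference(original_url: str, archive_url: str, archive_date: str) -> dict:
    return {
        "P854": original_url,    # reference URL: where the information appeared
        "P1065": archive_url,    # archive URL: the preserved copy
        "P2960": archive_date,   # archive date: when that copy was made
    }

ref = archived_reference(
    "https://dekoloniale.de/some-page",
    "https://web.archive.org/web/20241001000000/https://dekoloniale.de/some-page",
    "2024-10-01",
)
```

The archive URL is what keeps the statement checkable across archives even after link rot sets in.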
🤔 Curious about how CiVers will work?
📹 We’ve shared a new (German-language!) video that walks you through the core idea behind CiVers and shows the very first steps of our software development.
If you want to understand what we’re building, and why versioning and PIDs matter, this is a great place to start!
👉 Watch the video:
https://www.youtube.com/watch?v=Jje8Kb8xJL0
📣 New blog post! 📝
On October 14, we hosted our first #CiVers workshop at the @dai_weltweit in Berlin 🏛️ It was a great opportunity to exchange ideas on citing versioned web resources and managing research data in #archaeology and the #humanities.
Read more about what we discussed 👇
🔗 https://www.dainst.org/blogs/noslug/253
#Metadata #Research #OpenScience #DigitalPreservation #WebArchiving #DigitalHumanities
‼️ ATTENTION ‼️
⏰ Proposals for #iipcWAC26, "Sustainable #WebArchiving," in Brussels are due in 1 WEEK (15 OCT): http://netpreserve.org/ga2026/CfP
First-time submitters encouraged! Need inspiration?
https://www.youtube.com/@iipc8855/featured
#WebArchives #WebArchiveWednesday #DigitalPreservation #DigitalHumanities
@webarchives
I started crawling 80,000,000 web pages for a personal project on August 1, after an initial trial that PING-ed 40,000,000 of them.
Cloudflare rate limiting is triggered at least twice a day, and almost immediately if I raise my crawl rate.
Currently at 16% as of September 17.
Has rate limiting severely limited web archiving capability with the advent of AI?
Should I be crawling from multiple IP addresses?
What volume of pages is typically crawled in day-to-day archiving processes?
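For scale, the numbers in the post imply roughly this throughput (back-of-envelope only; the year is assumed, and the rate is averaged over the whole period):

```python
from datetime import date

# Figures from the post above (year assumed to be 2025).
total_pages = 80_000_000
done_fraction = 0.16
days = (date(2025, 9, 17) - date(2025, 8, 1)).days  # 47 days elapsed

crawled = total_pages * done_fraction          # 12.8 million pages so far
per_day = crawled / days                       # ~272,000 pages/day
per_second = per_day / 86_400                  # ~3.2 pages/second average
days_left = (total_pages - crawled) / per_day  # ~247 more days at this rate
```

So the average sustained rate is only a few pages per second, and finishing would take roughly eight more months at the current pace, which is why the rate-limiting question matters.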