ink
@ink@merveilles.town  ·  2 weeks ago

I heard it through the grapevine that the Library of Congress is accepting bids to become their #WebArchiving vendor. The documents provide a little window into some of the details of how they currently do web archiving (transferring BagIt packages from S3) and the reports they generate to monitor it.

https://sam.gov/workspace/contract/opp/a2c5551af2b74c3d84c775032c83a55e/view

ink
@ink@merveilles.town replied  ·  2 weeks ago

It's interesting that they want a sample crawl delivered for evaluation. You can see the list of seeds they want crawled over a 48-hour period in the attached Word file:

* J2b - Sample_Web_Crawl_Seed_List_20250917 - (pp. 13).docx

Ashley Blewer
@ashley@digipres.club replied  ·  2 weeks ago

@ink ooooh nice find 👀

Nemo_bis 🌈
@nemobis@mamot.fr replied  ·  2 weeks ago

@ink Interesting! From a quick read, WARC/CDX/BagIt are required, Heritrix not.

«large, standard (e.g. Heritrix) crawl, and a smaller browser-based crawl [...] around 16,500 seed URLs across all crawls is 70/30 standard to browser-based»
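
For a sense of scale, here is a rough sketch of what that split could work out to. It assumes the 70/30 ratio applies directly to the seed count, which the elided quote does not confirm, and the variable names are purely illustrative:

```python
# Assumption: the quoted 70/30 ratio applies to the number of seed URLs.
TOTAL_SEEDS = 16_500       # "around 16,500" per the quoted text
STANDARD_PERCENT = 70      # standard (e.g. Heritrix) share

# Integer arithmetic avoids floating-point rounding surprises.
standard_seeds = TOTAL_SEEDS * STANDARD_PERCENT // 100
browser_seeds = TOTAL_SEEDS - standard_seeds

print(f"standard: ~{standard_seeds}, browser-based: ~{browser_seeds}")
```

Under that assumption, roughly 11,550 seeds would go to the standard crawl and 4,950 to the browser-based one.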

«Contractor must delete all copies of captured content stored by the Contractor.» Ok for password-protected URLs, strange otherwise.

Background checks for employees.

"Contractor publicity" is why the IA does not (?) publicly say they do this.

#digipres

ink
@ink@merveilles.town replied  ·  2 weeks ago

@nemobis Yes, there are probably .gov rules around that. I'm pretty certain LoC switched away from Internet Archive to MirrorWeb ~5 years ago? I don't remember exactly when, and it's hard to find info about it quickly on the internets (for said reasons perhaps).

Nemo_bis 🌈
@nemobis@mamot.fr replied  ·  2 weeks ago

@ink Ah! I didn't know, I'm out of the loop.

ink
@ink@merveilles.town replied  ·  2 weeks ago

@nemobis it was IA for a looong time. I think I only found out it had changed pretty recently.
