Discussion
Loading...

Post

Log in
  • About
  • Code of conduct
  • Privacy
  • Users
  • Instances
  • About Bonfire
raffaele
raffaele
@raffaele@digipres.club  路  activity timestamp 2 weeks ago

duckdb-web-archive-cdx 馃憖
DuckDB extension to query web archive CDX APIs directly from SQL.

https://github.com/midwork-finds-jobs/duckdb-web-archive
#webarchiving

GitHub

GitHub - midwork-finds-jobs/duckdb-web-archive: DuckDB extension to fetch pages from common crawl

DuckDB extension to fetch pages from common crawl. Contribute to midwork-finds-jobs/duckdb-web-archive development by creating an account on GitHub.
  • Copy link
  • Flag this post
  • Block
Ed Summers
Ed Summers
@ink@merveilles.town replied  路  activity timestamp 2 weeks ago

@raffaele wow, this is sick! It also lets you run CSS selectors on archived HTML content?!

In my brief experimentation it looks like the `max_results` option is important, depending on what kind of query you are doing.

A SQL query of Wayback Machine, with max_results set to 1,000,000, to count the number of snapshots of the inkdroid.org page and group them by year. This
A SQL query of Wayback Machine, with max_results set to 1,000,000, to count the number of snapshots of the inkdroid.org page and group them by year. This
A SQL query of Wayback Machine, with max_results set to 1,000,000, to count the number of snapshots of the inkdroid.org page and group them by year. This
A SQL query of Wayback machine to count the number of snapshots of the inkdroid.org page and group them by year.
A SQL query of Wayback machine to count the number of snapshots of the inkdroid.org page and group them by year.
A SQL query of Wayback machine to count the number of snapshots of the inkdroid.org page and group them by year.
  • Copy link
  • Flag this comment
  • Block
Ed Summers
Ed Summers
@ink@merveilles.town replied  路  activity timestamp 2 weeks ago

@raffaele neato, it also supports LIKE queries with a URL prefix:

A table of results querying for the count of snapshots of any page at inkdroid.org:

select strftime(timestamp::timestamp, '%Y') as year, count(*) from wayback_machine(max_results=10000000) where url like 'inkdroid.org/%' group by all order by year;
A table of results querying for the count of snapshots of any page at inkdroid.org: select strftime(timestamp::timestamp, '%Y') as year, count(*) from wayback_machine(max_results=10000000) where url like 'inkdroid.org/%' group by all order by year;
A table of results querying for the count of snapshots of any page at inkdroid.org: select strftime(timestamp::timestamp, '%Y') as year, count(*) from wayback_machine(max_results=10000000) where url like 'inkdroid.org/%' group by all order by year;
  • Copy link
  • Flag this comment
  • Block
raffaele
raffaele
@raffaele@digipres.club replied  路  activity timestamp 2 weeks ago

@ink It seems a very useful extension. I haven't tested it thoroughly, as the CDX API often times out from Italy today.

  • Copy link
  • Flag this comment
  • Block

bonfire.cafe

A space for Bonfire maintainers and contributors to communicate

bonfire.cafe: About 路 Code of conduct 路 Privacy 路 Users 路 Instances
Bonfire social 路 1.0.1-alpha.40 no JS en
Automatic federation enabled
Log in
  • Explore
  • About
  • Members
  • Code of Conduct