Discussion
Ed Summers
@edsu@social.coop  ·  2 months ago

This may be a dumb question, but ... is there an approach/pattern for publishing #parquet files for use with DuckDB, Trino, etc. that allows for data to be updated over time? Do you just publish the updated data, and push the removal of old records into the query step?

Timothy
@arthegall@mastodon.roundpond.net replied  ·  2 months ago
@edsu if you’re not publishing separate snapshots, then yeah, you’re going to have to treat records in parquet as records of writes, and resolve the last one (including tombstones) at query time

Honestly, this is kind of the thing that table formats like Iceberg and Delta Lake (among others) are meant to solve
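
A minimal sketch of that query-time, last-write-wins resolution using DuckDB's Python API, assuming each parquet row carries hypothetical id, updated_at, and is_deleted (tombstone) columns:

```python
import duckdb

# Treat every row as a write; keep only the most recent write per id,
# then drop ids whose most recent write is a tombstone.
latest = duckdb.sql("""
    SELECT * EXCLUDE (rn)
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM read_parquet('data/*.parquet')
    )
    WHERE rn = 1          -- most recent write per id
      AND NOT is_deleted  -- discard records whose last write was a delete
""")
print(latest)
```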

Ed Summers
@edsu@social.coop replied  ·  2 months ago
@arthegall thanks, this was super helpful!
Ed Summers
@edsu@social.coop replied  ·  2 months ago

@arthegall also, it sounded like you might have some experience with said tools? No pressure, but I would appreciate any more pointers/preferences as someone who has a general understanding and is trying to wade through the morass of information about data lakes, etc.

Timothy
@arthegall@mastodon.roundpond.net replied  ·  2 months ago
@edsu I've seen some of these tools in action, sure -- I'm no expert. I'm a few years away from regular Spark usage (and some of these formats are mostly Spark-accessible), I've dabbled in others (Polars, DataFusion), and others are a regular part of my professional work (Parquet, DuckDB, pyarrow). I'd always be happy to answer questions, or give opinions, if it would ever be helpful.

I also saw ducklake the other day, for the first time, and marked it down as "warrants investigation" :-)

Ed Summers
@edsu@social.coop replied  ·  2 months ago

@arthegall hmm, ducklake looks kind of interesting too, at least for the scale I am working at

Andy Jackson
@anj@digipres.club replied  ·  2 months ago
@edsu I'm aware of the pattern of using folders to partition parquet datasets. You can e.g. break partitions down by date and add new data as it comes in: https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files. But that works best for appending data; I'm not sure how it works for modifying data.
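
A minimal sketch of that folder-per-partition layout with pyarrow, assuming a hypothetical table with a date column to partition on:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "value": ["a", "b", "c"],
    "date": ["2024-06-01", "2024-06-01", "2024-06-02"],
})

# Each distinct 'date' value becomes its own directory (date=2024-06-01/...),
# so new days can be appended without touching existing files.
pq.write_to_dataset(table, root_path="dataset/", partition_cols=["date"])
```
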
Ed Summers
@edsu@social.coop replied  ·  2 months ago
@anj right, I had seen that too. I guess if you append the modified data you can limit to the latest version of each record as you query the data? But that adds an extra bit of complexity that you'd have to remember.
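
One way to avoid having to remember that step in every query is to wrap the dedup in a view; a minimal sketch with DuckDB's Python API, reusing the hypothetical id / updated_at columns from the earlier sketch:

```python
import duckdb

con = duckdb.connect("catalog.duckdb")
con.sql("""
    CREATE OR REPLACE VIEW records AS
    SELECT * EXCLUDE (rn)
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM read_parquet('dataset/**/*.parquet', hive_partitioning = true)
    )
    WHERE rn = 1
""")

# Downstream queries hit the view and never repeat the dedup logic.
con.sql("SELECT count(*) FROM records").show()
```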