Discussion
Ed Summers
@edsu@social.coop  ·  2 months ago

This may be a dumb question, but ... is there an approach/pattern for publishing #parquet files for use with DuckDB, Trino, etc. that allows for data to be updated over time? Do you just publish the updated data, and push the removal of old records into the query step?

Timothy
@arthegall@mastodon.roundpond.net replied  ·  2 months ago
@edsu if you’re not publishing separate snapshots, then yeah, you’re going to have to treat records in parquet as records of writes, and resolve the last one (including tombstones) at query time

Honestly, this is kind of the thing that table formats like Iceberg and Delta Lake (among others) are meant to solve
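
A minimal sketch of that query-time, last-write-wins resolution using DuckDB's Python API, assuming each parquet row carries hypothetical id, updated_at, and is_deleted (tombstone) columns:

```python
import duckdb

# Treat every row as a write; keep only the most recent write per id,
# then drop ids whose most recent write is a tombstone.
latest = duckdb.sql("""
    SELECT * EXCLUDE (rn)
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM read_parquet('data/*.parquet')
    )
    WHERE rn = 1          -- most recent write per id
      AND NOT is_deleted  -- discard records whose last write was a delete
""")
print(latest)
```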

Ed Summers
@edsu@social.coop replied  ·  2 months ago
@arthegall thanks, this was super helpful!
Ed Summers
@edsu@social.coop replied  ·  2 months ago

@arthegall also, it sounded like you might have some experience with said tools? No pressure, but I would appreciate any more pointers/preferences as someone who has a general understanding and is trying to wade through the morass of information about data lakes, etc.

Timothy
@arthegall@mastodon.roundpond.net replied  ·  2 months ago
@edsu I've seen some of these tools in action, sure -- I'm no expert. I'm a few years away from regular Spark usage (and some of these formats are mostly Spark-accessible), I've dabbled in others (Polars, DataFusion), and others are a regular part of my professional work (Parquet, DuckDB, pyarrow). I'd always be happy to answer questions, or give opinions, if it would ever be helpful.

I also saw ducklake the other day, for the first time, and marked it down as "warrants investigation" :-)

Ed Summers
@edsu@social.coop replied  ·  2 months ago

@arthegall hmm, ducklake looks kind of interesting too, at least for the scale I am working at

Andy Jackson
@anj@digipres.club replied  ·  2 months ago
@edsu I'm aware of the pattern of using folders to partition parquet datasets. You can e.g. break partitions down by date and add new data as it comes in: https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files. But that works best for appending data; I'm not sure how it works for modifying data.
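
A minimal sketch of that folder-per-partition layout with pyarrow, assuming a hypothetical table with a date column to partition on:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "value": ["a", "b", "c"],
    "date": ["2024-06-01", "2024-06-01", "2024-06-02"],
})

# Each distinct 'date' value becomes its own directory (date=2024-06-01/...),
# so new days can be appended without touching existing files.
pq.write_to_dataset(table, root_path="dataset/", partition_cols=["date"])
```
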
Ed Summers
@edsu@social.coop replied  ·  2 months ago
@anj right, I had seen that too. I guess if you append the modified data you can limit to the latest version of each record as you query the data? But that adds an extra bit of complexity that you'd have to remember.
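
One way to avoid having to remember that step in every query is to wrap the dedup in a view; a minimal sketch with DuckDB's Python API, reusing the hypothetical id / updated_at columns from the earlier sketch:

```python
import duckdb

con = duckdb.connect("catalog.duckdb")
con.sql("""
    CREATE OR REPLACE VIEW records AS
    SELECT * EXCLUDE (rn)
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM read_parquet('dataset/**/*.parquet', hive_partitioning = true)
    )
    WHERE rn = 1
""")

# Downstream queries hit the view and never repeat the dedup logic.
con.sql("SELECT count(*) FROM records").show()
```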