
Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m

164 points - last Saturday at 5:12 PM

  • vovavili

    today at 7:45 PM

    Replacing an 11.6GB Parquet file every 5 minutes strikes me as a bit wasteful. I would probably use Apache Iceberg here.

      • ai-inquisitor

        today at 7:54 PM

        It's not doing that. If you look at the repository, it's adding a new commit with tiny Parquet files every 5 minutes. This recent one was only a 20.9 KB Parquet file: https://huggingface.co/datasets/open-index/hacker-news/commi... and the ones before it had a median size of 5 KB: https://huggingface.co/datasets/open-index/hacker-news/tree/...

        The bigger concern is how large the git history is going to get on the repository.

          • vovavili

            today at 8:04 PM

            This makes more sense. I still wonder if the author isn't just effectively recreating Apache Iceberg manually here.

              • tomrod

                today at 8:05 PM

                Are they paying for the repo space, I wonder?

        • zerocrates

          today at 7:56 PM

          "The dataset is organized as one Parquet file per calendar month, plus 5-minute live files for today's activity. Every 5 minutes, new items are fetched from the source and committed directly as a single Parquet block. At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory."

          So it's not really one big file getting replaced all the time. Though a less extreme variation of that is happening day to day.

            • tomrod

              today at 8:06 PM

              Parquet is a very efficient storage approach. Data interfaces tend to treat paths as partitions, if logical.
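To illustrate the "paths as partitions" point: many data tools read `key=value` directory segments as partition columns (the Hive-style convention). A minimal sketch of that idea, assuming a made-up path; this is not necessarily how the dataset discussed here lays out its files.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict[str, str]:
    """Extract Hive-style key=value directory segments from a file path."""
    values = {}
    # Skip the final component: it is the file name, not a partition directory.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


# Hypothetical path, for illustration only.
example = partition_values("data/year=2024/month=03/part-0.parquet")
```

A reader that understands this convention can skip whole directories when a query filters on `year` or `month`, without opening the files inside them.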

          • fabmilo

            today at 7:47 PM

            Was thinking the same thing. Probably once a day would be more than enough. If you really want minute-by-minute updates, a delta file from the previous day should probably be more than enough.

        • xnx

          today at 5:56 PM

          The best source for this data used to be Clickhouse (https://play.clickhouse.com/play?user=play#U0VMRUNUIG1heCh0a...), but it hasn't updated since 2025-12-26.

          • robotswantdata

            today at 7:35 PM

            Where’s the opt-out?

              • john_strinlai

                today at 7:38 PM

                hackernews is very upfront that they do not really care about deletion requests or anything of that sort, so, the opt out is to not use hackernews.

                • ratg13

                  today at 7:52 PM

                  Create a new account every so often, don’t leave any identifying information, occasionally switch up the way you spell words (British/US English), and alternate using different slang words and shorthand.

                    • fdghrtbrt

                      today at 8:00 PM

                      And do what I do - paste everything into ChatGPT and have it rephrase it. Not because I need help writing, but because I’d rather not have my writing style used against me.

                        • socksy

                          today at 8:10 PM

                          I can't stand this and will actively discriminate against comments I notice in that voice. Even this one has "Not because [..], but because [..]"

                  • tantalor

                    today at 7:46 PM

                    The back button

                • gkbrk

                  today at 5:51 PM

                  My Hacker News items table in ClickHouse has 47,428,860 items, and it's 5.82 GB compressed and 18.18 GB uncompressed. What makes Parquet compression worse here, when both formats are columnar?

                    • 0cf8612b2e1e

                      today at 6:00 PM

                      Sorting, compression algorithm + level, and data types can all have an impact. I noted elsewhere that a Boolean is getting represented as an integer. That’s one bit vs 1-4 bytes.

                      There is also flexibility in what you define as the dataset. Skinnier, more focused tables could save space compared with a wide table that covers everything, which will probably break up compressible runs of data.
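The "one bit vs 1-4 bytes" arithmetic can be worked out for the item count quoted in the parent comment. This is a rough sketch of raw, unencoded sizes only; run-length or dictionary encoding plus compression will typically shrink a mostly-zero flag column far below these figures, so treat them as an upper bound on the difference.

```python
ROWS = 47_428_860  # item count quoted in the parent comment


def raw_flag_bytes(rows: int, bits_per_value: int) -> float:
    """Unencoded size in bytes of one flag column, before any compression."""
    return rows * bits_per_value / 8


for label, bits in [("1 bit", 1), ("1 byte", 8), ("4 bytes", 32)]:
    print(f"{label:>7}: {raw_flag_bytes(ROWS, bits) / 1e6:.1f} MB per column")
```

That is roughly 6 MB, 47 MB, and 190 MB per flag column before encoding, which is small next to the multi-gigabyte gap being asked about.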

                      • xnx

                        today at 5:57 PM

                        Parquet has a few compression options. Not sure which one they are using.

                          • hirako2000

                            today at 6:04 PM

                            Plus, Parquet isn't the least wasteful format; native DuckDB, for instance, compacts better. That's not just down to the compression algorithm, for which, as you say, Parquet has three main options.

                    • maxloh

                      today at 7:56 PM

                      Could you also release the source code behind the automatic update system?

                      • epogrebnyak

                        today at 7:47 PM

                        Wonder why the median vote count is 0. It seems every post gets at least a few votes; maybe this was not the case in the past.

                          • epogrebnyak

                            today at 7:49 PM

                            Ahhh, I got it the moment I asked: there are usually no votes on comments.

                        • politician

                          today at 8:10 PM

                          This is great. I've soured on this site over the past few years due to the heavy partisanship that wasn't as present in the early days (eternal September), but there are still quite a few people whose opinions remain thought-provoking and insightful. I'm going to use this corpus to make a local self-hosted version of HN with the ability to a) show inline article summaries and b) follow those folks.

                          • imhoguy

                            today at 7:48 PM

                            Yay! So much knowledge in just 11GB. Adding to my end of the World hoarding stash!

                            • brtkwr

                              today at 7:44 PM

                              This comment should make it into the download in a few mins.

                                • tantalor

                                  today at 7:45 PM

                                  As should this reply

                              • mlhpdx

                                today at 6:15 PM

                                Static web content and dynamic data?

                                > The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.

                                That’s more than 5 minutes ago by a day or two. No big deal, but a little bit depressing this is still how we do things in 2026.

                                  • voxic11

                                    today at 7:29 PM

                                    That is just the archive part, if you just would finish reading the paragraph you would know that updates since 2026-03-16 23:55 UTC "are fetched every 5 minutes and committed directly as individual Parquet files through an automated live pipeline, so the dataset stays current with the site itself."

                                    So to get all the data you need to grab the archive and all the 5 minute update files.

                                    archive data is here https://huggingface.co/datasets/open-index/hacker-news/tree/...

                                    update files are here (I know that it's called "today" but it actually includes all the update files, which span multiple days at this point) https://huggingface.co/datasets/open-index/hacker-news/tree/...
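Combining the archive with the 5-minute update files could look roughly like the sketch below. It uses made-up in-memory rows in place of the Parquet contents, and it assumes (this is not confirmed in the thread) that an item may appear in both sets, so it keeps the most recently applied row per `id`.

```python
def merge_items(archive, updates):
    """Combine archive rows with update rows, keeping the latest row per id."""
    by_id = {row["id"]: row for row in archive}
    # Updates are applied after the archive, so a re-fetched item replaces
    # the archived copy with the same id.
    for row in updates:
        by_id[row["id"]] = row
    return [by_id[item_id] for item_id in sorted(by_id)]


# Made-up rows standing in for the contents of the Parquet files.
archive_rows = [{"id": 1, "text": "first"}, {"id": 2, "text": "old"}]
update_rows = [{"id": 2, "text": "edited"}, {"id": 3, "text": "new"}]
combined = merge_items(archive_rows, update_rows)
```

If the two sets never overlap, the deduplication step is harmless and the result is a plain concatenation sorted by `id`.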

                                      • john_strinlai

                                        today at 7:33 PM

                                        >if you just would finish reading the paragraph

                                        probably uncalled for

                                    • xandrius

                                      today at 7:17 PM

                                      I don't get what you meant with this comment.

                                        • john_strinlai

                                          today at 7:25 PM

                                          the data updates every 5 minutes, but the description on huggingface says the last update was 2 days ago.

                                          they are suggesting that the huggingface description should be automatically updating the date & item count when the data gets updated.

                                            • voxic11

                                              today at 7:30 PM

                                              No that is the date at which the bulk archive ends and the 5 minute update files begin, so it should not be updated.


                                  • kshacker

                                    today at 7:04 PM

                                    Good for demo but every 5 minutes? Why?

                                      • Imustaskforhelp

                                        today at 7:18 PM

                                        It can have some good use cases I can think of. Personally I really appreciate the 5 minute update.

                                    • alstonite

                                      today at 6:52 PM

                                      What happened between 2023 and 2024 to cause the usage dropoff?

                                        • ghgr

                                          today at 6:56 PM

                                          I'd say it's less a usage dropoff and more a reversion to the mean after Covid

                                            • tehjoker

                                              today at 7:04 PM

                                              That's a possible hypothesis, but there was also a rising trend prior, it wasn't stable.

                                          • imhoguy

                                            today at 7:41 PM

                                            Return to office

                                        • lyu07282

                                          today at 6:56 PM

                                          Please upload to https://academictorrents.com/ as well if possible

                                          • palmotea

                                            today at 5:34 PM

                                            > At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory.

                                            Wouldn't that lose deleted/moderated comments?

                                              • BoredPositron

                                                today at 6:57 PM

                                                I guess that's the point.

                                                  • Imustaskforhelp

                                                    today at 7:20 PM

                                                    Can't someone create an automatic script which can just copy the files say 5 minutes before midnight UTC?
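The timing part of such a script is simple to sketch. The function below only computes how long to wait until a few minutes before the next midnight UTC; the copy step itself is left out, and the 5-minute lead is just the figure suggested in the comment above.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_snapshot(now: datetime, lead_minutes: int = 5) -> float:
    """Seconds from `now` until `lead_minutes` before the next midnight UTC."""
    now = now.astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    target = next_midnight - timedelta(minutes=lead_minutes)
    if target <= now:
        # Already inside the final window today; aim for tomorrow's instead.
        target += timedelta(days=1)
    return (target - now).total_seconds()


# Example: at 23:00 UTC the snapshot window opens in 55 minutes.
wait = seconds_until_snapshot(datetime(2026, 3, 18, 23, 0, tzinfo=timezone.utc))
```

A scheduler such as cron would do the same job; the function form just makes the edge case near midnight explicit.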

                                            • 0cf8612b2e1e

                                              today at 5:50 PM

                                              Under the Known Limitations section

                                                deleted and dead are integers. They are stored as 0/1 rather than booleans.
                                              
                                              Is there a technical reason to do this? You have the type right there.
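Whatever the reason, the 0/1 flags are easy to cast back on the reader's side. A minimal sketch on a made-up row; only the `deleted` and `dead` column names come from the README quote, and missing values are left untouched.

```python
def flags_to_bool(row: dict) -> dict:
    """Return a copy of the row with the 0/1 flag columns cast to bool."""
    converted = dict(row)
    for column in ("deleted", "dead"):
        if converted.get(column) is not None:
            converted[column] = bool(converted[column])
    return converted


# Made-up row, for illustration only.
fixed = flags_to_bool({"id": 1, "deleted": 0, "dead": 1})
```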

                                              • Imustaskforhelp

                                                today at 7:17 PM

                                                As someone who once built a project analysing Hacker News with ClickHouse, I really feel like this project was made for me (especially the updated-every-5-minutes aspect, which could have helped my project back then too!)

                                                Your project actually helps me a ton with one of the new Hacker News project ideas that I had put on the back burner.

                                                I had thought of making a ping website: people write @Username, and a service detects it and sends mail to that user if they have signed up to the service (similar to a service run by someone from the HN community that mails you every time someone responds to your thread directly, but this time as a sort of ping).

                                                [The idea came when I tried to ping someone to show them something relevant and thought, wait a minute, a ping that sends mail might be interesting. I then tried to see whether I could use Algolia or any other service to hook things up, but sadly none of them made much sense back then, so the idea stayed in the back of my mind. This dataset sort of solves it by being updated every 5 minutes.]

                                                Your 5 minute updates really make it possible. I will look at what I can do with them in a few days, but I am seeing a discrepancy: the last update in the README seems to be 16 March, so I would love to know more about whether it is really being updated every 5 minutes. It truly feels phenomenal if true, and it's exciting to think of the new possibilities it unlocks.

                                                • tonymet

                                                  today at 6:58 PM

                                                  what's the license for HN content?

                                                    • echelon

                                                      today at 7:04 PM

                                                      At this point, you can train on anything without repercussion.

                                                      Copyright doesn't seem to matter unless you're an IP cartel or mega cap.

                                                        • marginalia_nu

                                                          today at 7:09 PM

                                                          Laughs nervously in jurisdiction without fair use doctrine

                                                  • Onavo

                                                    today at 5:22 PM

                                                    Is it possible to download only a subset, e.g. Show HNs or HN Whoishiring? The Show HNs and HN Whoishiring are very useful for classroom data science, i.e. a very useful set of data for students to learn the basics of data cleaning and engineering.

                                                      • nelsondev

                                                        today at 5:25 PM

                                                        It’s date partitioned, you could download just a date range. It’s also parquet, so you can download just specific columns with the right client
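Picking a date range then comes down to listing the months you want and fetching only those files. A minimal sketch, assuming a hypothetical `YYYY-MM.parquet` naming scheme; check the repository for the real file names before relying on it.

```python
def months_in_range(start: str, end: str) -> list[str]:
    """List every YYYY-MM month from start to end, inclusive."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


# Hypothetical naming scheme, for illustration only.
wanted_files = [f"{m}.parquet" for m in months_in_range("2023-11", "2024-02")]
```

Note that this narrows by date only; selecting Show HN or Whoishiring posts would still need a filter on the rows after download.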

                                                    • lokimoon

                                                      today at 6:59 PM

                                                      You are the product

                                                        • waynesonfire

                                                          today at 7:44 PM

                                                          Your reward is the endorphin hit from writing this comment.

                                                      • bstsb

                                                        today at 5:27 PM

                                                        what’s the license? “do whatever the fuck you want with the data as long as you don’t get caught”? or does that only work for massive corporations

                                                          • BoredPositron

                                                            today at 7:02 PM

                                                            The universal license.

                                                        • GeoAtreides

                                                          today at 5:42 PM

                                                          is the legal page a placeholder, do words have no meaning?

                                                          https://www.ycombinator.com/legal/

                                                          Mods, enforce your license terms, you're playing fast and loose with the law (GDPR/CPRA)

                                                            • Retr0id

                                                              today at 5:51 PM

                                                              Which terms are not being enforced? (not disagreeing I just don't feel like reading a large legal document)

                                                                • GeoAtreides

                                                                  today at 6:02 PM

                                                                  > By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies

                                                                  The user content is supposed to be licensed only to Y Combinator and (bleah) its affiliated companies (which are many, all the startups they fund, for example).

                                                                    • jmalicki

                                                                      today at 6:44 PM

                                                                      Curious why it should be on HackerNews to enforce restrictions on content they only license from you?

                                                                      If it's owned by you and only licensed by HN shouldn't you be the one enforcing it?

                                                                        • AndrewKemendo

                                                                          today at 7:02 PM

                                                                          Seems like they are trying to do that through the stated legal intermediary (YC)

                                                                      • zamadatix

                                                                        today at 6:50 PM

                                                                        If you carry on the quote two more words:

                                                                        > ... a nonexclusive

                                                                        I.e. this section is about rights to the content you post ALSO going to YC. It does not say that YC (+friends) will be the only one to hold these rights, or that YC will enforce on your behalf who else may hold rights to your publicly shared content.

                                                                        There's a more intricate conversation to be had with GDPR and public data on forums in general but that's wholly unrelated to what YC's legal page says and still unlikely to end up in an alarming result.

                                                                        • ryandvm

                                                                          today at 6:35 PM

                                                                          That agreement is largely about "Personal Information", not the posts and comments.

                                                                          That said, there are "no scraping" and "commercial use restricted" carve-outs for the content on HN. Which honestly is bullshit.

                                                                      • ungruntled

                                                                        today at 6:01 PM

                                                                        None that I could see:

                                                                        Your submissions to, and comments you make on, the Hacker News site are not Personal Information and are not "HN Information" as defined in this Privacy Policy.

                                                                        Other Users: certain actions you take may be visible to other users of the Services.

                                                                          • GeoAtreides

                                                                            today at 6:03 PM

                                                                            I mean, just because they say the comments are not PI doesn't make it so.

                                                                              • ungruntled

                                                                                today at 6:09 PM

                                                                                That’s a good point. I’m only referring to the terms they used in the privacy policy.

                                                                    • ryandvm

                                                                      today at 6:33 PM

                                                                      Eh, fuck that agreement. I'm kind of old school in that I believe if you put it on the internet without an auth-wall, people should be allowed to do whatever they want with it. The AI companies seem to agree.

                                                                      Then again, I'm not the guy that is going to get sued...

                                                                        • Ylpertnodi

                                                                          today at 6:51 PM

                                                                          > I believe if you put it on the internet without an auth-wall, people should be allowed to do whatever they want with it.

                                                                          I agree. It's the owners of the sites that have to follow rules, not us.

                                                                          • kmeisthax

                                                                            today at 6:57 PM

                                                                            "I'm kind of old school in that I believe if you put grass on the ground without a fence, people should be allowed to do whatever they want with it. The noblemen with a thousand cows seem to agree."

                                                                            And that, my friends, is how you kill the commons - by ignoring the social context surrounding its maintenance and insisting upon the most punitive ways of avoiding abuse.

                                                                              • petercooper

                                                                                today at 7:09 PM

                                                                                Context is important, but isn’t HN’s social context, in particular, that the site is entirely public, easily crawled through its API (which apparently has next to no rate limits) and/or Algolia, and has been archived and mirrored in numerous places for years already?

                                                                                • echelon

                                                                                  today at 7:06 PM

                                                                                  Signal and information are not grass.

                                                                                  Grass and property require upkeep. Radio waves and electromagnetic radiation do not.

                                                                                  I don't want your dog to piss on my lawn and kill my grass. But what harm does it cause me if you take a picture of my lawn? Or if I take a picture of your dog?

                                                                                  If I spend $100M making a Hollywood movie - pay employees, vendors, taxes - contribute to the economic growth of the country - and then that product gets stolen and given away completely for free without being able to see upside, that's a little bit different.

                                                                                  But my Hacker News comment? It's not money.

                                                                                  I think there are plausible ways to draw lines that protect genuine work, effort, and economics while allowing society and innovation to benefit from the commons.

                                                                              • hrmtst93837

                                                                                today at 7:30 PM

                                                                                [dead]

                                                                            • hsuduebc2

                                                                              today at 5:59 PM

                                                                              How is he breaking GDPR here?

                                                                              • andrewmcwatters

                                                                                today at 5:50 PM

                                                                                They already refuse to comply with CPRA, instead electing to replace your username with a random 6(?) character string, prefixed with `_`, if I remember correctly.

                                                                                I know, because I've been here since maybe 2015 or so, but this account was created in 2019.

                                                                                So any PII you have mentioned in your comments is permanent on Hacker News.

                                                                                I would appreciate it if they gave users the ability to remove all of their personal data, but in correspondence and in writing here on Hacker News itself, Dan has suggested that they value the posterity of conversations over the law.

                                                                                  • stopbulying

                                                                                    today at 7:10 PM

                                                                                    [dead]