I Scraped 47M+ Hacker News Items Into Parquet Files – Here's What I Discovered About HN's Hidden Data Patterns

Source: DEV Community
Last week, I stumbled upon an incredible dataset that made my data engineer heart skip a beat: a complete Hacker News archive containing over 47 million items compressed into just 11.6 GB of Parquet files, updated every 5 minutes. After diving deep into this treasure trove of Silicon Valley's collective consciousness, I discovered some fascinating patterns that every developer should know about.

If you've ever wondered what makes content go viral on HN, when the best time to post is, or how the community has evolved over the years, this dataset holds the answers. Let me walk you through what I found and how you can start analyzing HN data yourself.

What Makes This Dataset Special?

The Hacker News archive on Hugging Face isn't just another web scrape. It's a meticulously maintained collection that captures every story, comment, job posting, and Ask HN thread since HN's inception. What makes it particularly powerful is the Parquet format – a columnar storage format that's perfect for anal