Analyzing HN Readers' Personal Blogs Part 1
On April 7th, 2020, an Ask HN post titled “Ask HN: What is your blog and why should I read it?” bubbled up to the front page of HN: https://news.ycombinator.com/item?id=22800136
It was great reading through all of the comments from HN readers posting about their blogs/personal sites, why they write, and what they write about.
It was really inspiring to read and relate to everyone. It also inspired me to analyze everyone’s personal website. What technologies are HN people using? What does a typical HN personal site look like?
The initial step was straightforward enough. I thought about using the HN API, but just ended up copying and pasting all of the text from the entire post and then using a regex on the command line to spit out a list of URLs.
I then did some manual data cleansing: removing HN links, Twitter links, duplicates, etc., until I only had unique top-level domains of personal blogs.
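The exact regex was not published, so here is only a rough Python sketch of the same idea. The pattern `URL_PATTERN` and the helper `extract_urls` are my own assumptions: a deliberately simple match on `http://` or `https://` followed by non-whitespace characters, not a full URL grammar.

```python
import re

# Simplified pattern (an assumption, not the original command-line regex):
# "http://" or "https://" followed by anything that is not whitespace,
# a quote, an angle bracket, or a closing parenthesis/bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return every URL-like substring, with trailing punctuation trimmed."""
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]


sample = (
    "My blog is https://example.com/blog, mostly about Rust. "
    "Older posts live at http://blog.example.org."
)
print(extract_urls(sample))
# ['https://example.com/blog', 'http://blog.example.org']
```

Trimming trailing punctuation matters for pasted comment text, where a URL often ends a sentence.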
547 initial raw URLs -> 382 unique blog URLs
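The cleanup was done by hand, but most of it could probably be scripted. A minimal sketch, assuming the raw URLs are already in a list; `EXCLUDED_HOSTS` and `unique_blog_hosts` are hypothetical names, and the exclusion list is only a guess at what was filtered out.

```python
from urllib.parse import urlsplit

# Hosts to drop. This list is an assumption, not the exact manual filter.
EXCLUDED_HOSTS = {"news.ycombinator.com", "twitter.com"}


def unique_blog_hosts(urls):
    """Reduce raw URLs to a sorted list of unique hosts, minus excluded ones."""
    hosts = set()
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one site
        if host and host not in EXCLUDED_HOSTS:
            hosts.add(host)
    return sorted(hosts)


raw = [
    "https://example.com/blog",
    "https://www.example.com/about",
    "https://news.ycombinator.com/item?id=22800136",
    "https://twitter.com/someone",
    "http://blog.example.org/",
]
print(unique_blog_hosts(raw))
# ['blog.example.org', 'example.com']
```

A script like this would still miss judgment calls, such as company sites or links to other people's articles, so a manual pass is probably still needed afterwards.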
Initial Analysis: Wappalyzer
For this initial analysis I used the open-source npm module of Wappalyzer. They also have a paid version on their website, https://www.wappalyzer.com/, if you do not want to deal with the CLI. [No relationship]
I made a bash script that went through my list of 382 unique URLs and saved the output for each one to an individual JSON file.
This process returned status code 200 and Wappalyzer data for 370 of the 382 URLs. The other 12 URLs were discarded; I was too lazy to redo them or check them manually.
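The bash script itself is not shown here. For illustration, this is a rough Python equivalent of that loop, not the original script. It assumes the `wappalyzer` command from the npm package is on the PATH and prints JSON to stdout when given a URL; `run_wappalyzer`, `scan_urls`, and the file naming are hypothetical.

```python
import json
import subprocess
from pathlib import Path


def run_wappalyzer(url):
    """Run the CLI for one URL and return its stdout (assumed to be JSON)."""
    # Assumption: `wappalyzer <url>` is installed and writes JSON to stdout.
    result = subprocess.run(
        ["wappalyzer", url], capture_output=True, text=True, timeout=120
    )
    return result.stdout


def scan_urls(urls, out_dir, runner=run_wappalyzer):
    """Write one JSON file per URL; return the URLs that gave no valid JSON."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    failed = []
    for index, url in enumerate(urls):
        try:
            data = json.loads(runner(url))
        except (json.JSONDecodeError, subprocess.SubprocessError, OSError):
            # Covers empty or garbled output, timeouts, and a missing CLI.
            failed.append(url)
            continue
        (out_path / f"{index:04d}.json").write_text(json.dumps(data, indent=2))
    return failed
```

A call such as `scan_urls(["https://example.com"], "wappalyzer_output")` would write `wappalyzer_output/0000.json`. Returning the failed URLs makes it cheap to retry them later instead of discarding them.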
I then fired up a trusty Jupyter notebook using pipenv and pulled all the data into a single glorious dataframe with 2,315 rows.
- I ran this analysis very quickly
- This is my first time using Wappalyzer in this way
- I’m not sure how accurate the open-source Wappalyzer tool is or what its limitations are
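Here is a sketch of how per-site JSON files can be flattened into rows for a dataframe. The JSON layout is a simplified, hypothetical one (`url`, plus `technologies` with `name` and `categories`); real Wappalyzer output differs between versions, so the key names would need adjusting. `load_rows` is also a made-up helper.

```python
import json
from pathlib import Path


def load_rows(json_dir):
    """Flatten every JSON file into one row per (site, technology, category)."""
    rows = []
    for path in sorted(Path(json_dir).glob("*.json")):
        report = json.loads(path.read_text())
        # Hypothetical keys; adjust to the actual Wappalyzer output.
        for tech in report.get("technologies", []):
            for category in tech.get("categories", []):
                rows.append(
                    {
                        "site": report.get("url", path.stem),
                        "technology": tech["name"],
                        "category": category,
                    }
                )
    return rows
```

With pandas installed, `pandas.DataFrame(load_rows("wappalyzer_output"))` turns these rows into a dataframe. One row per detected technology and category is also how a few hundred sites can plausibly add up to a couple of thousand rows.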
- Always take everything with a few grains of salt…life tastes better that way
And last, but not least… the “Insights”
Analytics Usage (Google is listening…)
- 224/370 = 61% use some form of analytics or tracking software on their blogs
- 174/370 = 47% use Google Analytics on their blogs
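The percentages throughout this post are site counts divided by the 370 scanned sites. A small sketch of that arithmetic, assuming rows in a hypothetical (site, technology, category) layout; `share_of_sites` is an illustrative helper, not the notebook code.

```python
def share_of_sites(rows, total_sites, *, category=None, technology=None):
    """Count distinct matching sites; return (count, rounded percent)."""
    sites = {
        row["site"]
        for row in rows
        if (category is None or row["category"] == category)
        and (technology is None or row["technology"] == technology)
    }
    return len(sites), round(100 * len(sites) / total_sites)


# Toy data: a.example runs two analytics tools but is one site.
rows = [
    {"site": "a.example", "technology": "Google Analytics", "category": "Analytics"},
    {"site": "a.example", "technology": "Matomo", "category": "Analytics"},
    {"site": "b.example", "technology": "Matomo", "category": "Analytics"},
    {"site": "c.example", "technology": "Nginx", "category": "Web servers"},
]
print(share_of_sites(rows, 4, category="Analytics"))
# (2, 50)
print(share_of_sites(rows, 4, technology="Google Analytics"))
# (1, 25)
```

Counting distinct sites rather than rows keeps a blog with two analytics tools from being counted twice.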
Advertising (It’s not a hobby dammit!)
- Only a small fraction of the sites have some detectable advertising framework
- 27/370 = 7% - with the majority being Google AdSense
|DoubleClick Ad Exchange (AdX)||1|
Content Management Systems (WP is still dominant)
- 121/370 = 33% use a content management system that was detected by Wappalyzer. I’m sure more people use a CMS on the backend to manage posts locally for static sites as well.
- 76/370 = 21% use WordPress
Web servers (Mostly Apache and Nginx)
- 91/370 = 25% use Nginx as a reverse proxy for their site
- OpenGSE (Google Open Source Blog) is the most popular “non-traditional” web server with 10/370
Platform as a Service (GitHub Pages and Netlify)
- 51/370 = 14% of blogs use Netlify for hosting and CDN. [I do too :)]
- 59/370 = 16% of blogs use GitHub Pages
|Amazon Web Services||18|
- 47/370 = 13% use Bootstrap
|Material Design Lite||1|
Static Site Generators
I personally use and love Jekyll. Great that 82/370 = 22% of sites are using some form of static site generation. HTML > everything.
|Static Site Generators||Count|
Link to raw CSV file for your own enjoyment.
What else would be fun?
In future parts of this series it would be fun to look at:
- The loading speed and performance of these websites
- The average “size” or “weight” of these websites
- Number of posts on the blog
- Type of content
- The frequency of posts
- Do they have RSS!??
- Automate the data collection using the HN API
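For that last item, here is a rough sketch against the public HN Firebase API, where every story and comment is an item with an optional `kids` list of child IDs. The helper names `fetch_item` and `collect_comment_texts` are hypothetical, and the code makes one request per comment, so a large thread would be slow without batching or concurrency.

```python
import json
from urllib.request import urlopen

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Fetch one HN item (story or comment) as a dict; None if it is missing."""
    with urlopen(ITEM_URL.format(item_id), timeout=30) as response:
        return json.load(response)


def collect_comment_texts(item_id, fetch=fetch_item):
    """Walk the comment tree under item_id and return each comment's HTML text."""
    texts = []
    stack = [item_id]
    while stack:
        item = fetch(stack.pop())
        if not item:
            continue
        # Deleted comments carry no "text" field, so they are skipped here.
        if item.get("type") == "comment" and item.get("text"):
            texts.append(item["text"])
        stack.extend(item.get("kids", []))
    return texts
```

Calling `collect_comment_texts(22800136)` would walk the Ask HN thread linked at the top of this post. The returned strings are HTML, so links would still need to be pulled out of them, and the traversal order is not the on-page thread order.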
Discussion on Hacker News
Hope you enjoyed this, Danny
Thank you @stevekemp and @stared for the edits and suggestions