Analyzing HN Readers' Personal Blogs Part 1

On April 7th, 2020, an Ask HN post titled “Ask HN: What is your blog and why should I read it?” bubbled up to the front page of HN: https://news.ycombinator.com/item?id=22800136

It was great reading through all of the comments from HN readers posting about their blogs and personal sites, why they write, and what they write about.

It was really inspiring to read and relate to everyone. It also inspired me to analyze everyone’s personal website. What technologies are HN people using? What does a typical HN personal site look like?

Data Collection

The initial step was straightforward enough. I thought about using the HN API, but ended up copying and pasting all of the text from the entire post and then using a regex on the command line to spit out a list of URLs.
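
For anyone who would rather do this step in Python than on the command line, here is a minimal sketch. The pattern and the `extract_urls` name are assumptions for illustration, not the exact regex used for this post, and the sample text is made up.

```python
import re

# Assumed pattern: "http://" or "https://" followed by anything up to
# whitespace or a character that usually closes a URL in running text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return every URL-looking substring in text, in order of appearance."""
    # Trailing punctuation tends to cling to URLs in prose, so trim it off.
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]


# Made-up sample standing in for the pasted comment text.
sample = "My blog is https://example.com/blog. Also see http://example.org/, thanks!"
print(extract_urls(sample))
```

A pattern this loose will also pick up links that are not blogs at all, which is why a cleaning pass is still needed afterwards.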

I then did some manual data cleansing: removing HN links, Twitter links, duplicates, etc., until I only had the unique domains of personal blogs.
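
The cleaning here was done by hand, but part of it could be scripted. The sketch below is one possible approach under stated assumptions: the blocklist only contains the two sites named above, `unique_blog_hosts` is a hypothetical helper, and stripping a leading `www.` is my own choice for spotting duplicates.

```python
from urllib.parse import urlsplit

# Assumed blocklist: the post only names HN and Twitter links explicitly.
EXCLUDED_HOSTS = {"news.ycombinator.com", "twitter.com"}


def unique_blog_hosts(urls):
    """Reduce raw URLs to de-duplicated hostnames, keeping first-seen order."""
    seen = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if not host or host in EXCLUDED_HOSTS:
            continue
        seen.setdefault(host, None)
    return list(seen)


# Made-up input for illustration.
raw = [
    "https://news.ycombinator.com/item?id=22800136",
    "https://www.example.com/posts/1",
    "http://example.com/about",
    "https://twitter.com/someone",
    "https://blog.example.org",
]
print(unique_blog_hosts(raw))
```

This would not catch everything a manual pass does (links to GitHub repos or to individual articles on shared platforms, for example), so treat it as a first filter.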

547 initial raw URLs -> 382 unique blog URLs

Initial Analysis: Wappalyzer

For this initial analysis I used the open source npm module of Wappalyzer. They also have a paid version on their website, https://www.wappalyzer.com/, if you do not want to deal with the CLI. [No relationship]

I made a bash script that went through my list of 382 unique URLs and saved the output for each one to an individual JSON file.
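
The original was a bash script; the sketch below shows the same idea in Python. It assumes the `wappalyzer` CLI is on the PATH and prints a JSON report to stdout when called as `wappalyzer <url>`, so check that against the version you install. The function names and the `site_000.json` file naming are hypothetical.

```python
import json
import subprocess
from pathlib import Path


def run_wappalyzer(url):
    """Call the CLI and return its stdout.

    ASSUMPTION: `wappalyzer <url>` prints a JSON report to stdout.
    """
    result = subprocess.run(
        ["wappalyzer", url], capture_output=True, text=True, timeout=120
    )
    return result.stdout


def scan_all(urls, out_dir, runner=run_wappalyzer):
    """Scan each URL, write one JSON file per site, and return the URLs that failed."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    failed = []
    for index, url in enumerate(urls):
        try:
            data = json.loads(runner(url))
        except (json.JSONDecodeError, subprocess.SubprocessError, OSError):
            # Timeouts, a missing CLI, or non-JSON output all land here.
            failed.append(url)
            continue
        (out_path / f"site_{index:03d}.json").write_text(json.dumps(data, indent=2))
    return failed
```

Returning the failed URLs makes it easy to retry or inspect them later instead of silently dropping them.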

From this process, 370 of the 382 URLs returned status code 200 along with data from Wappalyzer. The other 12 URLs were discarded; I was too lazy to redo them or check them manually.

Data Wrangling

I then pulled up a trusty Jupyter notebook using pipenv and pulled all the data into a single glorious dataframe with 2315 rows.
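
As a rough idea of what that wrangling step can look like, here is a sketch that flattens a folder of JSON reports into one row per detected technology and category. The field names (`technologies`, `name`, `categories`) are an assumption about the report layout and may differ between Wappalyzer versions, and `flatten_reports` is a hypothetical helper, not the notebook code behind this post.

```python
import json
from pathlib import Path


def flatten_reports(json_dir):
    """Build one row per (site, technology, category) from saved JSON reports."""
    rows = []
    for path in sorted(Path(json_dir).glob("*.json")):
        report = json.loads(path.read_text())
        # ASSUMPTION: each report holds a "technologies" list whose entries have
        # a "name" and a list of "categories", each with its own "name".
        for tech in report.get("technologies", []):
            for category in tech.get("categories", []):
                rows.append(
                    {
                        "site": path.stem,
                        "technology": tech.get("name"),
                        "category": category.get("name"),
                    }
                )
    return rows
```

The resulting list of dicts can be handed straight to `pandas.DataFrame(rows)` inside the notebook.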

Disclaimer

  1. I ran this analysis very quickly
  2. This is my first time using Wappalyzer in this way
  3. I’m not sure how accurate the open source Wappalyzer tool is or what its limitations are
  4. Always take everything with a few grains of salt…life tastes better that way

And last, but not least… the “Insights”

Analytics Usage (Google is listening…)

  1. 224/370 = 61% use some form of analytics or tracking software on their blogs
  2. 174/370 = 47% use Google Analytics on their blogs

| Analytics tool | Count |
| --- | --- |
| Google Analytics | 174 |
| Parse.ly | 13 |
| TrackJs | 7 |
| Optimizely | 6 |
| Clicky | 4 |
| Matomo | 4 |
| New Relic | 3 |
| Statcounter | 3 |
| BugSnag | 2 |
| Segment | 2 |
| Simple Analytics | 2 |
| WP-Statistics | 2 |
| Gauges | 1 |
| Intercom | 1 |
| Grand Total | 224 |
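
One caveat with tables like this: a grand total adds up detections, so a blog running two analytics tools would be counted twice. Here is a sketch of how per-tool counts and the share of distinct sites could be computed side by side. The rows are made up, and the `site` / `technology` / `category` column names and the `summarize` helper are assumptions, not the actual dataframe schema.

```python
from collections import Counter


def summarize(rows, category, total_sites):
    """Return per-tool detection counts and the share of distinct sites."""
    in_category = [row for row in rows if row["category"] == category]
    tool_counts = Counter(row["technology"] for row in in_category)
    distinct_sites = {row["site"] for row in in_category}
    return tool_counts, len(distinct_sites) / total_sites


# Made-up rows: site "a" runs two analytics tools, site "b" runs one.
rows = [
    {"site": "a", "technology": "Google Analytics", "category": "Analytics"},
    {"site": "a", "technology": "Matomo", "category": "Analytics"},
    {"site": "b", "technology": "Google Analytics", "category": "Analytics"},
    {"site": "c", "technology": "WordPress", "category": "CMS"},
]

tool_counts, share = summarize(rows, "Analytics", total_sites=4)
print(tool_counts.most_common())
print(sum(tool_counts.values()), f"{share:.0%}")  # 3 detections, 50% of sites
```

In this toy example the detections add up to 3, but only 2 of the 4 sites use analytics at all, so the two numbers answer different questions.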

Advertising (It’s not a hobby dammit!)

  1. Only a small fraction of the sites have some detectable advertising framework
  2. 27/370 = 7%, with the majority being Google AdSense

| Advertising | Count |
| --- | --- |
| Google AdSense | 17 |
| Carbon Ads | 5 |
| BuySellAds | 4 |
| DoubleClick Ad Exchange (AdX) | 1 |
| Grand Total | 27 |

Content Management Systems (WP is still dominant)

  1. 121/370 = 33% use a content management system that was detected by Wappalyzer. I’m sure more people are using a CMS on the backend to manage posts locally for their static sites as well.
  2. 76/370 = 21% use WordPress

| CMS | Count |
| --- | --- |
| WordPress | 76 |
| Ghost | 17 |
| Medium | 13 |
| Blogger | 10 |
| Wix | 1 |
| Joomla | 1 |
| Squarespace | 1 |
| Svbtle | 1 |
| Tumblr | 1 |
| Grand Total | 121 |

Web servers (Mostly Apache and Nginx)

  1. 91/370 = 25% use Nginx, either as a web server or as a reverse proxy in front of their site
  2. OpenGSE (Google Open Source Blog) is the most popular “non-traditional” web server with 10/370

| Web Servers | Count |
| --- | --- |
| Nginx | 91 |
| Apache | 40 |
| OpenGSE | 10 |
| Cowboy | 7 |
| LiteSpeed | 5 |
| OpenResty | 5 |
| Caddy | 4 |
| Now | 2 |
| lighttpd | 1 |
| Phusion Passenger | 1 |
| Grand Total | 166 |

Platform as a Service (GitHub Pages and Netlify)

  1. 51/370 = 14% of blogs use Netlify for hosting and CDN. [I do too :)]
  2. 59/370 = 16% of blogs use GitHub Pages

| PaaS | Count |
| --- | --- |
| GitHub Pages | 59 |
| Netlify | 51 |
| Automattic | 19 |
| Amazon Web Services | 18 |
| SiteGround | 5 |
| Flywheel | 2 |
| WP Engine | 1 |
| Grand Total | 155 |

Programming Languages

| Programming Languages | Count |
| --- | --- |
| PHP | 83 |
| Ruby | 67 |
| Node.js | 45 |
| Python | 13 |
| Java | 11 |
| Erlang | 7 |
| Lua | 5 |
| Go | 4 |
| Perl | 1 |
| Grand Total | 236 |

UI Frameworks

  1. 47/370 = 13% use Bootstrap

| UI Frameworks | Count |
| --- | --- |
| Bootstrap | 47 |
| animate.css | 5 |
| ZURB Foundation | 3 |
| Pure CSS | 1 |
| Material Design Lite | 1 |
| Bulma | 1 |
| Grand Total | 58 |

Static Site Generators

I personally use and love Jekyll. It’s great that 82/370 = 22% of sites are using some form of static site generation. HTML > everything.

| Static Site Generators | Count |
| --- | --- |
| Hugo | 41 |
| Jekyll | 24 |
| Gatsby | 12 |
| Hexo | 3 |
| Pelican | 1 |
| VuePress | 1 |
| Grand Total | 82 |

Raw Data

Link to raw CSV file for your own enjoyment.

Download here

What else would be fun?

In future parts of this series it would be fun to look at:

  1. The loading speed and performance of these websites
  2. The average “size” or “weight” of these websites
  3. Number of posts on the blog
  4. Type of content
  5. The frequency of posts
  6. Do they have RSS!??
  7. Automate the data collection using the HN API
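
For the last item, the public HN API serves each story or comment as JSON at `https://hacker-news.firebaseio.com/v0/item/<id>.json`, with child comment ids listed under `kids`. A minimal sketch of walking a thread that way follows; the helper names are hypothetical and this has not been run against the full thread.

```python
import json
from urllib.request import urlopen

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Download one HN item (story or comment) as a dict."""
    with urlopen(ITEM_URL.format(item_id), timeout=30) as response:
        return json.load(response)


def collect_comment_texts(item_id, fetch=fetch_item):
    """Walk the comment tree below item_id and return each comment's HTML text."""
    texts = []
    pending = [item_id]
    while pending:
        item = fetch(pending.pop())
        if item is None:
            continue
        is_visible = not item.get("deleted") and not item.get("dead")
        if item.get("type") == "comment" and item.get("text") and is_visible:
            texts.append(item["text"])
        # Deleted comments can still have replies, so always follow the kids.
        pending.extend(item.get("kids", []))
    return texts
```

Calling `collect_comment_texts(22800136)` would make one request per comment, so it is slow on a big thread. The comment text comes back as HTML, so it likely needs something like `html.unescape` before running a URL regex over it.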

Discussion on Hacker News

https://news.ycombinator.com/item?id=22822401

Hope you enjoyed this, Danny

Thank you @stevekemp and @stared for the edits and suggestions

· data analysis, viz, python, project