Back in the days when I still read the news (good ol' times), which was only a month ago, I was an avid lurker on Hacker News, a great source of technology-related news with a great community. It works much like Reddit: people submit articles or ask questions, and others vote them up.
In this series of blog posts, I aim to scrape data from Hacker News submissions and analyze the data to investigate all there is to investigate about the site such as which domains people upvote the most, which domains are the least likely to gain success, and so on.
It will not be an in-depth tutorial but should hopefully cover enough material so that anyone can replicate my scraping tactics and analysis techniques.
Setting up the scraper 🤖
Fortunately for us, Hacker News, despite its archaic HTML layout, is a very simple site with no AJAX loading and other fancy bits. Thus, scraping the site is extremely easy.
The entire scraper code fits into 45 lines, with another 45 dedicated to setting up the database and saving the results. Let's start at the top by importing all the necessary modules needed for the project:
```python
import requests
import datetime
from sqlalchemy import Column, Integer, String, DateTime, create_engine, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from bs4 import BeautifulSoup
import os
```
Nothing fancy there. We will use the requests module to fetch the HTML and BeautifulSoup to parse it. Also, with the help of SQLAlchemy, we are going to use SQLite to store the results for analysis.
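Before diving into the real parser further down, here is a minimal sketch of the BeautifulSoup pattern it relies on, run against a stripped-down imitation of HN's table markup (not the real page — class names match, but the HTML is simplified): each story sits in a `tr.athing` row, and its metadata lives in the very next table row.

```python
from bs4 import BeautifulSoup

# Simplified imitation of HN's layout: a story row followed by a metadata row.
html = """
<table>
  <tr class="athing" id="123">
    <td><span class="rank">1.</span></td>
    <td><a class="storylink" href="http://example.com">Example story</a></td>
  </tr>
  <tr>
    <td><span class="score">42 points</span> by <a class="hnuser">alice</a></td>
  </tr>
</table>
"""

parser = BeautifulSoup(html, "html.parser")
for story in parser.select("tr.athing"):
    info = story.find_next_sibling("tr")               # metadata is in the next row
    rank = story.select_one(".rank").text.rstrip(".")  # "1." -> "1"
    title = story.select_one(".storylink").text
    score = info.select_one(".score").text.split(" ")[0]
    print(rank, title, score)  # → 1 Example story 42
```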
Now, let's set up the database using SQLAlchemy's recommended Object Relational Mapper. Note that we build an absolute path to our database from the environment variable HN_SCRAPER_HOME. This is necessary because we will later run our scraper bot with crontab, which runs from a different working directory and would save the database in the wrong place if a relative path were used.
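To make the path handling concrete, here is a small hypothetical helper (not part of the scraper itself) that builds the SQLite URL from the environment variable, with a fallback to the current directory for local testing:

```python
import os

# Hypothetical helper: build an absolute SQLite URL from HN_SCRAPER_HOME,
# falling back to the current directory when the variable is not set.
def database_url(env=os.environ):
    home = env.get('HN_SCRAPER_HOME', os.getcwd())
    return 'sqlite:///' + home.rstrip('/') + '/database.db'

print(database_url({'HN_SCRAPER_HOME': '/home/ubuntu'}))
# → sqlite:////home/ubuntu/database.db
```

The four slashes are intentional: `sqlite:///` plus an absolute path starting with `/` is how SQLAlchemy addresses an absolute SQLite file.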
```python
engine = create_engine('sqlite:///' + os.environ['HN_SCRAPER_HOME'] + '/database.db')
Base = declarative_base()
```
Now that we have our database set up, let's create the tables for tracking Hacker News submissions.
Story_Info will act as a reference table containing each submission's title, URL, submitting user, version and fetch date. The version number is kept because a story's title or link is often modified; when that happens, we save a new version of the submission, leaving the previous one intact.
Story_Tracker will be the fatty table here, where we keep track of each story's rank, comment count and points at every interval. Along with that, we store whether the story was seen on the new-submissions page or the front page. This will allow us to analyze how many stories make it to the front page, at what speed, and so on.
```python
class Story_Info(Base):
    __tablename__ = 'info'
    id = Column(Integer, primary_key=True)
    title = Column(String)
    url = Column(String)
    user = Column(String)
    version = Column(Integer, primary_key=True)
    fetch_date = Column(DateTime, default=datetime.datetime.utcnow)

    def __repr__(self):
        return "<Story_Info(id='%d',title='%s',url='%s',user='%s',version='%d')>" % (
            self.id, self.title, self.url, self.user, self.version)


class Story_Tracker(Base):
    __tablename__ = 'tracker'
    id = Column(Integer, ForeignKey('info.id'), primary_key=True)
    score = Column(Integer)
    rank = Column(Integer)
    front_page = Column(Integer)
    fetch_date = Column(DateTime, primary_key=True, default=datetime.datetime.utcnow)

    def __repr__(self):
        return "<Story_Tracker(id='%d',score='%s',rank='%s')>" % (
            self.id, self.score, self.rank)


# Create the tables and a session for the parser below to use
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()
```
Now comes the fun part: the HTML parser itself. The script is contained within a fetch_stories function which accepts two parameters — a url from which to scrape submissions, and front_page, indicating whether the stories at that link should be considered as being on the front page. I will not walk through the rest line by line, but I would gladly answer questions if anyone is curious 😉
```python
def fetch_stories(url, front_page):
    html = requests.get(url)
    parser = BeautifulSoup(html.text, 'html.parser')
    story_links = parser.select("tr.athing")
    for story in story_links:
        # The story's metadata (score, user) lives in the next table row
        info = story.find_next_sibling("tr")
        id = story.get("id")
        rank = story.select_one(".rank").text.rstrip(".")  # "1." -> "1"
        score = info.select_one(".score")
        score = score.text.split(" ")[0] if score else 0
        user = info.select_one(".hnuser")
        user = user.text if user else ""
        story_link = story.select_one(".storylink")
        url = story_link.get("href")
        title = story_link.text

        story_info = session.query(Story_Info).order_by(
            Story_Info.version.desc()).filter_by(id=int(id)).first()
        if story_info is None:
            # First time we see this story
            session.add(Story_Info(id=int(id), title=title, url=url,
                                   user=user, version=1))
        elif title != story_info.title or url != story_info.url or user != story_info.user:
            # Title, link or user changed - keep the old row and save a new version
            session.add(Story_Info(id=int(id), title=title, url=url,
                                   user=user, version=story_info.version + 1))

        session.add(Story_Tracker(id=int(id), score=int(score), rank=int(rank),
                                  front_page=front_page))
        session.commit()


fetch_stories(url="https://news.ycombinator.com/", front_page=1)
fetch_stories(url="https://news.ycombinator.com/newest", front_page=0)
```
Full code is available at the HN Scraper Analysis GitHub repository.
Running the scraper 🖥️
Now that we have the code for the scraper, we need to set it up on a server and leave it running. I did it via a free-tier EC2 t2.micro instance, which should suffice for our needs.
Once the EC2 instance is ready, transfer the files to the server as such:
```shell
scp -i key.pem /path/to/parser.py email@example.com:~
```
Then SSH into the instance and set up the environment as such:
```shell
wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
bash Anaconda3-4.2.0-Linux-x86_64.sh
export PATH=~/anaconda3/bin:$PATH
```
The above will install Anaconda which is a truly awesome open source distribution of Python that contains close to anything you may need for big-data processing.
Finally, we need to set up the crontab so that it runs our script at a given interval — in this case, every 2 minutes. Note that since our script builds an absolute path from an environment variable, we need to declare that variable in the crontab, as seen below. Also, we must make sure to use Anaconda's Python and not the default Python found on Ubuntu systems, which is easily done by specifying the full path to the python executable. Finally, we redirect any output and errors to a parser-log.log file, which will make debugging easier if anything nasty comes up.
```shell
HN_SCRAPER_HOME=/home/ubuntu/
*/2 * * * * /home/ubuntu/anaconda3/bin/python /home/ubuntu/parser.py >> /home/ubuntu/parser-log.log 2>&1
```
Once our scraper has run, the data will become available for inspection in the database.db file under HN_SCRAPER_HOME.
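Once rows start accumulating, a quick way to peek at the database is the stdlib sqlite3 module. The sketch below uses an in-memory stand-in with the same tracker schema as above (on the server you would point the connection at the database.db file instead):

```python
import sqlite3

# In-memory stand-in for database.db; on the server, connect to
# os.environ['HN_SCRAPER_HOME'] + '/database.db' instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracker (id INTEGER, score INTEGER, rank INTEGER, "
             "front_page INTEGER, fetch_date TEXT)")
conn.executemany("INSERT INTO tracker VALUES (?, ?, ?, ?, ?)", [
    (123, 42, 1, 1, "2017-01-01 12:00"),
    (123, 57, 1, 1, "2017-01-01 12:02"),
    (456, 3, 28, 0, "2017-01-01 12:02"),
])

# Highest score observed per story so far
rows = conn.execute(
    "SELECT id, MAX(score) FROM tracker GROUP BY id ORDER BY id").fetchall()
print(rows)  # → [(123, 57), (456, 3)]
```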
One gotcha that I encountered is that fresh Ubuntu instances often fail to run crontab jobs, reporting that no MTA is installed. If you see that happening, execute sudo apt-get install postfix and, during installation, choose the local-only option.
Will the database handle the data?
One question that worried me is whether the EC2 instance or the SQLite database will be able to handle the amount of data that we are going to scrape. SQLite has a theoretical limit of 140TB which is obviously going to be more than enough. On the other hand, AWS t2.micro instance has only 1GB of memory available which may be problematic.
To see if our EC2 instance will be big enough, let's first consider how many fetches of data we will perform. Since there are 86,400 minutes in 60 days, and we fetch new data every 2 minutes, we will execute 43,200 scrapes of HN in total. Fortunately for us, each fetch only carries a measly 4 KB of data, meaning that all 43,200 fetches will take only 43,200 × 4 = 172,800 KB, which is just under 173 MB. So we are safe! 🙌
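The back-of-the-envelope arithmetic above can be double-checked in a few lines (the 4 KB per fetch is, of course, only a rough estimate):

```python
# Sanity-check the storage estimate above.
days = 60
minutes = days * 24 * 60       # 86,400 minutes in 60 days
fetches = minutes // 2         # one fetch every 2 minutes
kb_per_fetch = 4               # rough page size, per the estimate above
total_kb = fetches * kb_per_fetch

print(minutes, fetches, total_kb)  # → 86400 43200 172800
```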
Now we should have our scraper bot running successfully. All that is left for now is to leave it alone to do its job and come back in 60 days with the data ready to analyze. Meanwhile, enjoy life! 🌻