I have previously written about my Glicko ratings of competitive Heroes of the Storm (HotS) teams, and how I put that information into a SQLite database. Up to this point though, I have been entering all of this information manually. In order to do the kind of analyses that I want to do, I really need to have access to a lot of data, and I really do not want to put all of into my database by hand.

I was also looking for an excuse to learn how to scrape websites for information and this seemed like the perfect opportunity to do it.

Enter Scrapy

I had a few requirements for my web scraper, although as usual the most important one was that the software is free and open source. I was looking for something in Python so that I would not have to learn a new programming language on top of a new tool. I also needed a scraper that could crawl the website for me so that I would not have to manually input every webpage.

The two big python packages in the web scraper world are BeautifulSoup and Scrapy (apologies to anyone working on something else). BeautifulSoup has a lot of built in tools to parse a webpage, but you have to know the webpages you want to scrape in advance. In the short term, this probably would have been a lot easier to use. Scrapy allows you to advance through pages automatically and is overall a far more robust tool than BeautifulSoup.

First Steps

There is a wonderful tutorial on using Scrapy on their website. I use Continuum's Anaconda Python, so the installation was as simple as
conda install scrapy. You might have to reboot to have scrapy added to your
PATH if you are using windows, but creating a Scrapy project is as simple
as

scrapy startproject hots

A Scrapy project starts out with the following structure

hots/
    scrapy.cfg          # configuration file
    hots/
        __init__.py
        items.py        # where you define your Scrapy items
        pipelines.py    # pipelines file, to put data in SQLite database
        settings.py     # settings file
        spiders/        # directory where spiders eventually go
            __init__.py

Picking a random match, here's what the initial spider looks like.

import scrapy 

class HotsSpider(scrapy.Spider):
    # Basic spider information
    name = "hots"  # does not need to match class name
    allowed_domains = ["masterleague.net"]  # variable name describes it well
    start_urls = [
        'http://masterleague.net/match/1918/'
    ]

    def parse(self):
        yield {
            'match_date': response.xpath('/html/body/div/ol/li[6]/text()').extract_first()
        } 

All the spider is returning right now is the match date, which is a great first
start but there is a lot more to do. Scrapy can collect information using two
different methods: Xpath and CSS. While CSS can work fairly well, Scrapy's own
website believes that XPath is a far more powerful way of identifying content on
a webpage. The quick overview is that all you need to do to identify the xpath
is to open the website in chrome, right click on the data you want to extract,
select inspect element, right click on it in the source code, and click on copy
XPath. For a better overview of how to extract data using XPath, check Google.
The extract_first() pulls out the first element of the array it returns, but
can handle empty data.

Next, we will add an item to items.py to help with data processing.

import scrapy

class HotsItem(scrapy.Item):
    # Basic match info
    match_id = scrapy.Field()
    match_date = scrapy.Field()
    map_name = scrapy.Field()
    home_team = scrapy.Field()
    ...

You should have a Field for every piece of data you want. I found it useful
to sketch out which fields I wanted first. If you want to see all of the
fields I actually ended up collecting, check out my github project. We need
to adapt the spider now to use the items.

import scrapy
from hots.items import HotsItem

class HotsSpider(scrapy.Spider):
    # Basic spider information
    name = "hots"
    allowed_domains = ["masterleague.net"]
    start_urls = [
        'http://masterleague.net/match/1918/'
    ]

    def parse(self, response):
        item = HotsItem()
        item['match_date'] = response.xpath('/html/body/div/ol/li[6]/text()').extract_first()

        for match in response.xpath('//*[@id="draft"]')
            # Scraping basic match information
            item['map_name'] = match.xpath('div[5]/div[2]/span/text()').extract_first()
            item['home_team'] = match.xpath('div[1]/div[1]/div/div[1]/h2/a/text()').extract_first()
            item['away_team'] = match.xpath('div[1]/div[2]/div/div[2]/h2/a/text()').extract_first()
...
        yield item

If you are unfamiliar with yield in Python, it is similar to return except
that it returns a "generator". A generator is an interable that can only be iterated
over once. If you want more information, check out this stack overflow page

Advancing Pages

The standard way of advancing pages in Scrapy is to identify some button like
'Next Page' through which you can get the url of the next page. Once there is
no more button, you end the scraper. Unfortunately, Master League
only links to the next game in a given series, but not does not link between
series. However, Master League increases the number of each match by 1,
which makes it easy to advance to the next page.

class HotsSpider(scrapy.Spider):
    # Basic spider information
    name = "hots"
    allowed_domains = ["masterleague.net"]
    start_urls = [
        'http://masterleague.net/match/1918/'
    ]

    def parse(self, response):
        item = HotsItem()
        ...
        ...
        prev_id = int(''.join(list(filter(str.isdigit, response.request.url))))
        item['match_id'] = prev_id
        next_page_url = response.request.url[:-5] + str(prev_id+1) + '/' # I will regret this
        yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

I was a little lazy though, for two reasons. The first is that I just replaced the
last five characters with the next match ID. Eventually they will hit match 10,000
I will have to fix the code. Second, the code ends whenever there is a 404 error,
which I never bothered to handle. It runs fine despite that though.

Adding Statistics

While the above information allowed me to create my Glicko ratings, I
eventually wanted to be able to do something like Moneyball for the
HotS competitive scene. Luckily, while I was working on this project,
Master League added statistics to their page. The statistics are loaded
using javascript when you click on the statistics tab. Interestingly,
they are just loaded from a webpage with the same URL as the match with
/stats/ added on. To get the data, I check whether the statistics tabs
exist and then use a second function to parse the data on the statistics
page.

import scrapy                    # The basis of everything
import re                        # Want this for extracting game num from url
from hots.items import HotsItem  # The item I created earlier


class HotsSpider(scrapy.Spider):
    # Basic spider information
    name = "hots"
    allowed_domains = ["masterleague.net"]
    start_urls = [
        'http://masterleague.net/match/1918/'
    ]

    def parse(self, response):
        item = HotsItem()

        stat_check = response.xpath('/html/body/div/ul/li[3]/a/text()').extract_first()
        if stat_check == 'Statistics':
            request = scrapy.Request(response.request.url + 'stats/', callback=self.parseStats)
            request.meta['item'] = item
            yield request

The callback is now parseStats, at the end of which I go to the next match.
In order to pass the item with the basic information already in it, the item
is passed through request.meta['item'].

Pipelines

Most of what I learned about pipelines is from a
blog post I found using
Google. In pipelines.py, I created two class to store Items in the database
called HotsPipeline and SQLiteStorePipeline. I also converted the match date to
a more appropriate format and find the number of weeks since the first
match. Most important is to make sure the __init__ and __del__ are set
up so that the connection to the database terminates. Since only one process
can access a SQLite database at a time, this is crucial to being a good person.

import sqlite3 as lite
from datetime import datetime

con = None  # db connection

class HotsPipeline(object):
    def process_item(self, item, spider):
        return item

class SQLiteStorePipeline(object):
    filename = 'HotSCompetitive.sqlite'

    def __init__(self):
        self.setupDBcon()
        self.createMatchTable

    def process_item(self, item, spider):
        self.storeInDBShort(item)
        return item

    def storeInDBShort(self, item):
        # Stores information if there is no replay available to get stats
        # Only uses some of the fields in the tables

        # Calculating the week and adjusting the date for an
        # appropriate format
        self.cur.execute("SELECT date FROM match_basic ORDER BY ROWID ASC LIMIT 1")
        m_date1_str = self.cur.fetchall()[0][0]  # should always have one entry
        m_date1_obj = datetime.strptime(m_date1_str, '%m/%d/%Y')
        m_date2_oj = datetime.strptime(item['match_date'], '%b. %d, %Y')

        # Finding the number of weeks difference and adding one
        # for player ratings
        delta = m_date2_obj - m_date1_obj
        weeks = int(delta.days / 7) + 1  # Remember int rounds down

        # Converting the match date to a correct string
        m_date2_str = datetime.strftime(m_date2_obj, '%m/%d/%Y')

        # Inserting the item into the database
        self.cur.execute("INSERT INTO match_basic(\
        match_id, \
        ...
        away_ban2 \
        ) \
        VALUES( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )",
        (
        item['match_id'],
        m_date2_str,
        ...
        item['away_ban2']
        ))
        self.con.commit()

    def setupDBcon(self):
        self.con = lite.connect('HotSCompetitive.sqlite')
        self.cur = self.con.cursor()

    def __del__(self):
        self.closeDB()

    def createMatchTable(self):
        self.cur.execute("CREATE TABLE IF NOT EXISTS match_basic(\
        match_id INTEGER NOT NULL, \
        ...
        away_ban2 TEXT \
        PRIMARY KEY(match_id) \
        )")
        self.cur.execute("CREATE TABLE IF NOT EXISTS `player_details`(\
        `match_id` INTEGER NOT NULL, \
        ...
        `exp` INTEGER, \
    	PRIMARY KEY(`match_id`) \
        )")

    def closeDB(self):
        self.con.close() # definitely need to make sure conn closed

The only thing missing is that the first row of match_basic has to
exist already in order to calculate the date. While an if statement
could handle this, entering one row into a database I used to take care
of by hand is perfectly fine.

Next Steps

The code so far handles the vast majority of what I wanted it to, but
there are still some small additions left for quality of life
enhancements. The most important is Deltafetch as described on
Scrapy's blog.
The middleware keeps a list of the sites you have crawled so you do not
crawl them twice. This is great for both you and the site you are crawling.

Finally, make you adhere to the outlines in this post
on scraping politely. You don't want to have your IP address banned for
looking like a hacker, nor do you want to accidentally DDOS a website.
If you have a front-facing application, also be careful about any SQL
injection vulnerabilities. Since this will just be hosted on Github,
I was not particularly worried, but it is something you should always
consider.