Web scraping with Python 2.7 (1/2)

Published on 15 May 2014

A lot of companies expose their data to (almost) everyone, mainly through APIs. A « few » years back it wasn’t that common and was mainly done by IT companies that wanted to make their data available so that new apps could emerge from it. Nowadays, it has become a must-have for a lot of companies, (former) startups and even governments to propose this kind of access to their data.

However, when there isn’t a turnkey API, what choice is left? Web scraping!

 

General principle of web scraping

Web scraping describes the process of extracting content from a web page in order to reuse it for further processing or calculation. Why « Scraping »? Because we have to scrape away everything that is not the actual content we want to extract. The process requires knowing the structure of the page you’re scraping: how the content is organized, what the element/id/class of the item you want to extract is, etc.

One would think that extracting text surrounded by tags is easy; well, it’s not. Unfortunately, HTML isn’t that easy to deal with…at least not in just a few lines of code. To understand why scraping is not that simple, I will first show an example of how to extract all the links from a page using only basic string operations. Then I’ll do the same thing with the (excellent) BeautifulSoup library.

 

Scraping the old-school way

So I want to write code that is able to tell the useful parts apart from the page structure. To begin with, I have to get the page’s source code; for this purpose, the urllib Python library will do the trick:

import urllib

def get_page_content(url):
    # Fetch the raw HTML of the page as a single string
    try:
        return urllib.urlopen(url).read()
    except IOError:
        return "Error"


res = get_page_content("http://en.wikipedia.org/wiki/Beautiful_Soup")
print res

First step is done: we have our web page in a string. Awesome! I can now easily extract all links from it…or can I…

I assume that all links start with « <a » and end with « </a> », and that the address is contained within those tags, just after « href= »:

def extract_links(page):
    all_links = []
    start_str = "<a"
    end_str = "</a>"
    while start_str in page:
        # Isolate the next <a>...</a> block
        start_position = page.find(start_str)
        end_position = page.find(end_str, start_position)
        link = page[start_position:end_position]
        # Keep whatever sits between href=" and the closing quote
        href_pos = link.find('href="') + len('href="')
        all_links.append(link[href_pos:link.find('"', href_pos)])
        # Move past this link and keep going
        page = page[end_position:]
    return all_links


res = get_page_content("http://en.wikipedia.org/wiki/Beautiful_Soup")
print res
links = extract_links(res)
print links

And here are a few of the extracted links:

['=',
 '#mw-navigation',
 '#p-search',
 '/wiki/Software_design',
 '/w/index.php?title=Leonard_Richardson&amp;action=edit&amp;redlink=1',
 'http://en.wikipedia.org/w/index.php?title=Beautiful_Soup&amp;oldid=589228809',
 '#', '/wiki/Main_Page', '/wiki/Main_Page', '/wiki/Portal:Contents',...]

Clearly, this is not what was expected.

Several problems show up with the links on this page:

  • Empty link
  • Anchors
  • Relative addresses

So I added a check on the address format:

def extract_links(page):
    all_links = []
    start_str = "<a"
    end_str = "</a>"
    while start_str in page:
        start_position = page.find(start_str)
        end_position = page.find(end_str, start_position)
        link = page[start_position:end_position]
        href_pos = link.find('href="') + len('href="')
        href_content = link[href_pos:link.find('"', href_pos)]
        # Link format check: skip anchors and the stray "=" match,
        # and turn relative addresses into absolute ones
        if href_content[0] != "#" and href_content != "=":
            if href_content[0] == "/":
                all_links.append("http://en.wikipedia.org" + href_content)
            else:
                all_links.append(href_content)
        page = page[end_position:]
    return all_links

OK, let’s face it: the function is not that pretty and not very adaptable. Soup time!

 

Scraping with tools (i.e. BeautifulSoup)

BeautifulSoup is a powerful library for manipulating DOM content. It’s rather straightforward to use, as you can look up page elements by their tag type, class or id. See the BeautifulSoup documentation. I recommend installing BeautifulSoup from pip, which is (normally) the easy way.
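As a quick illustration, here is a minimal sketch of those three kinds of lookups, run on a made-up HTML fragment (the tag names, class and id below are purely illustrative):

from BeautifulSoup import BeautifulSoup

# Made-up fragment, only to illustrate the lookups
html = """
<div id="main">
  <h1 class="title">Beautiful Soup</h1>
  <p class="summary">An HTML/XML parsing library.</p>
</div>
"""

soup = BeautifulSoup(html)
print soup.find('h1')                       # by tag type
print soup.find('p', {'class': 'summary'})  # by class
print soup.find('div', {'id': 'main'})      # by id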

Back to business. I still want to extract all the links from my page, which is now much easier. The process is more or less the same: get all the links, check whether they have an « href » attribute, get the value of that attribute and append it to the result list:

import urllib
from BeautifulSoup import BeautifulSoup

def get_page_content(url):
    # Fetch the raw HTML of the page as a single string
    try:
        return urllib.urlopen(url).read()
    except IOError:
        return "Error"

def extract_links(page):
    all_links = []
    root_url = "http://en.wikipedia.org"
    # Parse the page and decode HTML entities (&amp; -> &, etc.)
    soup = BeautifulSoup(page, convertEntities=BeautifulSoup.HTML_ENTITIES)
    links = soup.findAll('a')
    for url in links:
        if url.has_key('href'):
            all_links.append(root_url + url['href'])
    return all_links


res = get_page_content("http://en.wikipedia.org/wiki/Beautiful_Soup")
print res
links = extract_links(res)
print links

And that’s it. Where in the first example the main loop was 17 lines long, here it is only 3 lines, and much clearer.
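One possible refinement, not in the script above: instead of blindly prepending the root URL (which mangles links that are already absolute), the standard library’s urlparse.urljoin can resolve relative addresses against the URL the page was fetched from. A sketch, where the function name and the skipping of anchors are my own choices:

from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

def extract_absolute_links(page, page_url):
    # Resolve every href against the URL the page came from
    all_links = []
    soup = BeautifulSoup(page, convertEntities=BeautifulSoup.HTML_ENTITIES)
    for a in soup.findAll('a'):
        if a.has_key('href') and not a['href'].startswith('#'):
            # urljoin leaves absolute URLs untouched and resolves relative ones
            all_links.append(urljoin(page_url, a['href']))
    return all_links


links = extract_absolute_links(res, "http://en.wikipedia.org/wiki/Beautiful_Soup")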

A few BeautifulSoup examples

Finding an element whose id starts with a given prefix (this relies on a regular expression, so the re module has to be imported):

soup.find('span', {'id': re.compile(r"^beginningOfMyID")})

Finding an element’s parent:

tag = soup.find('img', {'src' : re.compile('images/myImg.png')}).findParent()

Extracting an element from the page (this removes the tag from the tree, so you don’t have to keep track of start/end positions yourself; the page is edited in the process):

tag = soup.find('h1').extract()
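To make the « edits the page » part concrete, here is a small sketch on a made-up fragment; after extract(), the tag lives on its own and is gone from the soup:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("<div><h1>Title</h1><p>Body text</p></div>")

tag = soup.find('h1')
tag.extract()

print tag   # <h1>Title</h1>
print soup  # <div><p>Body text</p></div> : the h1 is no longer in the tree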