Fun with Online Data - A Static Screen Scrape
One can get lost in the overwhelming wealth of data that the Internet has to offer... hours can be spent digging through available open data portals, and even more hours spent exploring ways to combine these sources to create meaningful output. In our last post we gave a quick overview of the needs and ethics behind performing a screen or web scrape to get data for use in analyses. Check out our To Scrape or Not to Scrape article.
This time, we'll look into performing a screen or web scrape of a static government site. The site we'll scrape is the data catalog at http://en.openei.org/datasets/dataset, provided by the National Renewable Energy Laboratory (NREL), a national laboratory of the U.S. Department of Energy.
In this tutorial we'll use Python, an interpreted language with a remarkable set of libraries that provides support for many different activities. For this exercise, we'll incorporate three libraries:
urllib2, a library supporting web access,
BeautifulSoup, an HTML access library that allows for loose parsing and searching, and
Scrapy, a more abstracted web scraping library
In reviewing the http://en.openei.org/datasets/dataset site, we see the robots.txt file indicates that access is granted to the area that we wish to scrape. In order to make the exercise a little more interesting, we’ll filter the search by sector, and that sector will be solar. Now, I must point out that the search results indicate that the data could be accessed through an API (“You can also access this registry using the API (see API Docs).”), but I find that the links do not provide any information. The first takes you to the API version number (in JSON) and the second goes to a GitHub “Permission Denied” page.
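If you'd like to confirm that programmatically rather than reading robots.txt by hand, Python 2's standard-library robotparser module can do the check. This is just a sketch and wasn't part of the original workflow:

# Optional sketch: confirm robots.txt permits the area we want to scrape.
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://en.openei.org/robots.txt")
rp.read()
# Prints True if the path is allowed for all user agents.
print rp.can_fetch("*", "http://en.openei.org/datasets/dataset?sectors=solar")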
On to the scraping!
The first step in the process is to use the site. I began by clicking on the Solar (224) entry in the Sectors filter box. I watched what happened to the URL as I did this; it went from http://en.openei.org/datasets/dataset to http://en.openei.org/datasets/dataset?sectors=solar, which seemed reasonable. I then scrolled down the page and noted the pagination navigation bar at the end of the results.
Clicking on the link to page 2 gave the following URL: http://en.openei.org/datasets/dataset?sectors=solar&page=2. Clicking back to page 1 gave http://en.openei.org/datasets/dataset?sectors=solar&page=1. I then entered a number greater than the last page shown in the pagination navigation; it showed 12 and I entered 13. A page with no results was returned. This means that our parser can either extract the maximum page from the navigation bar, or it can simply request increasing page numbers until no results are returned. For this particular example, I chose the former.
The next step is to look at the source for the results page. We want to extract the list of datasets; the pertinent items are the dataset title, the URL, the description, and the available formats. Typically we'd push the data into a database, but for this example we'll simply save it to a file on the file system. Looking at the source for the results page, we see well-formatted, clean markup, which makes scraping much easier. Paging down, we find the start of the results on line 257 (Chrome source view) with the following list start:
<ul class="dataset-list unstyled">
This is immediately followed by the actual dataset result:
<li class="dataset-item">
<div class="dataset-content">
<h3 class="dataset-heading">
<a href="/datasets/dataset/solar-radiation-monitoring-station-sorms-humboldt-state-university-arcata-california-data">Solar Radiation Monitoring Station (SoRMS): Humboldt State University, Arcata...</a>
<!-- Snippet snippets/popular.html start -->
<!-- Snippet snippets/popular.html end -->
</h3>
<div>A partnership with HSU and U.S. Department of Energy's National Renewable Energy Laboratory (NREL) to collect solar data to support future solar power generation in the United...</div>
</div>
<ul class="dataset-resources unstyled">
<li>
<a href="/datasets/dataset/solar-radiation-monitoring-station-sorms-humboldt-state-university-arcata-california-data" class="label" data-format="html">HTML</a>
</li>
</ul>
</li>
Our targets in this markup are the first anchor tag's text (title), that same anchor's href (URL), the inner div (description), and the anchor tag text in the inner resource list (formats). At this point, I find it useful to fire up my Python interpreter and start to interact with the page. First, I do the needed imports (urllib2 and BeautifulSoup to start) and then I load the page into a variable:
html = urllib2.urlopen("http://en.openei.org/datasets/dataset?sectors=solar&page=1")
Then I parse it through BeautifulSoup:
soup = BeautifulSoup(html, "html.parser")
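With the soup made, we can start to explore the document interactively. A few quick probes in the interpreter (a sketch; the selectors simply mirror the markup shown above) confirm where each of our targets lives:

first = soup.find_all('li', {'class': 'dataset-item'})[0]   # first dataset in the results
print first.find_all('a')[0].text                           # dataset title
print first.find_all('a')[0].get('href')                    # relative URL
print first.select('div')[0].select('div')[0].text          # description
print soup.select('.pagination')[0].find_all('a')[-2].text  # highest page number in the nav bar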
After that exploration is complete, I grab the working pieces and put the code together:
# Python 2
from bs4 import BeautifulSoup
import codecs
import urllib2

# Output file; codecs handles the UTF-8 encoding for us.
f = codecs.open('OpenEI.txt', 'w', 'utf-8')

pages = 1
page = 0
while page < pages:
    # Fetch and parse the next page of solar-sector results.
    html = urllib2.urlopen("http://en.openei.org/datasets/dataset?sectors=solar&page={}".format(page + 1))
    soup = BeautifulSoup(html, "html.parser")

    # The next-to-last anchor in the pagination bar holds the highest page number.
    try:
        pages = int(soup.select('.pagination')[0].find_all('a')[-2].text)
    except (IndexError, ValueError):
        pages = 1

    # Each dataset in the results is a <li class="dataset-item">.
    data_sets = soup.find_all('li', {'class': 'dataset-item'})
    for data_set in data_sets:
        hrefs = data_set.find_all('a')
        downloads = len(hrefs)
        formats_list = []
        title = hrefs[0].text                 # first anchor text is the title
        url = hrefs[0].get('href')            # and its href is the dataset URL
        description = data_set.select('div')[0].select('div')[0].text
        for download in range(1, downloads):  # the remaining anchors are the formats
            formats_list.append(hrefs[download].text)
        formats = u', '.join(formats_list)
        print u'{}, "{}", "{}", "{}","{}"'.format(page, title, url, description, formats)
        f.write(u'{}, "{}", "{}", "{}","{}"\n'.format(page, title, url, description, formats))
    page += 1

f.close()
This is the complete code to fetch the pages, parse them, and extract the desired information. As with all code that interacts with the outside world, there were some challenges to overcome. The one that I often stumble over is unicode. Python 3 defaults to unicode throughout, but Python 2 does not, so one must be explicit. Since UTF-8 is one of the more popular encodings on the web, we have to handle unicode processing. In the above code, I was having a terrible time with the line:
print u'{}, "{}", "{}", "{}","{}"'.format(page, title, url, description, formats)
What I had initially coded was:
print '{}, "{}", "{}", "{}","{}"'.format(page, title, url, description, formats)
This instructs Python to put the page, title, url, description, and formats into the string to be printed. The latter four are all unicode strings, so there is an implicit encoding from unicode to ASCII, which works until a character is encountered that has no ASCII representation, at which point the code fails. Adding the u prefix makes the format string itself unicode, so no encoding is needed and the code succeeds.
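A quick illustration of the difference in Python 2 (a sketch; the title value here is hypothetical):

title = u'Estaci\xf3n Solar'    # hypothetical value containing a non-ASCII character
s = '{}'.format(title)          # fails: the implicit ascii encode raises UnicodeEncodeError
s = u'{}'.format(title)         # succeeds: the result stays unicode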
The use of codecs to open the output file is another consequence of working with unicode.
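The reason, roughly, is that a plain Python 2 file object expects byte strings, so writing unicode with non-ASCII characters triggers the same implicit ASCII encode; the codecs wrapper encodes to UTF-8 on write. A small sketch (the file names and text are hypothetical):

import codecs

text = u'Solaranlage M\xfcnchen'                      # hypothetical non-ASCII content
open('plain.txt', 'w').write(text)                    # fails: implicit ascii encode raises UnicodeEncodeError
codecs.open('encoded.txt', 'w', 'utf-8').write(text)  # succeeds: encoded to UTF-8 transparently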
You'll notice that I parse the total number of pages on every pass through the loop. This is one of those choices where every option feels a little awkward, and it results from the decision to parse out the total pages rather than request pages until one is returned that contains no datasets. The code above makes one less request to the server, at the cost of the extra parsing. I considered using a "page number parsed" flag so the count is only parsed once, but decided against the clutter.
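For comparison, the approach I passed over would look roughly like this sketch (not the code used above): keep requesting successive pages until one comes back with no dataset items.

# Alternative sketch: request pages until an empty one is returned,
# rather than parsing the pagination bar for the total page count.
from bs4 import BeautifulSoup
import urllib2

page = 1
while True:
    html = urllib2.urlopen(
        "http://en.openei.org/datasets/dataset?sectors=solar&page={}".format(page))
    soup = BeautifulSoup(html, "html.parser")
    data_sets = soup.find_all('li', {'class': 'dataset-item'})
    if not data_sets:
        break            # the empty page is the one extra request mentioned above
    # ... extract title, url, description, and formats as before ...
    page += 1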
The last item to mention is the manner in which the available formats are extracted from the page. Initially I simply concatenated the text strings together with commas to create the formats entry. The problem was that I ended up with a trailing comma. I added code to trim it off the string, but then had problems with a dataset entry that had no available formats. I took a step back and thought about a more Pythonic way to code it. str.join() to the rescue! It takes an iterable and joins its elements using the string:
u', '.join(['a', 'b', 'c', 'd']) results in u'a, b, c, d'
No concerns about trailing commas! Also note that we are using a unicode string to perform the join.
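It also takes care of the dataset entry with no available formats, since joining an empty list simply yields an empty string:

u', '.join([]) results in u''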
Next time we'll look at scraping a dynamic site. Stay tuned for our follow-on tutorial, or subscribe to our monthly TechTalk newsletter below to have updates sent directly to your email inbox.
neXus Data Solutions is a Software Development Company in Anchorage, Alaska. Bob is our in-house tech lead with over 25 years of experience in software design and development. He is neXus Data Solutions' Employee Number One and helps drive our technical solutions using an innovative yet practical approach.