This section demonstrates uses of more advanced PiCloud functionality.
Consider building a scraper to extract the prices of every laptop model found by Google shopping. The goal is to write a single file listing the price and model of the first 1,800 laptops returned.
The initial script may look like the following:
import time
import urllib2
import re
import cloud

max_start = 1800
increment_start = 10
URL = "http://www.google.com/products?q=laptop&start=%d" #search query

#regular expression to match model, price:
matcher = re.compile("<li class=\"result\".*?title=\"(.*?)\".*?<span class=\"main-price\">\$([0-9.,]*)</span>.*?</li>", flags = re.DOTALL | re.IGNORECASE)

def savePrices():
    f = open("prices.txt","w")
    f.write('price, item\n')
    for start in range(0,max_start,increment_start):
        matches = scrape(URL % start) #generate url and scrape
        if matches:
            for match in matches:
                f.write("$%s, %s\n"% (match[1], match[0])) #write $price, model
    f.close()

def scrape(url):
    site = urllib2.urlopen(url)
    contents = site.read()
    site.close()
    matches = matcher.findall(contents)
    return matches

if __name__ == "__main__":
    savePrices()
The output looks something like this:
price, item
$393, HP Mini 311-1000NR - Atom 1.6 GHz - 11.6 " - 1 GB Ram - 160 GB HDD
$642, Dell Latitude E5400 - Core 2 Duo 2 GHz - 14.1 " - 2 GB Ram - 160 GB HDD
$920, Apple MacBook - Core 2 Duo 2.26 GHz - 13.3 " - 2 GB Ram - 250 GB HDD
$400, Acer Aspire 1410-2954 - Celeron M 1.3 GHz - 11.6 " - 2 GB Ram - 250 GB HDD
...
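To see how the regular expression produces these lines without touching the network, the following minimal sketch runs the same pattern against a small HTML fragment. The fragment is hypothetical markup invented for illustration, not real Google output, and the snippet is written for Python 3 with the pattern as a raw string.

```python
import re

# Same pattern as the matcher in the script, written as a raw string.
matcher = re.compile(
    r'<li class="result".*?title="(.*?)".*?'
    r'<span class="main-price">\$([0-9.,]*)</span>.*?</li>',
    flags=re.DOTALL | re.IGNORECASE)

# Hypothetical markup, invented for illustration only.
sample = (
    '<li class="result"><a title="HP Mini 311-1000NR" href="#">link</a>\n'
    '<span class="main-price">$393</span></li>\n'
    '<li class="result"><a title="Dell Latitude E5400" href="#">link</a>\n'
    '<span class="main-price">$642</span></li>\n'
)

# findall returns one (model, price) tuple per matched <li> element.
matches = matcher.findall(sample)

# Format each tuple the same way savePrices does: "$price, model".
lines = ["$%s, %s" % (price, model) for model, price in matches]
print("\n".join(lines))
```

Each tuple holds the model first and the price second, which is why the script writes match[1] before match[0].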
Due to the serial nature of this script, it will take about five and a half minutes to complete on a typical home Internet connection.
To speed it up, let us try to use PiCloud. We want to cloud.call() the scraping code, so that PiCloud makes the URL requests and matches the regular expressions. This code assumes that api_key and api_secretkey have already been set in cloudconf.py.
Let us modify savePrices as follows:
import cloud

def savePrices():
    f = open("prices.txt","w")
    f.write('price, item\n')
    for start in range(0,max_start,increment_start):
        jid = cloud.call(scrape, URL % start) #Tell PiCloud to run scrape(URL % start)
        matches = cloud.result(jid) #Block and get result of the scrape job
        if matches:
            for match in matches:
                f.write("$%s, %s\n"% (match[1], match[0])) #write $price, model
    f.close()
This code works, but it uses PiCloud incorrectly. In particular, it will take about 13 minutes to complete, because the code fails to leverage PiCloud's parallelism or low latency. Only one scrape occurs at a time, and each one incurs substantial additional overhead (relative to the initial script) from communicating with PiCloud.
As seen in the earlier Examples, the correct way to use PiCloud is to send it multiple jobs simultaneously. Blocking functions such as cloud.result() and cloud.join() should not be called until their results are actually needed. We can further use cloud.map() in lieu of multiple cloud.call() invocations, and cloud.iresult() to iterate through the results in order as they become available. Note that list comprehensions make the map transformation especially intuitive.
The result of this modification is:
def savePrices():
    f = open("prices.txt","w")
    f.write('price, item\n')
    #map scrape onto list of urls which are generated with a list comprehension
    jids = cloud.map(scrape, [URL % start for start in range(0,max_start,increment_start)])
    #write file as jobs complete
    for matches in cloud.iresult(jids):
        if matches:
            for match in matches:
                f.write("$%s, %s\n"% (match[1], match[0])) #write $price, model
    f.close()
This version makes good use of PiCloud. With 40 parallel compute units available, the code runs in about 20 seconds.
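The difference between blocking after every job and submitting all jobs before collecting results can be reproduced locally. The sketch below is not PiCloud code: it uses Python 3's standard concurrent.futures thread pool as a stand-in, and fake_scrape is a hypothetical function that sleeps instead of fetching a page.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_scrape(start):
    """Hypothetical stand-in for scrape(): sleeps instead of fetching a page."""
    time.sleep(0.05)
    return [("model-%d" % start, "100")]

starts = list(range(0, 80, 10))  # 8 simulated result pages

with ThreadPoolExecutor(max_workers=8) as pool:
    # Anti-pattern: block on each job immediately after submitting it,
    # so only one job is ever in flight.
    t0 = time.perf_counter()
    serial = [pool.submit(fake_scrape, s).result() for s in starts]
    serial_elapsed = time.perf_counter() - t0

    # Preferred: submit every job first, then collect results in order.
    t0 = time.perf_counter()
    futures = [pool.submit(fake_scrape, s) for s in starts]
    parallel = [f.result() for f in futures]
    parallel_elapsed = time.perf_counter() - t0

print("blocking per job: %.2fs, submit then collect: %.2fs"
      % (serial_elapsed, parallel_elapsed))
```

Both variants return the same results in the same order; only the second lets the jobs overlap.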
PiCloud imposes significant per-job overhead (about 400 ms). Because you may not obtain 100% parallelism, it makes sense to do more work per job. In particular, scrape takes only about two seconds to execute, so the overhead is a noticeable fraction of each job.
Consequently, we will chunk the cloud.map() jobs. Chunking means merging several jobs into a single one; within that single job, the builtin map function executes what would otherwise be separate jobs. An example should make chunking clearer:
def savePrices():
    f = open("prices.txt","w")
    f.write('price, item\n')
    chunksize = 5 #6 or higher is suboptimal
    #outer list is a list of inner lists. Each inner list has chunksize elements.
    #The output list looks like [[URL%0, URL%10, ... URL%40], [URL%50, ... URL%90], ... [... URL%1790]]
    # for a chunksize of 5, increment_start of 10 and max_start of 1800.
    #The inner range must step by increment_start so that each URL is a distinct result page.
    chunked_arguments = [[URL % j for j in range(i, i+chunksize*increment_start, increment_start)]
                         for i in range(0, max_start, chunksize*increment_start)]
    #each chunk_scrape function will receive a list (inner list) of chunksize URLS to map scrape to
    jids = cloud.map(chunk_scrape, chunked_arguments)
    for outer_matches in cloud.iresult(jids): #Get result of all jobs. This is a list of list of results
        for matches in outer_matches:
            #write file as jobs complete
            if matches:
                for match in matches:
                    f.write("$%s, %s\n"% (match[1], match[0])) #write $price, model
    f.close()

def chunk_scrape(args):
    """Map scrape to arguments (args) on the server"""
    return map(scrape, args)
With this code, five times fewer jobs are created (36 instead of 180). However, each job does five times as much work (fetching five result pages rather than one). Consequently, total overhead is reduced, allowing the scraping to complete a bit faster, in about 18 seconds.
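The grouping step can be checked on its own, without PiCloud. The following is a minimal sketch in which chunk_list is a hypothetical helper (not part of the cloud library) that splits the list of start offsets into groups of five.

```python
def chunk_list(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

max_start = 1800
increment_start = 10

# One start offset per result page: 0, 10, ..., 1790 (180 pages in total).
starts = list(range(0, max_start, increment_start))

chunks = chunk_list(starts, 5)
print(len(chunks))   # number of jobs: 36
print(chunks[0])     # [0, 10, 20, 30, 40]
print(chunks[-1])    # [1750, 1760, 1770, 1780, 1790]
```

Each inner list corresponds to the arguments one chunked job would receive once the offsets are substituted into URL.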
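Note that there are 180 total scrapes. If we expect to be able to run 40 jobs in parallel, the chunksize should be less than or equal to 5: a chunksize of 5 yields 36 jobs, which can all run at once, whereas a larger chunksize yields fewer, longer jobs and leaves more compute units idle.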
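A rough back-of-the-envelope model makes this choice concrete. The sketch below assumes the figures quoted in this section (180 scrapes, 40 compute units, about 0.4 s of overhead per job, about 2 s per scrape) and further assumes that jobs run in full "waves" of up to 40 at a time. That wave model is a simplification introduced here for illustration; it ignores submission latency and throttling, so its absolute numbers are lower than the measured run times.

```python
import math

TOTAL_SCRAPES = 180   # result pages to fetch
UNITS = 40            # compute units assumed available in parallel
OVERHEAD = 0.4        # approximate per-job overhead in seconds
SCRAPE_TIME = 2.0     # approximate seconds per scrape

def estimate(chunksize):
    """Return (jobs, waves, idealized seconds) for a given chunksize."""
    jobs = math.ceil(TOTAL_SCRAPES / chunksize)
    waves = math.ceil(jobs / UNITS)  # rounds of up to UNITS jobs each
    seconds = waves * (OVERHEAD + chunksize * SCRAPE_TIME)
    return jobs, waves, seconds

for chunksize in range(1, 9):
    jobs, waves, seconds = estimate(chunksize)
    print("chunksize=%d jobs=%3d waves=%d estimate=%.1fs"
          % (chunksize, jobs, waves, seconds))

# Chunksize with the lowest idealized estimate under these assumptions.
best = min(range(1, 9), key=lambda c: estimate(c)[2])
print("best chunksize:", best)
```

Under these assumptions a chunksize of 5 is the smallest that fits every job into a single wave, while 6 or more only lengthens each job.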
Chunking quick jobs is critical to exploiting PiCloud efficiently.
Warning
Running this test may occasionally produce sub-par results because Google aggressively throttles request rates.