This section provides an introduction to writing “practical” PiCloud applications.
One of the most important uses of modern computers (certainly more important than surfing Facebook or YouTube) is calculating pi. Since PiCloud is named for this wonderful constant, our first example is a program that leverages PiCloud to “estimate” pi.
Consider the above image of a circle inscribed within a square. Note that the gray region (the upper right quadrant) holds exactly a quarter of both the inscribed circle and the square. If we set the radius of the circle to one unit, the square has sides of two units, so the area of the gray quadrant will be one, and the area of the part of the circle within the gray region will be a quarter of pi.
Now we are going to simulate the tossing of millions of infinitely small darts at the upper right quadrant of a square, as shown in the below image. We expect the ratio of the darts within the circle (the red) to all darts tossed to be equal to the ratio of the areas – a quarter of pi.
Incidentally, using such random sampling to calculate a result is known as a Monte Carlo method.
The initial script may look like:
import random

numTests = 500000000

def monteCarlo(num_test):
    """
    Throw num_test darts at a square
    Return how many appear within the quarter circle
    """
    numInCircle = 0
    for _ in xrange(num_test):
        x = random.random()
        y = random.random()
        if x*x + y*y < 1.0: #within the quarter circle
            numInCircle += 1
    return numInCircle

def calcPi():
    numInCircle = monteCarlo(numTests) #uses the module-level numTests
    pi = (4 * numInCircle) / float(numTests)
    return pi

if __name__ == '__main__':
    pi = calcPi()
    print 'Pi determined to be %s' % pi
Simulating the tossing of five hundred million darts will take about five minutes and produces a result accurate to 5 significant figures (with some high probability). Note that actually tossing this many darts would take significantly longer.
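The accuracy figure can be checked with a little arithmetic. Each dart lands inside the quarter circle with probability p = pi/4, so the number of hits is binomially distributed and the estimate 4 * hits / n has a standard error of 4 * sqrt(p * (1 - p) / n). The sketch below evaluates that formula. It is written for Python 3, runs locally without PiCloud, and the helper name pi_standard_error is ours rather than part of any library.

```python
import math

def pi_standard_error(num_darts):
    """Standard error of the estimate 4 * hits / num_darts.

    Each dart is an independent trial that lands inside the quarter
    circle with probability p = pi / 4.
    """
    p = math.pi / 4
    return 4 * math.sqrt(p * (1 - p) / num_darts)

if __name__ == '__main__':
    for n in (10**4, 10**6, 5 * 10**8):
        print('%11d darts -> standard error about %.1e'
              % (n, pi_standard_error(n)))
```

For five hundred million darts the standard error comes out at roughly 7e-5, which is consistent with a result that usually agrees with pi to four or five significant figures. Because the error shrinks only with the square root of the number of darts, each additional digit costs about a hundred times more work.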
To speed it up, let us use PiCloud. We want to run monteCarlo through cloud.call(), so that PiCloud performs the randomized dart tosses. This code assumes that api_key and api_secretkey have already been set in cloudconf.py. We also set _high_cpu to True, as calculating pi consists entirely of CPU-bound operations (if, however, we “calculated” pi by downloading the constant from a research website, that would be an I/O-bound operation and _high_cpu would be kept False).
Let us modify calcPi as follows:
import cloud

def calcPi():
    jid = cloud.call(monteCarlo, numTests, _high_cpu=True) #Tell PiCloud to run monteCarlo(numTests) -> return jid (job identifier)
    numInCircle = cloud.result(jid) #Block until Job (monteCarlo) is done and return result
    pi = (4 * numInCircle) / float(numTests)
    return pi
This code works, but it uses PiCloud incorrectly: it submits a single job and then immediately blocks on its result, so it will still take about five minutes to complete. When working with computationally intensive work, it is important to leverage PiCloud’s parallelism.
The correct way to use PiCloud is to send it multiple jobs simultaneously. It is critical that blocking functions such as cloud.result() and cloud.join() are not used until they must be.
To do this, issue all of the cloud.call() invocations first, and only then wait for results. A simple modification follows:
import cloud

def calcPi():
    num_parallel = 8
    testsPerCall = numTests // num_parallel
    jids = []
    for _ in range(num_parallel):
        jids.append( cloud.call(monteCarlo, testsPerCall, _high_cpu=True) ) #invoking call in parallel; remember, it does not block
    numInCircle = 0
    for jid in jids:
        numInCircle += cloud.result(jid) #Block until result of Job specified by jid is ready
    pi = (4 * numInCircle) / float(numTests)
    return pi
With 8 calls running in parallel, this takes just over a minute to execute.
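The submit-everything-then-collect pattern can be tried locally without a PiCloud account. The sketch below is a Python 3 analogue that uses the standard library's concurrent.futures in place of cloud.call() and cloud.result(). It is a stand-in for illustration only, not PiCloud's API, and the per-task seed argument is our addition so that each task draws a different dart stream and the run is repeatable.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def monte_carlo(num_test, seed):
    """Throw num_test darts; return how many land in the quarter circle."""
    rng = random.Random(seed)  # private generator, so tasks share no state
    hits = 0
    for _ in range(num_test):
        x = rng.random()
        y = rng.random()
        if x * x + y * y < 1.0:
            hits += 1
    return hits

def calc_pi(num_tests=200000, num_parallel=8):
    tests_per_call = num_tests // num_parallel
    with ThreadPoolExecutor(max_workers=num_parallel) as pool:
        # Phase 1: submit every task before waiting on any of them.
        futures = [pool.submit(monte_carlo, tests_per_call, seed)
                   for seed in range(num_parallel)]
        # Phase 2: only now block, collecting each result in turn.
        hits = sum(future.result() for future in futures)
    return 4.0 * hits / (tests_per_call * num_parallel)

if __name__ == '__main__':
    print('Pi determined to be %s' % calc_pi())
```

Threads in CPython do not speed up CPU-bound loops like this one, so the sketch shows only the ordering of submission and blocking, not the speedup PiCloud provides by running jobs on separate machines. Calling future.result() inside the submission loop would serialize the work, which is the same mistake as blocking on cloud.result() too early.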
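Two easy changes reduce network overhead: replace the loop of cloud.call() invocations with a single cloud.map(), and pass the whole list of job identifiers to one cloud.result() call instead of calling it once per job.
The result of these modifications is: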
import cloud

def calcPi():
    num_parallel = 8
    testsPerCall = numTests // num_parallel
    jids = cloud.map(monteCarlo, [testsPerCall]*num_parallel, _high_cpu=True) #argument list has 8 duplicate elements
    numInCircleList = cloud.result(jids) #get list of counts
    numInCircle = sum(numInCircleList) #add the list together
    pi = (4 * numInCircle) / float(numTests)
    return pi
We are now making good use of PiCloud. This code takes just under a minute to execute.
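The examples divide numTests by num_parallel, which works out exactly for 500,000,000 darts and 8 jobs. When the total is not a multiple of the number of jobs, integer division drops the remainder, so fewer darts are thrown than the final formula divides by and the estimate is biased slightly low. A minimal sketch of one way to avoid this follows; split_evenly is a hypothetical helper of ours, written for Python 3.

```python
def split_evenly(total, parts):
    """Split total into `parts` non-negative integers that sum to total.

    The first (total % parts) chunks receive one extra item, so chunk
    sizes differ by at most one.
    """
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]

if __name__ == '__main__':
    print(split_evenly(500000000, 8))  # eight equal chunks
    print(split_evenly(10, 4))         # sizes differ by at most one
```

The resulting list could be passed to cloud.map() in place of [testsPerCall]*num_parallel, leaving the rest of calcPi unchanged.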
Using cloud.call() alone may be inappropriate if the target function needs to work with significant amounts of data. cloud.files can be used to transfer data to and from PiCloud directly.
We are going to use the Python Imaging Library for a rather simple task: generating thumbnails of all JPEG images in the working directory:
import Image
import glob, os
from cStringIO import StringIO

sizes = [(128,128), (64,64), (32, 32)] #thumbnail sizes to generate

def resizeImage(img_name):
    """Resize Image and write thumbnails to disk
    Returns a list of thumbnail names written"""
    thumb_names = []
    name_start, ext = os.path.splitext(img_name)
    img_seekable = open(img_name, 'rb') #leave file open; binary mode for image data
    for size in sizes:
        img_seekable.seek(0)
        img = Image.open(img_seekable) #open image
        img.thumbnail(size, Image.ANTIALIAS) #thumbnail image
        thumb_name = name_start+'.thumbnail.'+str(size[0])+'.jpg'
        img.save(thumb_name, "JPEG") #Write image to file
        thumb_names.append(thumb_name)
    img_seekable.close()
    return thumb_names

def main():
    """
    Entry point
    """
    thumb_names_list = map(resizeImage, glob.glob("*.jpg")) #write thumbnails for all .jpg in the current directory
    thumb_names = reduce(lambda x,y: x+y, thumb_names_list, []) #merge lists of thumbnails; [] handles the no-images case
    print 'Generated thumbnails: ' + str(thumb_names)

if __name__ == '__main__':
    main()
Unlike the previous example, thumbnail generation is less trivial to run on PiCloud. The problem is that PiCloud does not have access to your local file system. Consequently, a resizeImage running on the cloud cannot open your local files.
The naive way around this problem is to pass Image objects directly to and from PiCloud. Unfortunately, this requires substantial changes and results in inefficient code:
import Image
import glob, os
from cStringIO import StringIO
import cloud

sizes = [(128,128), (64,64), (32, 32)] #thumbnail sizes to generate

def resizeImage(base_img):
    """Accepts an image - returns resized images"""
    thumb_imgs = []
    for size in sizes:
        img = base_img.copy() #duplicate image to manipulate it
        img.thumbnail(size, Image.ANTIALIAS) #thumbnail image
        thumb_imgs.append(img)
    return thumb_imgs

def main():
    """
    Entry point
    """
    img_names = glob.glob("*.jpg")
    base_images = map(Image.open, img_names) #load all images
    jids = cloud.map(resizeImage, base_images) #Send them to PiCloud for resizing
    thumb_imgs_list = cloud.result(jids) #Get thumbnails from PiCloud
    thumb_names = []
    for img_name, thumbs in zip(img_names, thumb_imgs_list): #pair by file name rather than relying on img.fp
        name_start, ext = os.path.splitext(img_name)
        for size, thumb in zip(sizes, thumbs): #save each thumbnail
            thumb_name = name_start+'.thumbnail.'+str(size[0])+'.jpg'
            thumb.save(thumb_name, "JPEG")
            thumb_names.append(thumb_name)
    print 'Generated thumbnails: ' + str(thumb_names)

if __name__ == '__main__':
    main()
In particular, if we ever wish to make small modifications to this code, the source images will be unnecessarily retransmitted to PiCloud. We also download all resultant thumbnails, even when it may not be necessary. For projects with large data, this solution is quite poor.
The ideal solution is to use cloud.files as a shared file system.
First, run a script locally, once, to send the images to PiCloud:
import cloud, glob

if __name__ == '__main__':
    #Push all *.jpg in local directory to cloud
    image_list = glob.glob("*.jpg")
    for img_file in image_list:
        cloud.files.put(img_file) #push file to cloud
Now with the images on PiCloud, we can manipulate them:
import Image
import glob, os
from cStringIO import StringIO
import cloud

sizes = [(128,128), (64,64), (32, 32)] #thumbnail sizes to generate

def resizeImage(img_name):
    """Return a list of thumbnail names. The thumbnails can be retrieved with cloud.files.get"""
    thumb_names = []
    imgf = cloud.files.getf(img_name) #request image from cloud as a file object
    img_seekable = StringIO(imgf.read()) #CloudFile object cannot seek(), so it is buffered into memory
    name_start, ext = os.path.splitext(img_name)
    for size in sizes:
        img_seekable.seek(0)
        img = Image.open(img_seekable) #open image
        img.thumbnail(size, Image.ANTIALIAS) #thumbnail image
        outFile = StringIO()
        img.save(outFile, "JPEG") #Write image to memory
        outFile.seek(0)
        thumb_name = name_start+'.thumbnail.'+str(size[0])+'.jpg'
        cloud.files.putf(outFile, thumb_name) #Write thumbnail file to Cloud
        thumb_names.append(thumb_name)
    return thumb_names

def main():
    """
    Entry point
    Assume all *.jpg in local directory have already been pushed to cloud
    """
    jids = cloud.map(resizeImage, glob.glob("*.jpg"))
    thumb_names_list = cloud.result(jids) #a list of lists of thumbnail names
    thumb_names = reduce(lambda x,y: x+y, thumb_names_list, []) #merge lists; [] handles the no-images case
    print 'Generated thumbnails: ' + str(thumb_names)
    #pull files from cloud
    for thumb_name in thumb_names:
        cloud.files.get(thumb_name, thumb_name) #write file locally

if __name__ == '__main__':
    main()
Note how much closer this code is to the original than the naive version was. resizeImage runs on PiCloud and uses cloud.files to retrieve the source image. It then pushes the generated thumbnails back into cloud.files. As it returns the names of the new thumbnails, the client can, in main, retrieve them through cloud.files.
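The buffering step in resizeImage deserves a closer look: the image is opened once per thumbnail size, which requires rewinding, yet the file object returned from the cloud cannot seek. A minimal Python 3 sketch of the same idea follows, using io.BytesIO where the Python 2 code uses cStringIO.StringIO. ForwardOnlyStream is a made-up stand-in for any file-like object that supports read() but not seek(); it is not PiCloud's CloudFile class.

```python
import io

class ForwardOnlyStream(object):
    """Hypothetical stand-in for a file-like object that cannot seek."""

    def __init__(self, data):
        self._source = io.BytesIO(data)

    def read(self, size=-1):
        return self._source.read(size)

def buffer_stream(stream):
    """Read the whole stream into memory and return a seekable copy."""
    return io.BytesIO(stream.read())

if __name__ == '__main__':
    remote = ForwardOnlyStream(b'pretend these are JPEG bytes')
    seekable = buffer_stream(remote)
    first_pass = seekable.read()
    seekable.seek(0)  # rewind, as resizeImage does before each size
    second_pass = seekable.read()
    print(first_pass == second_pass)
```

The trade-off is that the entire file is held in memory. That is reasonable for photographs being thumbnailed, but very large files would call for a different approach.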
Note that with such a system the client need not transmit the source image every time resizeImage is run. This optimization is even more important when many jobs are processed on the same data.
With cloud.files.exists() or cloud.files.list(), we can push new data automatically. The following code at the start of main will accomplish this:
cloud_files = cloud.files.list() #list all files on PiCloud
img_files = glob.glob("*.jpg")
for img_file in img_files:
    if img_file not in cloud_files:
        cloud.files.put(img_file) #push file not already on PiCloud
jids = cloud.map(resizeImage, img_files)
#continue as usual
Note that this code cannot detect changed data; you will need to manage that yourself.
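One way to manage changed data yourself is to keep a small local manifest of content hashes and push any file whose hash differs from the one recorded at its last push. The sketch below is written for Python 3 and uses only the standard library. The manifest file, its JSON layout, and the helper names are all assumptions made for illustration; none of them are part of PiCloud.

```python
import hashlib
import json
import os

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def load_manifest(manifest_path):
    """Load the name -> digest mapping, or an empty one if none exists."""
    if not os.path.exists(manifest_path):
        return {}
    with open(manifest_path) as f:
        return json.load(f)

def files_to_push(paths, manifest_path):
    """Return the paths that are new or changed since their last push."""
    manifest = load_manifest(manifest_path)
    return [p for p in paths if manifest.get(p) != file_digest(p)]

def record_pushed(paths, manifest_path):
    """Record current digests for paths after a successful push."""
    manifest = load_manifest(manifest_path)
    for p in paths:
        manifest[p] = file_digest(p)
    with open(manifest_path, 'w') as f:
        json.dump(manifest, f)
```

In main, one would call cloud.files.put() on each path returned by files_to_push and then pass those same paths to record_pushed. Note that the manifest only reflects pushes made from this machine; files changed on PiCloud by other clients would not be noticed.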