Examples

This section provides an introduction to writing “practical” PiCloud applications.

Calculating Pi

One of the most important uses of modern computers (certainly more important than surfing Facebook or YouTube) is calculating pi. Since PiCloud is named for this wonderful constant, our first example is a program that leverages PiCloud to “estimate” pi.

_images/basic_example_area.png

Consider the above image of a circle inscribed within a square. Note that the gray region (the upper right quadrant) holds exactly a quarter of both the inscribed circle and the square. If we set the radius of the circle equal to one unit, the area of the quadrant will be one, and the area of the inscribed circle within the gray area will be a quarter of pi.

Now we are going to simulate the tossing of millions of infinitely small darts at the upper right quadrant of the square, as shown in the image below. We expect the ratio of darts landing within the circle (shown in red) to all darts tossed to equal the ratio of the areas: a quarter of pi.

_images/basic_example_monte.gif

Incidentally, using such random sampling to calculate a result is known as a Monte Carlo method.

Initial script

The initial script may look like:

import random
numTests = 500000000

def monteCarlo(num_test):
  """
  Throw num_test darts at a square
  Return how many appear within the quarter circle
  """
  numInCircle = 0
  for _ in xrange(num_test):
    x = random.random()
    y = random.random()
    if x*x + y*y < 1.0:  #within the quarter circle
      numInCircle += 1
  return numInCircle


def calcPi():
  numInCircle = monteCarlo(numTests)   #uses the module-level numTests defined above
  pi = (4 * numInCircle) / float(numTests)
  return pi

if __name__ == '__main__':
  pi = calcPi()
  print 'Pi determined to be %s' % pi

Simulating the tossing of five hundred million darts will take about five minutes and produces a result accurate to about 5 significant figures (with some high probability). Note that actually tossing this many darts would take significantly longer.
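As a rough check on that accuracy, the sampling error can be estimated directly. Each dart is an independent trial that lands inside the quarter circle with probability pi/4, so the standard error of the estimate 4 * hits / n is 4 * sqrt(p * (1 - p) / n). The sketch below is written for Python 3; standard_error is a name introduced here for illustration and is not part of the original script.

```python
import math

def standard_error(num_tests):
    """Approximate one-sigma error of the Monte Carlo estimate of pi."""
    p = math.pi / 4.0                      # probability a dart lands in the quarter circle
    return 4.0 * math.sqrt(p * (1.0 - p) / num_tests)

if __name__ == '__main__':
    for n in (10**4, 10**6, 500000000):
        print('%d darts -> standard error about %.1e' % (n, standard_error(n)))
```

For five hundred million darts this works out to roughly 7e-05. The error shrinks only with the square root of the number of darts, so each additional decimal digit costs about a hundred times more darts.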

Incorrect

To speed it up, let us use PiCloud. We want to cloud.call() the monteCarlo code, so that PiCloud runs the randomized dart tosses. This code assumes that api_key and api_secretkey have already been set in cloudconf.py. We also set _high_cpu to True, as calculating pi consists entirely of CPU-bound operations (if, however, we “calculated” pi by downloading the constant from a research website, that would be an I/O-bound operation and _high_cpu would be kept False).

Let us modify calcPi as follows:

import cloud
def calcPi():
  jid = cloud.call(monteCarlo,numTests, _high_cpu=True)  #Tell PiCloud to run monteCarlo(numTests) -> return jid (job identifier)
  numInCircle = cloud.result(jid)                        #Block until Job (monteCarlo) is done and return result
  pi = (4 * numInCircle) / float(numTests)
  return pi

This code works, but it uses PiCloud incorrectly. In particular, it will still take about five minutes to complete, because the entire computation runs as a single job. For computationally intensive work, it is important to leverage PiCloud’s parallelism.

Almost Correct

The correct way to use PiCloud is to send it multiple jobs simultaneously. It is critical that blocking functions such as cloud.result() and cloud.join() are not used until they must be.

To do this, issue all of the cloud.call() requests first, and only then collect the results. A simple modification follows:

import cloud
def calcPi():
  num_parallel = 8
  testsPerCall = numTests/num_parallel
  jids = []
  for _ in range(num_parallel):
    jids.append( cloud.call(monteCarlo,testsPerCall,_high_cpu=True) )   #invoking call in parallel; remember, it does not block
  numInCircle = 0
  for jid in jids:
    numInCircle += cloud.result(jid)                                    #Block until result of Job specified by jid is ready
  pi = (4 * numInCircle) / float(numTests)
  return pi

With 8 calls running in parallel, this takes just over a minute to execute.
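The same "submit everything first, block afterwards" shape can be tried out locally without a PiCloud account. The sketch below is only an analogy: it uses Python 3's standard concurrent.futures module as a stand-in, where submit() plays the role of the non-blocking call and result() the role of the blocking fetch. The names monte_carlo and calc_pi_local, the per-job seed argument, and the small dart count are assumptions made for this illustration. Threads in CPython do not speed up CPU-bound work, so this shows the control flow, not the speedup.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def monte_carlo(num_test, seed):
    """Throw num_test darts; return how many land in the quarter circle."""
    rng = random.Random(seed)              # separate generator per job
    hits = 0
    for _ in range(num_test):
        x = rng.random()
        y = rng.random()
        if x * x + y * y < 1.0:
            hits += 1
    return hits

def calc_pi_local(num_tests=80000, num_parallel=8):
    tests_per_call = num_tests // num_parallel
    with ThreadPoolExecutor(max_workers=num_parallel) as pool:
        # Submit every job before waiting on any of them (non-blocking).
        futures = [pool.submit(monte_carlo, tests_per_call, seed)
                   for seed in range(num_parallel)]
        # Only now block, collecting each job's count in turn.
        num_in_circle = sum(f.result() for f in futures)
    return 4.0 * num_in_circle / (tests_per_call * num_parallel)

if __name__ == '__main__':
    print('Pi determined to be %s' % calc_pi_local())
```

Calling result() inside the submitting loop instead would serialize the jobs again, which is the mistake the previous version made.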

Correct

Two operations can be easily done to reduce network overhead.

  1. cloud.result() can take a list of jids. Rather than making 8 result requests, we can make a single one for 8 results.
  2. Every single cloud.call() has the same function argument. Rather than making 8 cloud.call() requests, a single cloud.map() can be used with an 8-element argument list. The map instruction will cause 8 monteCarlo functions to run in parallel (one for each list element).

The result of this modification is:

import cloud
def calcPi():
  num_parallel = 8
  testsPerCall = numTests/num_parallel
  jids = cloud.map(monteCarlo,[testsPerCall]*num_parallel, _high_cpu=True)  #argument list has 8 duplicate elements
  numInCircleList = cloud.result(jids) #get list of counts
  numInCircle = sum(numInCircleList)   #add the list together
  pi = (4 * numInCircle) / float(numTests)
  return pi

We are now making good use of PiCloud. This code takes just under a minute to execute.
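One detail to watch when adapting this code: numTests/num_parallel is an integer division, so if numTests were not a multiple of num_parallel, slightly fewer darts would be thrown than the numTests used in the final division. With the values used here (500000000 and 8) the split is exact. A minimal sketch of a remainder-safe split follows; split_work is a hypothetical helper introduced for illustration, not part of the cloud module.

```python
def split_work(total, parts):
    """Split `total` items into `parts` chunks whose sizes differ by at most one."""
    base, extra = divmod(total, parts)
    # The first `extra` chunks each take one additional item.
    return [base + 1 if i < extra else base for i in range(parts)]

if __name__ == '__main__':
    print(split_work(10, 3))
    print(sum(split_work(500000000, 8)))
```

The resulting list could be passed as the argument list in place of [testsPerCall]*num_parallel, so that the counts returned always correspond to exactly numTests darts.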

Cloud Files

Using cloud.call() alone may be inappropriate if the target function needs to work with significant amounts of data. cloud.files can be used to transfer data to and from PiCloud directly.

Initial script

We are going to use the Python Imaging Library for a rather simple task: generating thumbnails of all JPEG images in the working directory:

import Image
import glob, os
from cStringIO import StringIO

sizes = [(128,128), (64,64), (32, 32)]  #thumbnail sizes to generate

def resizeImage(img_name):
  """Resize Image and write thumbnails to disk
  Returns a list of thumbnail names written"""
  thumb_names = []
  name_start, ext = os.path.splitext(img_name)
  img_seekable = open(img_name, 'rb')   #open in binary mode; leave file open
  for size in sizes:
    img_seekable.seek(0)
    img = Image.open(img_seekable)      #open image
    img.thumbnail(size, Image.ANTIALIAS)#thumbnail image
    thumb_name = name_start+'.thumbnail.'+str(size[0])+'.jpg'
    img.save(thumb_name,"JPEG")         #Write image to file
    thumb_names.append(thumb_name)
  img_seekable.close()
  return thumb_names

def main():
  """
  Entry point
  """
  thumb_names_list = map(resizeImage,glob.glob("*.jpg"))  #write thumbnails for all .jpg in current directory
  thumb_names = reduce(lambda x,y: x+y, thumb_names_list, []) #merge lists of thumbnails; [] handles the no-image case
  print 'Generated thumbnails: ' + str(thumb_names)

if __name__ == '__main__':
  main()

Incorrect

Unlike the previous example, thumbnail generation is less trivial to run on PiCloud. The problem is that PiCloud does not have access to your local file system. Consequently, a resizeImage running on the cloud cannot open your local files.

The naive way around this problem is to pass Image objects directly to and from PiCloud. Unfortunately, the substantial changes result in inefficient code:

import Image
import glob, os
from cStringIO import StringIO

sizes = [(128,128), (64,64), (32, 32)]  #thumbnail sizes to generate

import cloud

def resizeImage(base_img):
  """Accepts an image - returns resized images"""
  thumb_imgs = []
  for size in sizes:
    img = base_img.copy()               #duplicate image to manipulate it
    img.thumbnail(size, Image.ANTIALIAS)#thumbnail image
    thumb_imgs.append(img)

  return thumb_imgs

def main():
  """
  Entry point
  """
  base_images = map(Image.open, glob.glob("*.jpg"))     #load all images
  jids = cloud.map(resizeImage, base_images)            #Send them to PiCloud for resizing
  thumb_imgs_list = cloud.result(jids)                  #Get thumbnails from PiCloud

  thumb_names = []
  for img, thumbs in zip(base_images,thumb_imgs_list):
    img_name = img.fp.name
    name_start, ext = os.path.splitext(img_name)

    for size, thumb in zip(sizes, thumbs):              #save each thumbnail
      thumb_name = name_start+'.thumbnail.'+str(size[0])+'.jpg'
      thumb.save(thumb_name, "JPEG")
      thumb_names.append(thumb_name)

  print 'Generated thumbnails: ' + str(thumb_names)

if __name__ == '__main__':
  main()

In particular, if we ever wish to make small modifications to this code, the source images will be unnecessarily retransmitted to PiCloud. We also download all resultant thumbnails, even when it may not be necessary. For projects with large data, this solution is quite poor.

Correct

The ideal solution is to use cloud.files as a shared file system.

First, run a one-time script locally to send the images to PiCloud:

import cloud, glob

if __name__ == '__main__':
  """
  Push all *.jpg in local directory to cloud
  """
  image_list = glob.glob("*.jpg")
  for img_file in image_list:
    cloud.files.put(img_file) #push file to cloud

Now with the images on PiCloud, we can manipulate them:

import Image
import glob, os
from cStringIO import StringIO

sizes = [(128,128), (64,64), (32, 32)]  #thumbnail sizes to generate

import cloud

def resizeImage(img_name):
  """Return a list of thumbnail names. The thumbnails can be retrieved with cloud.files.get"""
  thumb_names = []
  imgf = cloud.files.getf(img_name)     #request image from cloud as a file object
  img_seekable = StringIO(imgf.read())  #CloudFile object cannot seek(), so it is buffered into memory
  name_start, ext = os.path.splitext(img_name)
  for size in sizes:
    img_seekable.seek(0)
    img = Image.open(img_seekable)      #open image
    img.thumbnail(size, Image.ANTIALIAS)#thumbnail image
    outFile = StringIO()
    img.save(outFile,"JPEG")            #Write image to memory
    outFile.seek(0)
    thumb_name = name_start+'.thumbnail.'+str(size[0])+'.jpg'
    cloud.files.putf(outFile,thumb_name)#Write thumbnail file to Cloud
    thumb_names.append(thumb_name)
  return thumb_names

def main():
  """
  Entry point
  Assume all *.jpg in local directory have already been pushed to cloud
  """
  jids = cloud.map(resizeImage,glob.glob("*.jpg"))
  thumb_names_list = cloud.result(jids)         #a list of lists of thumbnails
  thumb_names = reduce(lambda x,y: x+y, thumb_names_list, [])   #merge lists; [] handles the no-image case
  print 'Generated thumbnails: ' + str(thumb_names)

  #pull files from cloud
  for thumb_name in thumb_names:
    cloud.files.get(thumb_name, thumb_name) #write file locally

if __name__ == '__main__':
  main()

Note that this code stays much closer to the original script than the “incorrect” version does. resizeImage runs on PiCloud and uses cloud.files to retrieve the source image. It then pushes the generated thumbnails back into cloud.files. Because it returns the names of the new thumbnails, the client can retrieve them through cloud.files in main.

Note that with such a system the client need not transmit the source image every time resizeImage is run. This optimization is even more important when many jobs are processed on the same data.
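The buffering step in resizeImage (reading a file object that cannot seek() into an in-memory buffer so it can be rewound for each thumbnail size) can be illustrated on its own. The sketch below is written for Python 3 and uses io.BytesIO where the script above uses cStringIO. ForwardOnlyStream is a hypothetical stand-in for any read-only stream; it is not the PiCloud file object, and the bytes are placeholder data rather than a real image.

```python
import io

class ForwardOnlyStream(object):
    """Hypothetical stand-in for a file-like object that offers read() but no seek()."""
    def __init__(self, data):
        self._source = io.BytesIO(data)

    def read(self, size=-1):
        return self._source.read(size)

def buffer_stream(stream):
    """Read the whole stream into memory and return a seekable copy."""
    return io.BytesIO(stream.read())

if __name__ == '__main__':
    seekable = buffer_stream(ForwardOnlyStream(b'placeholder image bytes'))
    for _ in range(3):                 # one pass per thumbnail size
        seekable.seek(0)               # rewind before every pass
        print(len(seekable.read()))
```

The trade-off is that the entire file is held in memory at once, which is reasonable for typical images but may not be for very large files.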

Even Better

With cloud.files.exists() or cloud.files.list(), we can push new data automatically. The following code at the start of main will accomplish this:

cloud_files = cloud.files.list()        #list all files on PiCloud
img_files = glob.glob("*.jpg")
for img_file in img_files:
  if img_file not in cloud_files:
    cloud.files.put(img_file)           #push file not already on PiCloud

jids = cloud.map(resizeImage,img_files)
#continue as usual

Note that this code cannot detect changed data; you will need to manage that yourself.
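One possible way to manage changed data yourself is to keep a small local manifest of content hashes and re-upload any file whose hash differs from the recorded one. The sketch below, written for Python 3, is only one approach under stated assumptions: the manifest is a local JSON file, and file_digest, files_needing_upload, and save_manifest are hypothetical helpers introduced here. It makes no calls to PiCloud.

```python
import hashlib
import json
import os

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def files_needing_upload(paths, manifest_path):
    """Return (stale_paths, new_manifest).

    A path is stale if it is absent from the saved manifest
    or its current digest differs from the recorded one.
    """
    old = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            old = json.load(f)
    new = dict((p, file_digest(p)) for p in paths)
    stale = [p for p in paths if old.get(p) != new[p]]
    return stale, new

def save_manifest(manifest, manifest_path):
    """Record the digests of the files that are now up to date."""
    with open(manifest_path, 'w') as f:
        json.dump(manifest, f)
```

In main, only the stale paths would then be pushed with cloud.files.put, with save_manifest called once the uploads have succeeded. Note that the manifest reflects what this machine last uploaded, so it would not notice files changed on PiCloud by another client.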
