Technical Overview

This system overview assumes that you have familiarized yourself with PiCloud by reading through the Client Basics and Examples.

General Overview

PiCloud consists of both a cluster of servers (the cloud) and a Python package that allows clients to interoperate with those servers.

Client

The PiCloud client, cloud, is a Python package which allows users of PiCloud to run almost any functions defined within Python on PiCloud’s cloud. It has a variety of tasks, including passing commands to and from PiCloud’s service, (de)serializing data, caching results, introspecting code, and providing an additional debugging interface. More details are described below.

Server

PiCloud’s server architecture consists of a cluster of highly scalable components, including databases, load balancers, distributed file systems and “worker machines”. These components work together to deliver PiCloud’s job execution, monitoring, and auto-scaling platform. PiCloud abstracts all of these away, so you can run code on the cloud without worrying about the details of server management.

Worker

Each job (a function and its arguments) is sent to PiCloud and evaluated on a “worker”. The worker is an Amazon EC2 virtualized instance running a 64 bit version of Ubuntu 10.04 LTS.

A job is run by executing it within a Python interpreter. Python 2.x functions are run within the CPython2.6 interpreter and Python 3.x functions are run within the CPython 3.1 interpreter. A substantial amount of Python packages are installed on the worker and are pre-loaded before jobs are executed: the list may be found here. Packages not pre-installed are handled by the mechanisms described below.

Note

On PiCloud, there is no ‘__main__’. The entry point is the start of a function passed in through cloud.call or cloud.map.

Workers are multi-tenant in that multiple users’ jobs are run simultaneously, sandboxed within their own interpreter. To optimize job startup time, your jobs will sometimes run in the same interpreter as your previous jobs. If you change your source code (client-side), all of your held-open interpreters will shut down, ensuring that jobs are run under the correct code.

Additional security is provided via POSIX permissions and AppArmor . Available CPU power is split amongst the different jobs, see the high cpu option for more detail. All jobs are guaranteed at least their requested CPU power (by default, 1 Amazon EC2 compute unit, equivalent to to a 1-1.2 GHz 2007 Xeon), but may receive more if resources are available.

Scheduling

Most of the time, there will be space available on worker machines to run your job immediately after its creation. However, when there is not, the job will be placed in our queue. As the queue gets larger, PiCloud adds additional workers machines to handle the load; this auto-scaling allows more of your jobs to run in parallel. Be aware that this is a common queue that holds all jobs submitted by all users; consequently your jobs wait not only on your own jobs to complete but also those of other users.

If you need to guarantee immediate execution, you can purchase Real-Time compute units by the hour. With N Real-Time units, you will always be able to run at least N compute units of jobs in parallel. Real-Time units effectively reserve workers for you. As your jobs come in, they are placed first on your personal workers. Once those workers fill up, an additional job will wait until either a personal or shared worker is free (whichever comes first).

Language Integration

cloud allows even the most complex Python functions to run on PiCloud. cloud will analyze functions passed in via cloud.call() or cloud.map() to determine what additional information must be sent. What is going on is best explained via an example:

import sys
import mymodule #some user module that defines a function 'bar'
k = 2

def foo(x):
  return sys.version, mymodule.bar(k + x)

jid = cloud.call(foo,20)

To evaluate foo on PiCloud, several pieces of information need to be known:

  1. The function code of foo
  2. The value of the global variable k
  3. This contents of modules sys and mymodule.

cloud handles the function code by serializing it with its extended pickler. The pickler analyzes function code to identify what variables are used. Those identified variables are in turn transmitted to PiCloud; in this case cloud will detect that global variable k is used and will consequently transmit the value of 2.

Modules are handled in a similar way. The function is analyzed to see what references to modules it has: in the above example, sys and mymodule. Modules that PiCloud already has installed (e.g. sys) are ignored, but ones that are not (mymodule) are analyzed. cloud searches code within non-installed modules (mymodule) for import statements; cloud recursively transmits and analyzes any custom modules found. Transmitted modules are cached on PiCloud and updated whenever changed locally.

Note

All modules are evaluated on PiCloud. Consequently, in the above example, cloud.result(jid)[0] will return the Python interpreter version running on PiCloud, not the one running locally.

C extensions

While the cloud module is able to transmit any code written in Python, the same is not true with custom extensions written in C/C++. Custom extensions cannot be automatically uploaded to PiCloud.

If you use an extension not already installed on PiCloud, you will need to upload it via the web interface.

Data flow Model

PiCloud has been designed for highly parallelizable tasks that have little interaction between each other. The overwhelming majority of problems fall into this category.

To make the optimal use of PiCloud, it is paramount to understand the underlying data flow model, which will be discussed in this section through the use of several diagrams.

Basics

This section will discuss the simplest, and most common, use case of PiCloud - where jobs do not communicate with any external data sources (cloud.files is considered an external data source).

Under this model, jobs running on PiCloud receive data only before executing in the form of the function and its arguments. The only data transmitted is the return value of the function (as well as standard out and error). Jobs do not communicate with each other or any external source. Any side-effects of the job (e.g. modifying global variables) is not seen by the client. This model is a slightly more general case of functional programming

Consider the following diagram, which shows the interaction between a client and PiCloud. Here, the client has generated Job 1 via a single cloud.call() argument, represented by the blue arrow.

_images/base_case.PNG

The red arrow from Job 1 to the client indicates that the client knows the jid (Job Identifier) of Job 1; consequently, it can access information about the job (status, result, etc.) and even manipulate the job (e.g. kill, delete).

Note the lack of a red arrow from Job 1 to itself. There is none because Job 1 is not allowed to interact with itself (it wouldn’t make much sense to access your own result while you are running!). In fact, Job 1 cannot retrieve its own jid.

Jobs are just another Client

A key part of PiCloud’s transparency is the little difference between functions running on the client and on PiCloud. Indeed, jobs running on PiCloud can make use of cloud and even spawn new jobs. Consider the following code:

def job1():
        jid2 = cloud.call(job2)
        return jid2 #returns jid of job2!

def job2():
        return 1

if __name__ == '__main__': #Client entry point
        jid1 = cloud.call(job1)  #jid of job1
        jid2 = cloud.result(jid1) #job1 returns jid of job2

Here Job 1 has used cloud.call() to create another job, Job 2. This can be represented with the following diagram:

_images/subjob_passthrough_2.PNG

Note that even though job1() was running on PiCloud, it could, like the client, create additional jobs. As the digram shows, Job 1 can trivially access information about Job 2, as cloud.call() returned Job 2’s jid. Because (and only because) Job 1 in turn returns Job 2’s jid, the client can also access Job 2, as represented by the arched arrow.

Job 2, however, cannot access information about Job 1. As explained earlier, jobs only receive data via their arguments. As Job 1 does not know its own jid, it cannot not pass its jid to Job 2. The attentive reader will note that if Job 2 could access the result of Job 1, deadlock could occur (by jobs 1 and 2 accessing each others’ result).

A similar pattern exists when one job is spawned after another from the client. We can pass the jid from the first job into the second:

def job1():
        return 1

def job2(jid):
        return cloud.result(jid)

if __name__ == '__main__': #Client entry point
        jid1 = cloud.call(job1)  #jid of job1
        jid2 = cloud.call(job2, jid1)
        ret = cloud.result(jid2) #the result of both

Note

This particular code can be optimized by making job2 dependent on job1, as described in the Dependencies section.

The diagram describing the interaction is shown below:

_images/job_parallel_call.PNG

Note that there is no way for job1 to access job2.

Maps

When making a cloud.map() call, multiple jobs are generated simultaneously. For instance, the code:

jids = cloud.map(func,[4,5])
ret = cloud.result(jids)

can be represented with the following diagram, where one job evaluates func(4) and the other func(5):

_images/job_map.PNG

Parallel map jobs are independent of each other and cannot exchange information.

Take Away

The following diagram is the result of the previous examples being merged together. Jobs are created in numeric order. The blue arrows indicate a cloud.call() and the purple arrows indicate a cloud.map(). The source of the red arrow can be accessed by the destination (the bold arrows indicate a direct case, the smaller arrows indicate that access is possible if jids are passed around in the creating jobs’ return values).

_images/job_complex.PNG

This diagram is rather complex, but note how the red arrows form a directed acyclic graph. This property ensures that it is impossible to ever go into deadlock when using the blocking cloud.result().

Dealing with Data

PiCloud’s functional programming model is very powerful, but you may wish to be able to directly read and write data over a network. The easiest way to share data between the client and jobs is with cloud.files. Several other options exist:

Passing through Files

In general, jobs running on PiCloud have no access to the client’s file system. However, read-only files will pass through; the following code is a naive way to transport files from the client to PiCloud:

def foo(f):
        contents = f.read()
        #do more

f = open('data.txt','r')
cloud.call(foo,f)

While this code works as expected, the file isn’t actually copied to PiCloud’s storage. Instead, the file is read into memory as a StringIO and sent to PiCloud where it will also live in memory. It is not cached, so data.txt is retransmitted on every call.

Again, the file cannot be open for writing: you can read more about the limitations here.

File System

PiCloud also provides access to the file system that jobs are running on. However, due to security controls, both reading and writing are somewhat restricted.

Most areas of the file system remain readable. Writing is more limited; three areas are writable - temporary storage, your home directory, and your shared folder.

  1. You can write temporary files with Python’s tempfile interface. Note /tmp is not writable. Temporary files will not be seen by your other jobs.
  2. Your home directory can be accessed via os.environ['HOME'] or os.path.expanduser(~). Contents written here will not be seen by your other jobs and should also be viewed as temporary.
  3. You are provided with a shared folder inside your home directory (at os.path.expanduser(~/shared)). Contents written here are saved to PiCloud’s distributed file system and replicated across all jobs. As this storage is persistent, the monthly storage fee applies.

Volatile storage (locations 1 and 2) and shared storage (location 3) each have their own 1 GB quotas. In general, cloud.files is the preferred data storage location; please contact PiCloud if you need a larger quota.

Note

There is no way for the client to directly access to PiCloud’s distributed file system. Consequently, we advise you to stick with cloud.files whenever possible.

External DB

With PiCloud, it is possible to read and write to external data sources. There are several reasons why you may want to do this:

  1. You already have a database for an application and prefer to read and write data directly into it.
  2. You are scraping data from external sources (e.g. websites)
  3. You wish to have communication between jobs and the client.

In general, it is easy to communicate with external sources. As PiCloud allows you open any connection, your existing communication code will operate seamlessly on PiCloud. (As long as no firewall policies block PiCloud!)

Be wary of using external connections. Letting jobs (and other entities, such as the client) communicate with each other through external means (including cloud.files!) complicates the data model, as this modified map example shows:

_images/job_map_db.PNG

Warning

Jobs can only open outbound connections. It is not possible to listen for incoming connections on a port.