This system overview assumes that you have familiarized yourself with PiCloud by reading through the Client Basics and Examples.
PiCloud consists of both a cluster of servers (the cloud) and a Python package that allows clients to interoperate with those servers.
The PiCloud client, cloud, is a Python package which allows users of PiCloud to run almost any functions defined within Python on PiCloud’s cloud. It has a variety of tasks, including passing commands to and from PiCloud’s service, (de)serializing data, caching results, introspecting code, and providing an additional debugging interface. More details are described below.
PiCloud’s server architecture consists of a cluster of highly scalable components, including databases, load balancers, distributed file systems and “worker machines”. These components work together to deliver PiCloud’s job execution, monitoring, and auto-scaling platform. PiCloud abstracts all of these away, so you can run code on the cloud without worrying about the details of server management.
Each job (a function and its arguments) is sent to PiCloud and evaluated on a “worker”. The worker is an Amazon EC2 virtualized instance running a 64 bit version of Ubuntu Linux.
A job is run by executing it within a Python interpreter (it being the function passed through cloud.call – there is no __main__ on PiCloud). Python 2.5 or 2.6 functions are executed in a CPython2.6 interpreter within an Ubuntu 10.10 environment, while Python 2.7 functions are executed in a CPython2.7 interpreter within an Ubuntu 11.4 environment. A substantial amount of packages are installed on the worker and available to any job:
Pure-python packages not pre-installed are handled by the mechanisms described below. All other types of packages and binaries can be deployed to PiCloud by extending the environment.
Workers are multi-tenant in that multiple users’ jobs are run simultaneously, sandboxed within their own interpreter and Linux user accounts. To optimize job startup time, your jobs will sometimes run in the same interpreter as your previous jobs. If you change your source code (client-side), all of your held-open interpreters will shut down, ensuring that jobs are run under the correct code. If desired, you can disable this optimization with the _kill_process command.
Additional security is provided via POSIX permissions and AppArmor . Available CPU power is split amongst the different jobs, see the high cpu option for more detail. All jobs are guaranteed at least their requested CPU power (by default, 1 Amazon EC2 compute unit, equivalent to to a 1-1.2 GHz 2007 Xeon), but may receive more if resources are available.
Most of the time, there will be space available on worker machines to run your job immediately after its creation. However, when there is not, the job will be placed in our queue. Be aware that this is a common queue that holds all jobs submitted by all users; consequently your jobs wait not only on your own jobs to complete but also those of other users.
As the queue grows, PiCloud adds additional workers machines to handle the load; this auto-scaling allows for more of your jobs to run in parallel. The actual mechanics of the auto-scaling are a bit complicated, but are basically being driven by two competing factors: our desire to run your computation as fast as possible and that the minimum time a server can be provisioned for is one hour. As an example, if you submit 100 1 hour jobs to us, we will try to provide you with 100 cores concurrently; however, if you give 100 2 minute jobs, you are unlikely to receive 100 concurrent cores automatically as it would be exceedingly costly for us to provision such resources.
Warning
Initially dividing each job into multiple jobs will achieve greater parallelism. However, at some point, PiCloud will not be able to cost-effectively provision the needed additional worker machines, unless you acquire real-time cores.
To guarantee immediate execution, you can purchase Real-Time cores. While N Real-Time cores are active, you will always be able to run at least N jobs in parallel. Real-Time cores effectively reserve workers for you. As your jobs come in, they are placed first on your personal workers. Once those workers fill up, an additional job will wait until either a personal or shared worker is free (whichever comes first).
With realtime cores, you can guarentee that thousands of jobs processs in parallel!
Note
Realtime cores operate on a per compute type basis. For instance, allocating c1 cores has no impact on c2 jobs.
For more information about scheduling, please see our blog post.
cloud allows even the most complex Python functions to run on PiCloud. cloud will analyze functions passed in via cloud.call() or cloud.map() to determine what additional information must be sent. What is going on is best explained via an example:
import sys
import mymodule #some user module that defines a function 'bar'
k = 2
def foo(x):
return sys.version, mymodule.bar(k + x)
jid = cloud.call(foo,20)
To evaluate foo on PiCloud, several pieces of information need to be known:
cloud handles the function code by serializing it with its extended pickler. (The process of serializing Python objects is known as pickling; in this documentation, serialization and pickling are used interchangeably.) The pickler analyzes function code to identify what variables are used. Those identified variables are in turn transmitted to PiCloud; in this case cloud will detect that global variable k is used and will consequently transmit the value of 2.
Modules are handled in a similar way. The function is analyzed to see what references to modules it has: in the above example, sys and mymodule. Modules that PiCloud already has installed (e.g. sys) are ignored, but ones that are not (mymodule) are analyzed. cloud searches code within non-installed modules (mymodule) for import statements; cloud recursively transmits and analyzes any custom modules found. Transmitted modules are cached on PiCloud and updated whenever changed locally.
Note
All modules are evaluated on PiCloud. Consequently, in the above example, cloud.result(jid)[0] will return the Python interpreter version running on PiCloud, not the one running locally.
While the cloud module is able to transmit any code written in Python, the same is not true with custom extensions written in C/C++. Custom extensions cannot be automatically uploaded to PiCloud. If yours is not already installed on PiCloud (2.6/2.7), you will need to extend the environment.
PiCloud has been designed for highly parallelizable tasks that have little interaction between each other. The overwhelming majority of problems fall into this category.
To make the optimal use of PiCloud, it is paramount to understand the underlying data flow model, which will be discussed in this section through the use of several diagrams.
This section will discuss the simplest, and most common, use case of PiCloud - where jobs do not communicate with any external data sources (cloud.files is considered an external data source).
Under this model, jobs running on PiCloud receive data only before executing in the form of the function and its arguments. The only data transmitted is the return value of the function (as well as standard out and error). Jobs do not communicate with each other or any external source. Any side-effects of the job (e.g. modifying global variables) is not seen by the client. This model is a slightly more general case of functional programming
Consider the following diagram, which shows the interaction between a client and PiCloud. Here, the client has generated Job 1 via a single cloud.call() argument, represented by the blue arrow.
The red arrow from Job 1 to the client indicates that the client knows the jid (Job Identifier) of Job 1; consequently, it can access information about the job (status, result, etc.) and even manipulate the job (e.g. kill, delete).
Note the lack of a red arrow from Job 1 to itself. There is none because Job 1 is not allowed to interact with itself (it wouldn’t make much sense to access your own result while you are running!). In fact, Job 1 cannot retrieve its own jid.
A key part of PiCloud’s transparency is the little difference between functions running on the client and on PiCloud. Indeed, jobs running on PiCloud can make use of cloud and even spawn new jobs. Consider the following code:
def job1():
jid2 = cloud.call(job2)
return jid2 #returns jid of job2!
def job2():
return 1
if __name__ == '__main__': #Client entry point
jid1 = cloud.call(job1) #jid of job1
jid2 = cloud.result(jid1) #job1 returns jid of job2
Here Job 1 has used cloud.call() to create another job, Job 2. This can be represented with the following diagram:
Note that even though job1() was running on PiCloud, it could, like the client, create additional jobs. As the digram shows, Job 1 can trivially access information about Job 2, as cloud.call() returned Job 2’s jid. Because (and only because) Job 1 in turn returns Job 2’s jid, the client can also access Job 2, as represented by the arched arrow.
Job 2, however, cannot access information about Job 1. As explained earlier, jobs only receive data via their arguments. As Job 1 does not know its own jid, it cannot not pass its jid to Job 2. The attentive reader will note that if Job 2 could access the result of Job 1, deadlock could occur (by jobs 1 and 2 accessing each others’ result).
A similar pattern exists when one job is spawned after another from the client. We can pass the jid from the first job into the second:
def job1():
return 1
def job2(jid):
return cloud.result(jid)
if __name__ == '__main__': #Client entry point
jid1 = cloud.call(job1) #jid of job1
jid2 = cloud.call(job2, jid1)
ret = cloud.result(jid2) #the result of both
Note
This particular code can be optimized by making job2 dependent on job1, as described in the Dependencies section.
The diagram describing the interaction is shown below:
Note that there is no way for job1 to access job2.
When making a cloud.map() call, multiple jobs are generated simultaneously. For instance, the code:
jids = cloud.map(func,[4,5])
ret = cloud.result(jids)
can be represented with the following diagram, where one job evaluates func(4) and the other func(5):
Parallel map jobs are independent of each other and cannot exchange information.
The following diagram is the result of the previous examples being merged together. Jobs are created in numeric order. The blue arrows indicate a cloud.call() and the purple arrows indicate a cloud.map(). The source of the red arrow can be accessed by the destination (the bold arrows indicate a direct case, the smaller arrows indicate that access is possible if jids are passed around in the creating jobs’ return values).
This diagram is rather complex, but note how the red arrows form a directed acyclic graph. This property ensures that it is impossible to ever go into deadlock when using the blocking cloud.result().
PiCloud’s functional programming model is very powerful, but you may wish to be able to directly read and write data over a network. The easiest way to share data between the client and jobs is to store it on PiCloud’s S3 store with cloud.files. Several other options exist:
In general, jobs running on PiCloud cannot access the client’s file system. However, read-only files will pass through; the following code is a naive way to transport files from the client to PiCloud:
def foo(f):
contents = f.read()
#do more
f = open('data.txt','r')
cloud.call(foo,f)
While this code works as expected, the file isn’t actually copied to the job’s file system. Instead, the file is read into memory as a StringIO and sent to PiCloud where it will also live in memory. It is not cached, so data.txt is retransmitted on every call.
Again, the file cannot be open for writing: you can read more about the limitations here.
PiCloud also provides access to the file system managed by the OS that jobs run within. Be aware though that PiCloud’s file system is not persistent across jobs. Once your job completes, everything you wrote will be deleted.
If you need storage persistent across jobs, please use cloud.files.
You can also place small data files on PiCloud in a custom environment.
With PiCloud, it is possible to read and write to external data sources. There are several reasons why you may want to do this:
In general, it is easy to communicate with external sources. As PiCloud allows you open any connection, your existing communication code will operate seamlessly on PiCloud. (As long as no firewall policies block PiCloud!)
Be wary of using external connections. Letting jobs (and other entities, such as the client) communicate with each other through external means (including cloud.files!) complicates the data model, as this modified map example shows:
Warning
Jobs can only open outbound connections. It is not possible to listen for incoming connections on an arbitrary port.