Primer

Welcome! The PiCloud Primer shows you how to get started with our most commonly used tools.

Definitions – Don’t Skip!

We use these terms everywhere, so it’s best you learn them now:

  • Client is you, or any machine you configure to use PiCloud.

  • cloud is our Python package that a client can use to execute Python functions on PiCloud. Examples using cloud will use the Python interpreter.

    >>> import cloud
    
  • picloud is our command-line interface (CLI) that a client can use to execute programs on PiCloud. Examples using picloud will use your shell.

    $ picloud
    

    If you’re a Windows user, your command prompt will work equally well with picloud.exe.

Creating a Job from Python



To use PiCloud from Python, you designate a function that you want to run on the cloud instead of on your own machine. Here we’ll walk through getting a simple function running on the cloud.

Open Python interactively and define the add function.

>>> def add(x, y):
...     return x + y

Normally, you would just run the function locally by calling it:

>>> add(1, 2)
3

If you want to run it on the cloud, just pass your function into cloud.call():

>>> jid = cloud.call(add, 1, 2)

That’s it! You pass arguments to add() by passing them, in the same order, to cloud.call(). Keyword arguments are just as easy: cloud.call(add, x=1, y=2).

cloud.call() is non-blocking; it returns immediately without waiting for add to actually run on the cloud. To verify, try this:

>>> import time
>>> time.sleep(30) # will sleep for 30 seconds
>>> cloud.call(time.sleep, 30) # returns immediately

By returning immediately, cloud.call() can’t give you the result of your function. What it returns instead is an integer jid (Job IDentification).

>>> print jid
1

For the remainder of the Primer, let’s assume the jid is 1. We’ll show you what you can do with a jid after we show how to create a job from the Shell. Can’t wait? Go to Using the Job Id.

Automagic Dependency Transfer

It’s important to understand that “getting on the cloud” is in fact three distinct steps.

[Diagram: write a function -> deploy it on the cloud -> run it (the deploy and run steps are shaded)]

Writing add() was step 1. cloud.call() handled steps 2 and 3, both deployment and running. cloud.call() handles deployment by automatically transferring the pure-Python source code and bytecode necessary to execute your function from your machine to our servers. While this works for simpler use cases, it does not handle non-Python libraries and binaries, nor does it handle the transfer of datasets from your local filesystem.

To install programs and libraries with root access, you’ll need to use an Environment. For deploying data, see Data Storage.

Creating a Job from the Shell



The PiCloud command-line interface (CLI), picloud, is the preferred tool when you want to run a program, whether compiled or interpreted, on the cloud. No Python required. This is also the underlying mechanism used when deploying R or Java.

While the primer illustrates the basic features of the CLI, we recommend reading Deploying an Application to see how to use the CLI in conjunction with other PiCloud tools to deploy complex workloads.

Here we’ll walk through a simple example using the echo program (available by default), which simply prints the input you give it to standard output.

Open a shell and run echo locally with argument hello, world:

$ echo hello, world
hello, world

If you want to run echo on the cloud, use the same command-line string after picloud exec.

$ picloud exec echo hello, world
2

Just like cloud.call(), picloud exec is non-blocking. The integer returned is the jid; we assume it’s 2 for convenience.

You can also create template {variables} and pass arguments into your program with the -d option. Template variables become useful when executing the same program with many different arguments using Mapping. Here we construct ‘hello, world’ by substituting ‘hello’ and ‘world’ into two template variables.

$ picloud exec -d w1='hello' -d w2='world' echo {w1}, {w2}

In our examples, we’ll often enclose our picloud exec in backticks (not available in the Windows Command Prompt) to save the job id as an environment variable:

$ JID=`picloud exec echo hello, world`
$ echo $JID
2

To construct more sophisticated jobs from the shell, see Using a Script.

Deploying Your Program

Unlike cloud.call(), the CLI does not deploy any dependencies automatically.

[Diagram: write a function -> deploy it on the cloud -> run it (only the run step is shaded)]

The preceding example worked because echo already resides by default on PiCloud. To see what programs are there by default, see the contents of a Base Environment.

If you have a program you want to use that is not already on PiCloud, use either:

  1. An Environment if you need root access.
  2. A Volume to synchronize folders as shown in Deploying an Application.

Using the Job Id

Job identifiers are unique to your account. Your first job has jid 1, and the jid increments sequentially with each new job. All of PiCloud’s job-related facilities use jids. We’ll explore a few of them now.

Querying a Job’s Status

Below is a diagram of the possible statuses a job can have once it is created.

[Diagram: job status transitions. A new job enters queued or waiting; waiting leads to queued, stalled, or killed; queued leads to processing or killed; processing ends in done, error, or killed. The finished statuses done, error, stalled, and killed are drawn as squares; queued, processing, and done are shaded.]

A job spends a variable amount of time in various statuses before it is finished (finished statuses are shown as squares), at which point its status becomes permanent. Only then will its result, or its reason for failure, be available. The path of gray elements, queued -> processing -> done, is the most common. The full definition of statuses follows:

Status        Definition
waiting       Job is waiting until its dependencies are satisfied.
queued        Job is in the queue waiting for a free core.
processing    Job is running.
done          Job completed successfully.
error         Job errored (typically due to an uncaught exception).
killed        Job was aborted by the user.
stalled       Job will not run due to a dependency erroring.

To query a job’s status in Python:

>>> cloud.status(1) # job is still running
'processing'
>>> cloud.status(1) # job has finished
'done'

In the shell, you can equivalently do:

$ picloud status 1
done

Querying a Job’s Result

To get the result of the functions we ran earlier, use cloud.result():

>>> cloud.result(1)
3
>>> cloud.result(2)
'hello, world'

In the shell, you can equivalently do:

$ picloud result 1
3
$ picloud result 2
hello, world

Result calls block until the job has finished and its result is therefore ready.

Notice that cloud.result() and picloud result can each query the results of jobs created by the other. Because the result of a job created by picloud exec is always a string, cloud.result() can always bring it natively into Python. However, picloud result can only return the result of a job created by cloud.call() if that result is JSON serializable.

Waiting for a Job to Finish

If you just want to wait for job 1 to finish, without retrieving the result, use cloud.join():

>>> cloud.join(1)

In the shell, you can equivalently do:

$ picloud join 1

Warning

Do not use this to create a pipeline of consecutive jobs. For this, use dependencies.
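For example, a two-stage pipeline is better expressed by declaring the dependency when the second job is created. A minimal sketch, assuming the _depends_on keyword argument described under dependencies:

>>> jid_first = cloud.call(add, 1, 2)
>>> # the second job stays in the 'waiting' status until jid_first finishes
>>> jid_second = cloud.call(add, 3, 4, _depends_on=jid_first)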

Viewing a Job in the Dashboard

On the Job Dashboard, the two jobs you have created are listed. The leftmost column shows the job’s id, and the rightmost column its status. To see a detailed report for a job, just click on its jid.

[Screenshot: Job Dashboard]

Batch Queries

Any function or command that operates on a job id can accept multiple of them. In Python, you can use any iterable:

>>> cloud.status([1,2]) # use a list
['done', 'done']
>>> cloud.result((1,2)) # use a tuple
[3, 'hello, world']
>>> cloud.join(xrange(1,3)) # use a sequence

In the shell, you can use a combination of ranges (smaller_jid-larger_jid inclusive) and single jids.

$ picloud status 1,2
jid         status
1           done
2           done
$ picloud result 1-2
Result for jid 1:
3
Result for jid 2:
hello, world
$ picloud join 1,2-3 # if jid 3 exists

There are two reasons to use batch queries:

  1. It’s more convenient to grab all results in one query rather than looping over individual calls.
  2. It’s more efficient. Every query to PiCloud incurs a round-trip network delay: the time it takes for a packet sent by your machine to reach PiCloud, plus the time it takes for a packet to travel from PiCloud back to you. Making 10,000 individual queries takes 10,000 round trips, while a single batch query for 10,000 jobs takes only 1. If a single round trip takes 50 ms, a batch query reduces the network overhead from (10,000 round trips * 50 ms) = 500 seconds to 50 ms, as the sketch below illustrates.
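To make the difference concrete, here is a sketch of both approaches using cloud.status(); the 10,000 jids are hypothetical:

>>> jids = range(1, 10001)                          # 10,000 hypothetical jids
>>> # one query per jid: one round trip each
>>> statuses = [cloud.status(jid) for jid in jids]
>>> # one batch query: a single round trip
>>> statuses = cloud.status(jids)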

Mapping

Mapping is the preferred way to run a function or command with many different input arguments. In Python, you would normally use the built-in map() function. Here’s an example of it in action:

>>> map(add, [1,2,3], [1,2,3])
[2, 4, 6]

This is equivalent to calling add three times with the respective arguments:

>>> [add(1,1), add(2,2), add(3,3)]
[2, 4, 6]

To run add on the cloud, just replace map with cloud.map().

>>> jids = cloud.map(add, [1,2,3], [1,2,3])

This creates three jobs, each executing add but with different arguments. Because three jobs are created, three jids are returned.

>>> print jids
xrange(3,6)
>>> print list(jids)
[3,4,5]

Since cloud.result() can take in a list of jids (see Batch Queries), we can do the following to get all the results:

>>> cloud.result(jids)
[2, 4, 6]

cloud.map() is designed for both ease-of-use and speed when applying the same function to a list of data. It is preferable to calling cloud.call() three times like so:

>>> jids = []
>>> for x, y in [(1,1), (2,2), (3,3)]:
...     jids.append(cloud.call(add, x, y))

Not only is this less compact, but it will make three separate calls to PiCloud, each incurring a round trip network delay. This is explained in Benefits of Batch Queries.

Check out our Mapping Tips & Tricks to learn more about using cloud.map().

In the shell, the equivalent command is picloud mapexec. The -n flag maps a variable to a comma-delimited list of values; the nth mapjob uses the nth value for its variable. Here we create two jobs using the -n flag: one with argument ‘hello’ for w1, and the other with argument ‘goodbye’ for w1. Since we used the -d flag to specify w2, w2 is set to ‘world’ for both jobs.

$ picloud mapexec -n w1='hello','goodbye' -d w2='world' echo {w1}, {w2}
6-7
$ picloud result 6-7
Result for jid 6:
hello, world

Result for jid 7:
goodbye, world

If you have many values, it may be tedious to delimit all of them with commas on the command line. You may wish to use a file to specify your arguments, as described in our Additional Mapexec Features.

Job Error Handling

A job is marked as error if:

  • Python: An uncaught Exception is raised during a job’s execution.
  • Shell: The program completes with a non-zero exit status.

Here are two examples of jobs that error:

>>> def fail():
...    return 1 + 'bob'

>>> cloud.call(fail)
8
$ picloud exec ls /dir-that-does-not-exist
9

When calling cloud.result or cloud.join on a job that has errored, a CloudException will be raised with the traceback:

>>> cloud.result(8)
CloudException: Job 8: Traceback (most recent call last):
  File "/usr/local/picloud/.employee/pimployee/job_util.py", line 114, in process_job
    result = func(*args, **kwargs)
  File "<ipython-input-2-5b9e929b5f82>", line 2, in fail
TypeError: unsupported operand type(s) for +: 'int' and 'str'

When executing picloud result or picloud join on a job that has errored, the return code will be 3.

$ picloud result 9
CloudException: command terminated with nonzero return code 2
$ echo $?
3

If you want the standard error of the job as diagnostic information, see Query Job Information.
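As a rough sketch, assuming the cloud.info() interface covered under Query Job Information (the exact argument names and return format may differ in your client version), fetching a failed job’s standard error might look like:

>>> # returns the requested fields, keyed by jid (assumed interface)
>>> cloud.info(9, ['stderr'])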

More Horsepower

Choose a Core Type

PiCloud offers five types of cores for you to use, each with different characteristics. They are:

Type           Use Case       Compute Resources
c1 (default)   Prototyping    1 compute unit, 300 MB of memory, low I/O performance.
c2             Simple Tasks   2.5 compute units, 800 MB of memory, medium I/O performance.
f2             Well-rounded   4 – 5.5 compute units, 3.7 GB of memory, medium I/O performance. Hyperthreading. See note.
m1             Memory-bound   3.25 compute units, 8 GB of memory, high I/O performance.
s1             Scraping       Variable (max 2 c.u.), 300 MB of memory, low I/O performance, unique IP address per concurrently running job.

A compute unit is defined by Amazon as providing the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Xeon processor. The cost for each core can be found on our Pricing Page. If these cores have insufficient speed, RAM, or I/O performance, try using several of them at once by Using Multicore.

By default, jobs use the c1 core. To use a different core in Python, use the _type keyword:

>>> cloud.call(f, _type='f2')
>>> cloud.map(f, datapoints, _type='m1')

In the shell, use the -t argument:

$ picloud exec -t f2 program

Note

While the f2 core advertises up to 5.5 compute units, a job that uses a single thread on a single processor will not see a 70% improvement over the m1 core. Roughly 30% of the f2 speedup is contingent on having a multi-threaded program that can take advantage of hyperthreading. f2 cores will also see a larger speedup when Using Multicore.

Note

Unlike other core types, the c1 is not tied to a physical core. It actually represents a “virtual core” where up to 2.5 c1 core jobs are assigned to a single physical c2 core. At times, extra capacity may be available to allow a c1 job to burst to 2.5 compute units; however, only 1 compute unit of performance is guaranteed.

Warning

The performance of the s1 core is highly variable; it can differ by orders of magnitude. We only advise using it if you need a unique IP address per job; otherwise, please use the more reliable c1.

Using Multicore

By default, a job uses a single core of the type specified by Choose a Core Type. Choosing to use N cores gives your job access to N times the resources: compute power, RAM, and I/O performance. You should consider using multiple cores if your needs fall into one of the following:

  • You want to speed up your Python jobs that use performance-focused libraries such as numpy. These libraries release the GIL whenever possible, which means they can potentially leverage multiple cores.
  • You want to speed up non-Python jobs. Some programs will use as many cores as you can throw at them. Check the specific program you’re using.
  • Your job needs more RAM. You can use multiple cores to pool resources together; the most RAM a single core offers is 8 GB (m1), and you can pool 8 m1 cores together for 64 GB.

The number of cores you can assign to a job depends on the type of core being used. The following table shows which combinations are valid:

Type   Supported Multiples
c1     1
c2     1, 2, 4, or 8
f2     1, 2, 4, 8, or 16
m1     1, 2, 4, or 8
s1     1

If you want to assign multiple cores to a job, use the _cores keyword in Python:

>>> cloud.call(f, _type='f2', _cores=4)
>>> cloud.map(f, datapoints, _type='f2', _cores=4)

We use the f2 core since the default c1 core does not support multicore.

In the shell, use the -c argument:

$ picloud exec -t f2 -c 4 program

Note

How much does Multicore Help?

The benefits of multicore depend on how your application is written. The easiest way to assess them is to run your job twice, once with one core and once with multiple cores. Based on how much faster the multicore job ran, if at all, you can judge for yourself whether the speedup justifies using multiple cores.
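For instance, a crude comparison can be made from your client by timing the same call at two core counts. This is only a sketch, reusing the placeholder function f and the f2 core from the examples above; wall-clock time measured this way also includes time spent queued, so treat the comparison as rough.

>>> import time
>>> def timed_run(cores):
...     start = time.time()
...     jid = cloud.call(f, _type='f2', _cores=cores)
...     cloud.join(jid)  # block until the job finishes
...     return time.time() - start

>>> timed_run(1), timed_run(4)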

Warning

Difficulty Scheduling

Because multicore requires all cores to be free on a single server, your jobs may be queued for a long period of time. To guarantee timely scheduling of multicore jobs, reserve at least as many Realtime Cores as your largest multicore multiple.

Job Scheduling

As you run more jobs, you’ll notice that jobs do not execute immediately. They enter a global queue shared by all users until there is available compute power. To understand what to expect, and how to get guaranteed capacity, see Realtime Cores.

Characteristics of an Ideal Job

Whether you’re using cloud for Python or picloud for shell, here are a couple of guidelines for your jobs:

  • Jobs should generally take at least a second to run. This ensures that the overhead of sending your job to PiCloud, processing it internally, and returning the result to you does not exceed the speedup of using PiCloud in the first place.
  • If you need to send data (> 1 MB) with each job, each job should have an even longer minimum runtime to account for the overhead of transferring the data to PiCloud. Otherwise, you might as well skip the transfer and spend that time running your job locally instead.

If you are using Python and you find that your jobs’ runtimes are too short, check out our section on Argument Chunking.

Handling Data

The most basic data storage PiCloud provides is transparently integrated into creating a job, and storing its result. When a job is created, the arguments to the function or command are stored automatically. Likewise, when a job completes, the return value is stored, which you can then retrieve (Querying a Job’s Result).

This basic method of input/output is limited. Inputs are by default restricted to 1 MB (this can be increased to 16 MB by modifying the Configuration File), while outputs are limited to 128 MB. Most users will eventually need more flexibility with their data storage.

The questions you should ask yourself when thinking about data are:

  1. Will all the data a job needs be sent with the basic input/output system?
    • If so, am I okay with the overhead of uploading the dataset each time a job is created?
    • If not, see #3.
  2. Does my dataset already reside on the cloud? For example, a Mongo or MySQL store.
    • How can I get PiCloud to access the data? See Use Your Own database.
  3. I want to upload my data once, so that many jobs can use the dataset without re-uploading it each time.
    • Check out our Data Storage solutions. Your Bucket will be easiest to set up (see the sketch after this list).
    • If all your data is in flat files, consider using a Volume.
    • Or you may want to move your dataset to the cloud your own way; see #2.
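For example, if your client includes the cloud.bucket interface described under Data Storage, an upload-once workflow might look roughly like the following sketch. The function names cloud.bucket.put() and cloud.bucket.get(), the dataset.csv file, and the summarize() helper are all assumptions here; consult Data Storage for the actual interface.

>>> cloud.bucket.put('dataset.csv')      # upload once from your client (assumed interface)

>>> def process():
...     cloud.bucket.get('dataset.csv')  # fetch inside the job; no per-job re-upload
...     return summarize('dataset.csv')  # summarize() is a hypothetical user function

>>> jid = cloud.call(process)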

Understanding Dependencies

Pure-Python Dependencies When Using cloud

Functions you offload using cloud will generally depend on various Python packages. If those dependencies are pure Python (not C-extensions), they are automatically transferred from your client to our servers and cached; they are only uploaded again if you modify the code. We even track different versions of your dependencies in case different jobs use different versions.

This does not apply to non-Python dependencies, such as external programs, or Python dependencies that require compilation, such as C-extensions.

All Other Dependencies

If you’re using the CLI, or cloud with dependencies it cannot automatically transfer, you’ll want to understand how dependencies are handled.

Fundamentally, dependencies are just files that a job needs to be able to find on the filesystem. These files may be programs or libraries. To modify the filesystem your job sees to contain the dependencies you need, use an Environment.

If you’re wondering what filesystem is used when you do not have an environment, the answer is a Base Environment, which has a set of pre-installed programs.

Example

>>> import sys

>>> # pure python module. contains function bar.
>>> import mymodule

>>> # python packages with c-extensions
>>> import mycmodule
>>> import numpy

>>> k = 2

>>> def foo(x):
...     return sys.version, mymodule.bar(k + x)

>>> jid = cloud.call(foo, 20)

In this example, foo will run successfully on PiCloud. Take note of the following:

  • Since foo uses the global variable k, k is pickled and sent to PiCloud as part of cloud.call().
  • If k were an object that could not be pickled, cloud.call() would raise an error.
  • Since mymodule is in pure Python, it will be sent to PiCloud as part of the cloud.call().
  • If mycmodule were used instead, the job would error with a “Cannot import mycmodule” error, since mycmodule’s use of C-extensions means it cannot be sent to PiCloud. To use mycmodule you will need an Environment.
  • If numpy were used, it would not be sent over for two reasons. First, it is a C-extension. Second, since it’s commonly used, it’s pre-installed on our Base Environment, which avoids the “Cannot import” error. However, the version of numpy we have installed may differ from the one you’re using, which may cause errors due to API changes. Use an Environment to install a specific version of numpy.

Deploying Applications

The preceding information takes a specific angle: how to add hooks into your program to offload computation to PiCloud. For a perspective on moving an application to the cloud, see Deploying an Application.

Where does a Job Run?

When there’s room on a node that executes jobs (“Worker Node”), we assign your job to it. Your job runs in a Linux Container (LXC) on the worker node. The use of containers is completely transparent.

Once executing, a job is a Linux process running on the worker node, much as it would be on your machine. The job runs as a regular user (not root) whose name is arbitrary (all names have the prefix emp). If the job checks the list of processes, it will see only a few, as the entire system is dedicated to running the job.

The filesystem of the container is an Ubuntu Linux system (Base Environment). You can modify the filesystem the job sees to include dependencies by using an Environment.

To get a feel for where a job runs, check out how to SSH into a Job or use a Notebook.

Jobs Creating Jobs

Jobs that you create can themselves create jobs. This is desirable because once a job is executing on PiCloud, it has a high-bandwidth, low-latency connection with the rest of PiCloud.

Here’s an example in Python:

>>> def g():
...    return 'i am the child of f'

>>> def f():
...   return cloud.call(g)

>>> jid_f = cloud.call(f)
>>> jid_g = cloud.result(jid_f)
>>> cloud.result(jid_g)
'i am the child of f'

Here’s an example in the shell:

$ JID1=`picloud exec 'picloud exec echo hello world'`
$ JID2=`picloud result $JID1`
$ picloud result $JID2
hello world