Advanced Job Management

In addition to the basic job operations covered in the Primer, we offer a variety of advanced functions and commands to assist you in managing your computation.

Query Job Information

Using info, you can see a variety of output and statistics collected for each job. info can be queried while a job is running for real-time data, or after it has finished. The queryable information is as follows:

Info                Description
stdout              Standard output of a job (last 64k chars)
stderr              Standard error of a job (last 64k chars)
logging             Messages from the Python logging module (last 64k chars)
pilog               Messages from PiCloud about your job
syslog              System log (last 64k chars)
runtime             Wall-clock time, in seconds, that the job took to finish
status              The status of the job; also retrievable by Querying a Job’s Status
profile             If profiling was enabled, profile of the function that ran
cputime.user        CPU time spent in user space
cputime.system      CPU time spent in kernel space
memory.failcnt      Number of times memory allocation failed
memory.max_usage    Peak amount of memory used during job processing, in bytes
swap.max_usage      Peak amount of swap used during job processing, in bytes
memory.usage        Current memory usage, in bytes (only valid when processing)
swap.usage          Current swap usage, in bytes (only valid when processing)
disk.usage          Current disk usage, in bytes (only valid when processing)
ports               Dictionary of ports the job is listening on (only valid when processing)
attributes          All attributes of the job defined at its creation (e.g. env, vol)
all                 Obtain info about every possible item

To query the information in Python, use the following:

>>> cloud.info(jid, ['stdout', 'memory.failcnt', 'cputime.user'])

In the shell,

$ picloud info -o stdout,memory.failcnt,cputime.user jid

If no specific information is requested, info returns stdout, stderr, runtime, and status by default. info also supports Batch Queries; if a batch query is used, each jid appears as a key in the returned dictionary.

Here’s an example Python program you can use to get the hang of the feature:

>>> import sys
>>> def foo():
...     print "Output"
...     print >> sys.stderr, "An Error"
>>> jid = cloud.call(foo)
>>> cloud.join(jid)
>>> cloud.info(jid, ['stderr', 'stdout'])
{jid: {'stderr': 'An Error\n', 'stdout': 'Output\n'}}
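
Here is a minimal sketch of a batch query, using cloud.map to create several jobs and then querying them all at once (the exact values in your output will differ):

import cloud

def work(x):
    print "processing", x
    return x * x

jids = cloud.map(work, range(3))                 # three jobs, one per input
cloud.join(jids)
info = cloud.info(jids, ['stdout', 'runtime'])   # one dictionary entry per jid
for jid in jids:
    print jid, info[jid]['runtime'], repr(info[jid]['stdout'])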

You can also view output in the web view for jobs.

Warning

If you are inspecting a killed job, note that only information flushed to stdout or stderr will show. Text that was in output buffers when a job was killed will not be shown. In Python, you can avoid this problem by manually flushing output with sys.stdout.flush() and sys.stderr.flush().
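
As a minimal illustration, the sketch below flushes output explicitly so that everything printed so far is visible through info even if the job is killed mid-run:

import sys
import time
import cloud

def chatty_job():
    for i in range(100):
        print "step", i
        sys.stdout.flush()   # make the line visible to info immediately
        time.sleep(1)

jid = cloud.call(chatty_job)
# even if the job is killed mid-run, flushed lines appear in:
# cloud.info(jid, ['stdout'])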

Kill Jobs

Any job can be aborted, whether it’s waiting in the queue, or being processed. There are two reasons why you might want to abort a job:

  1. You no longer want the job to run, or care for its output.
  2. The job is behaving abnormally for unknown reasons related to your code. In the worst case, it is taking too long to run because it has become stuck in an infinite loop.

Aborting a finished job does nothing. Unfinished jobs have their status set to killed.

To abort, use kill. In Python:

import cloud

def infinite_loop():
    while True:
        pass

jid = cloud.call(infinite_loop) # start a job which will never end
cloud.kill(jid)                 # at least until you kill it

In the shell,

$ picloud exec sleep 100
[jid]
$ picloud kill [jid]
$ picloud status [jid]
killed

Processing jobs will receive a SIGTERM signal. If they do not exit willingly within one second, they will be forcibly terminated. Python jobs that exit willingly will also provide an exception traceback showing at what point the SIGTERM was received.
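
If your job needs to clean up when killed, it can trap the SIGTERM itself. The following is a rough sketch using Python’s standard signal module; it assumes the job function runs in the process’s main thread and that your cleanup fits within the one-second grace period.

import signal
import sys
import time
import cloud

def long_job():
    def on_sigterm(signum, frame):
        # flush buffers so the last output is visible in cloud.info,
        # then exit before the one-second grace period expires
        sys.stdout.flush()
        sys.exit(1)

    signal.signal(signal.SIGTERM, on_sigterm)
    while True:
        print "still working"
        time.sleep(5)

jid = cloud.call(long_job)
cloud.kill(jid)   # long_job's handler runs, then the job exits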

kill supports Batch Queries. If you pass no arguments into kill, all unfinished jobs are killed.

Delete Jobs

Use delete to remove all data related to a job from PiCloud’s servers. Only metadata such as the job’s job id, finished status, created time, finish time, and runtime will be maintained for billing purposes. Once a job is deleted, using its job id in future commands will give an error.

In Python,

>>> cloud.delete(jid)

cloud.delete() is also important when using the simulator or cloud.mp. See the memory issues section for more information.

In the shell,

$ picloud delete [jid]

Note

A job must be finished (Querying a Job’s Status) for it to be deleted.
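
For example, a minimal sketch that waits for a job to finish before deleting it:

import cloud

def f(x):
    return x + 1

jid = cloud.call(f, 1)
cloud.join(jid)     # wait until the job is finished
cloud.delete(jid)   # now safe to delete; only billing metadata remains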

Priorities

Use job priorities to affect the order in which your jobs are scheduled for processing.

When you create jobs, they are added to your queue with a default priority of 5. Jobs with the same priority are run in FIFO (First in, first out) order. If you specify a priority, PiCloud will attempt to schedule jobs with the lower priority number first, though order is not guaranteed.

In Python,

>>> cloud.call(f, _priority=1)

In the shell,

$ picloud exec --priority 1 my_program
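
For illustration, the sketch below submits a large batch at a low priority and a separate urgent job; it assumes cloud.map accepts the same _priority keyword as cloud.call.

import cloud

def crunch(x):
    return x ** 2

def urgent_report():
    return "needed now"

# A large batch at a low priority (a higher number means lower priority).
batch_jids = cloud.map(crunch, range(1000), _priority=10)

# An urgent job that should be scheduled ahead of the batch,
# though ordering is not guaranteed.
urgent_jid = cloud.call(urgent_report, _priority=1)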

Labels

When viewing your jobs on the Accounts web page, it is useful to have them labeled so you can easily identify and search for them. Use the _label keyword argument to assign a string label to your job.

In Python,

>>> cloud.call(f, _label='first job')

In the shell,

$ picloud exec --label "first job" my_program

Max Runtime

You can specify a maximum runtime in minutes; if a job exceeds its maximum, it will automatically be killed. This is good practice to ensure that you never accrue a large bill because of a runaway job (a job that runs longer than you expect it to).

There are a couple of broad reasons why you might have a runaway job:

  • Bug in your code causing non-terminating behavior.
  • Unexpected behavior that exhibits itself only when run on PiCloud, and not locally during testing.

Most runaways are caused by the former issue. The latter is exceedingly rare, and should be reported to us with a support ticket, but just in case, we recommend always setting a maximum runtime.

To set a maximum runtime of 1 minute in Python,

>>> cloud.call(f, _max_runtime=1)

In the shell,

$ picloud exec -m 1 my_program

Dependencies

Using dependencies, a job can be held in the queue until one or more other jobs have finished. A common use case is a job that depends on the output of another job.

The following shows a basic structure for using job dependencies in Python with the _depends_on keyword:

import cloud

def f1():
    # do stuff and produce a result for f2 to use
    return 'result of f1'

def f2(jid):
    result_of_f1 = cloud.result(jid)
    # do stuff with result_of_f1
    return result_of_f1

f1_jid = cloud.call(f1)
f2_jid = cloud.call(f2, f1_jid, _depends_on=f1_jid)

In the shell,

$ picloud exec program
[jid_program]
$ picloud exec --depends-on [jid_program] program2 [jid_program]
[jid_program2]

_depends_on (or --depends-on in the shell) accepts either an individual jid or a sequence of jids using the Batch Queries notation.

Policy on Errors

By default, if job B depends on job A and job A errors, job B will never run and its status will be set to stalled. Errors can be ignored (allowing job B to run) by using the _depends_on_errors setting.

In Python:

>>> cloud.call(f, _depends_on=jid, _depends_on_errors='ignore')

In the shell,

$ picloud exec --depends-on-errors ignore program
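
To see the two policies side by side, here is a rough sketch; the statuses in the comments are what you would expect, not captured output.

import time
import cloud

def failing_job():
    raise RuntimeError("job A errors")

def dependent_job():
    return "job B ran"

jid_a = cloud.call(failing_job)
jid_b = cloud.call(dependent_job, _depends_on=jid_a)    # default policy
jid_c = cloud.call(dependent_job, _depends_on=jid_a,
                   _depends_on_errors='ignore')         # ignore errors

# Wait for job A to reach a terminal state (it will error).
while cloud.status(jid_a) not in ('error', 'done'):
    time.sleep(1)

print cloud.status(jid_b)   # expected: 'stalled'
print cloud.status(jid_c)   # expected to run and finish normally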

Use Case

Even in cases where dependencies are not strictly necessary, they should be used. Consider the following:

import time
import cloud

def job1():
    time.sleep(30)
    return 1

def job2(jid):
    return cloud.result(jid)

if __name__ == '__main__':
    jid1 = cloud.call(job1)
    jid2 = cloud.call(job2, jid1, _depends_on=jid1)
    ret = cloud.result(jid2)  # same value as job1's result

If _depends_on were not present, job2 would run simultaneously with job1. Assuming no queuing delay, job2 would spend the full 30 seconds that job1 is running simply waiting for its result, doubling your total computation bill.

Failover

No computing infrastructure is immune to hardware failure, including the infrastructure PiCloud runs on. There is an extremely small chance that a machine will fail while processing a job. Most users will never see such a failure, but if you are running millions of jobs, the chance of one being affected rises.

By default, PiCloud assumes it can safely restart jobs that fail while executing. However, this is not always true; a failed job may be manipulating some form of external state (e.g. writing to a database) and blindly restarting the job could cause data corruption.

If your job writes to an external source and you cannot design your job to recover from failure, you may wish to set the _restartable keyword to False. A hardware failure will then result in the job being given a status of error, rather than the job being restarted on another server.

Example:

>>> def foo():
...     """writes to 2 databases.  If a failure occurs in between the writes,
...     this job cannot be safely restarted"""
...
...     write_to_database_1()
...     write_to_database_2()

>>> # if foo() fails due to hardware failure, the job's result will be an exception
>>> cloud.call(foo, _restartable=False)

Example 2:

>>> def square(x):
...     return x*x

>>> # _restartable=True can be omitted since it's the default
>>> cloud.call(square, 10, _restartable=True)

In the shell,

$ picloud exec --not-restartable program

Listening on Ports

In general, the hostname of the server that a job runs on is not revealed to the user. However, if a job opens a listening port, it is necessary to know the publicly accessible hostname of the server. Moreover, our systems introduce a NAT layer, so if your job opens port 8080, you may have to access it externally through a different port, such as 20100.

Thus, to get both the hostname and port for the listening socket, you’ll need to use the ports key from Query Job Information. The following example will demonstrate:

from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

def open_socket():
    """Opens port 8000 and listens for HTTP request. Only one
    request is handled before function returns."""

    httpd = HTTPServer(('', 8000), BaseHTTPRequestHandler)
    httpd.handle_request()

If you run open_socket() locally, you can open a browser and go to http://localhost:8000 to see the webpage served by HTTPServer. Note that you’ll see a 501 error page since we’re using the BaseHTTPRequestHandler, which does not know how to handle a GET request.

Now run open_socket() as a job:

>>> jid = cloud.call(open_socket)
>>> cloud.status(jid)
'processing'
>>> # once the job is processing, use cloud.info
>>> cloud.info(jid)
{1468537L: {'ports':
              {'tcp':
                 {8000: {'address': 'ec2-50-17-139-223.compute-1.amazonaws.com',
                         'port': 20501}},
               'udp': {}}}}

The keys for the tcp dictionary are the ports that are being listened on. The values are another dictionary specifying the hostname and externally accessible port. As you can see, to access the listening socket on port 8000, you would go to http://ec2-50-17-139-223.compute-1.amazonaws.com:20501.

A shortcut function exists to block until a job has opened the desired port:

>>> cloud.shortcuts.get_connection_info(jid, 8000)
{'address': 'ec2-50-17-139-223.compute-1.amazonaws.com', 'port': 20501}
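
Continuing the example, here is a minimal client-side sketch that uses the returned address and port to make a request; the 501 response is expected, since BaseHTTPRequestHandler does not implement GET.

import urllib2
import cloud

jid = cloud.call(open_socket)
conn = cloud.shortcuts.get_connection_info(jid, 8000)  # blocks until port 8000 is open
url = 'http://%s:%d/' % (conn['address'], conn['port'])

try:
    urllib2.urlopen(url)
except urllib2.HTTPError as e:
    print e.code   # 501, just as when running open_socket() locally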

You may have noticed that your ports info request also returned that port 22 (SSH) is open. See how to SSH into a Job for more information.

SSH into a Job

While a job is running, you can SSH into the system that it is running on (Where does a Job Run?).

To demonstrate, we can run a job that sleeps for 100 seconds.

$ picloud exec sleep 100
[jid]

On a Linux or Mac machine, we can SSH into the system as follows:

$ picloud ssh [jid]

picloud ssh blocks until the job is processing. You’ll notice that your terminal is now SSH-ed into the system as a regular user with name empX. Feel free to explore.

emp2@c-2:~$ pwd
/home/picloud
emp2@c-2:~$ whoami
emp2

You can even run Python, or any other program made available by default or by a custom Environment.

Warning

Do not interfere with job_task.py

emp2@c-2:~$ ps ax | grep python
143 ?        Ss     0:00 python /usr/local/picloud/.employee/job_task.py

The job_task.py process is a special process that facilitates the execution of your job. Terminating or interfering with it will result in your job being instantly killed and an incident report being logged.

Once a job terminates, in this case after 100 seconds, your SSH connection to the server will be closed.

Convenience Function

If you wish to create a simple job you can SSH into, perhaps to test your environment and volume configuration, you can run picloud exec-shell. exec-shell takes the standard picloud exec parameters, allowing you to specify core type, environment, volumes, etc.

picloud exec-shell will automatically create a job and SSH you into it. Once no SSH connections remain active, the job will automatically be terminated (run picloud exec-shell --keep-alive to disable this termination behavior).