Scientific Tools for Python: numpy, scipy, pandas, and more

We’ve pre-installed the latest scientific tools for Python on PiCloud. This means that in most cases you can use them with zero setup.

Example

To demonstrate, we’ll offload a function that sums the values in a numpy array.

>>> import cloud
>>> import numpy

>>> def f(A):
...    return numpy.sum(A)

>>> B = numpy.array([1,2,3,4,5])

>>> jid = cloud.call(f, B)
>>> cloud.result(jid)
15

As you can see, offloading f required no more effort than offloading any other function on PiCloud.

Under the Hood

You may be wondering whether we used our Automagic Dependency Transfer system to transfer your local numpy package to PiCloud. We did not, because we detected that numpy is pre-installed on PiCloud. Instead, only the numpy array, B, is serialized and sent to PiCloud. Since we already have numpy, deserialization works without a hitch.
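
If you are curious how much data a particular call will transmit, you can get a rough local estimate by pickling the arguments yourself. The standard pickle module is only an approximation of what our serializer sends, but it is close enough to tell a few hundred bytes from several megabytes:

>>> import pickle
>>> import numpy

>>> B = numpy.array([1,2,3,4,5])

# approximate size, in bytes, of the serialized argument
>>> len(pickle.dumps(B, protocol=2))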

What Scientific Libraries are Installed?

We have dozens of scientific Python packages installed by default. Check the contents of the Base Environment relevant to you to see what is installed, including version numbers; you can also query this from within a job, as sketched after the list below. Notable packages include:

  • biopython
  • h5py
  • matplotlib
  • nltk
  • numpy
  • pandas
  • rpy2
  • scikit-learn
  • scikit-image
  • statsmodels
  • scipy
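
The Base Environment page is the authoritative list, but you can also ask a job directly which packages it sees. Here is a minimal sketch, assuming pkg_resources (bundled with setuptools) is available to the job:

>>> import cloud
>>> import pkg_resources

>>> def list_packages():
...     # collect (name, version) pairs for every package visible to the job
...     return sorted((d.project_name, d.version) for d in pkg_resources.working_set)

>>> jid = cloud.call(list_packages)
>>> installed = dict(cloud.result(jid))
>>> installed.get('numpy')   # version string of the pre-installed numpy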

Intel Math Kernel Library

Our default versions of numpy and scipy are custom-compiled to use the Intel Math Kernel Library (MKL). MKL provides significant performance improvements for certain operations, particularly on the hyperthreading-enabled f2 core.
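
If you would like to gauge the benefit yourself, a rough check is to time a dense matrix multiply inside a job. This is an illustrative sketch only: timings depend on matrix size and core type, and it assumes the _type keyword for requesting an f2 core.

>>> import cloud
>>> import numpy
>>> import time

>>> def time_matmul(n=2000):
...     # multiply two n x n random matrices and return the elapsed seconds
...     a = numpy.random.rand(n, n)
...     b = numpy.random.rand(n, n)
...     start = time.time()
...     numpy.dot(a, b)
...     return time.time() - start

# request an f2 core, where MKL's gains are most visible
>>> jid = cloud.call(time_matmul, _type='f2')
>>> cloud.result(jid)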

Missing a Package?

To install a library we do not have, you’ll need to create a custom Environment. This is the most common fix for users whose jobs fail with an error stating that package “X could not be imported”.

Changing Package Versions

To change the version of a library, create a custom Environment. In it, use pip uninstall to remove the package in question before installing your preferred version.
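
Once the custom Environment is built, it is worth confirming that jobs attached to it actually see your preferred version. A minimal sketch, using a hypothetical Environment named 'my_env' and the _env keyword to attach it:

>>> import cloud

>>> def pandas_version():
...     import pandas
...     return pandas.__version__

# 'my_env' is a placeholder for your custom Environment's name
>>> jid = cloud.call(pandas_version, _env='my_env')
>>> cloud.result(jid)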

Common Pitfalls

Passing Too Much Data

A common mistake is to pass over 1 MB of data to cloud.call. To demonstrate the issue, assume the same f function from earlier that sums the values of a numpy array:

# array with 1 million int64 entries (about 8 MB, well over the 1 MB limit)
>>> A = numpy.arange(1000000)

>>> jid = cloud.call(f, A)
CloudException: Excessive data (8000158 bytes) transmitted.
Snapshot of what you attempted to send:
<PickledObject type='tuple' size='8000155' memo_id='15' numElements='1' containedType='numpy.ndarray'>
    <Element entryNum='0' type='numpy.ndarray' size='8000152' memo_id='5'>
       ...
    </Element>
</PickledObject>

We disallow sending large objects because, in most cases, users are unknowingly passing an object with many references to other objects. The serialized output becomes huge, which adds significant overhead to sending the data to PiCloud.

If you decide that you actually need to send this much data, increase max_transmit_data in cloudconf.py (maximum 16 MB). In general, we recommend using your Bucket instead; see the section on Using an Object in a Job.
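
As a sketch of the Bucket-based approach (the exact put/get signatures are covered in the Bucket documentation), write the array to a file, upload it once, and have the job download it by name instead of receiving it as a serialized argument:

>>> import cloud
>>> import numpy

>>> A = numpy.arange(1000000)
>>> numpy.save('A.npy', A)        # write the array to a local file

# upload the file to your Bucket under the name 'A.npy'
>>> cloud.bucket.put('A.npy')

>>> def sum_from_bucket():
...     # fetch the object inside the job, then load and reduce it
...     cloud.bucket.get('A.npy')
...     return numpy.sum(numpy.load('A.npy'))

>>> jid = cloud.call(sum_from_bucket)
>>> cloud.result(jid)
499999500000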

Ignoring Batch Operations

A common use case is to run the same function across many different inputs. Instead of calling cloud.call repeatedly, you should use cloud.map as explained in Mapping.

Likewise, instead of calling cloud.result on each job individually, you should use Batch Queries.
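
To make the contrast concrete, here is a minimal sketch reusing the earlier f function: a single cloud.map creates all of the jobs at once, and passing the whole list of jids to cloud.result issues one batch query instead of one round trip per job.

>>> arrays = [numpy.arange(i) for i in range(1, 101)]   # 100 small inputs

# one map call creates 100 jobs
>>> jids = cloud.map(f, arrays)

# one batch query retrieves all 100 results
>>> results = cloud.result(jids)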

Users have achieved orders of magnitude speed gains by using batch operations.

Reducing Result Sets Locally

Another use case is reducing the results of jobs:

>>> jids = cloud.map(f, datapoints) # assume it creates 1000 jobs
>>> results = cloud.result(jids)
>>> process_results(results)

The above is inefficient since the results of all your jobs are downloaded to your local machine. Depending on the size of your result sets, this can take a long time.

Instead, you should execute process_results on the cloud, where the results can be fetched 10-50x faster.

>>> def process_results_on_cloud(jids):
...     results = cloud.result(jids)
...     return process_results(results)

>>> jids = cloud.map(f, datapoints)

# runs reduction on cloud
>>> reducer_jid = cloud.call(process_results_on_cloud, jids, _depends_on=jids)

# output of process_results()
>>> cloud.result(reducer_jid)

Note that we use the _depends_on keyword to ensure the reduction step does not begin until all jobs from the map have been completed. This is explained in Dependencies.