We’ve pre-installed the latest scientific tools for Python on PiCloud. This means that in most cases you can use them with zero setup.
To demonstrate, we’ll offload a function that sums the values in a numpy array.
>>> import cloud
>>> import numpy
>>> def f(A):
...     return numpy.sum(A)
...
>>> B = numpy.array([1,2,3,4,5])
>>> jid = cloud.call(f, B)
>>> cloud.result(jid)
15
As you can see, offloading f required no more effort than offloading any other function on PiCloud.
You may be wondering whether we used our Automagic Dependency Transfer system to transfer your local numpy package to PiCloud. We did not, since we detected that numpy is pre-installed on PiCloud. Instead, only the numpy array, B, is serialized and sent to PiCloud. Since we already have numpy, deserialization works without a hitch.
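To see what actually travels over the wire, you can reproduce the serialization step locally with the standard pickle module. This is only a rough sketch of the principle (PiCloud's serializer is more sophisticated than plain pickle): the array's data is what gets serialized, and deserializing it on the other side only requires that numpy itself already be installed there.

```python
import pickle
import numpy

B = numpy.array([1, 2, 3, 4, 5])

# Serialize only the array's data; the numpy package itself is not included.
payload = pickle.dumps(B)

# Deserialization succeeds on any machine that has numpy installed.
restored = pickle.loads(payload)
print(numpy.sum(restored))  # 15
```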
We have dozens of scientific Python packages installed by default. Check the contents of the Base Environment relevant to you to see what is installed, including version numbers. Notable packages include:
- scikits.learn
- scikits.image
- scikits.statsmodels
Our default versions of numpy and scipy are custom-compiled to use the Intel Math Kernel Library (MKL). MKL provides significant performance improvements for certain operations, particularly on the hyperthreading-enabled f2 core.
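If you want to verify which BLAS/LAPACK backend a given numpy build uses (for instance, from inside a job), numpy can report its own build configuration. A minimal sketch; on an MKL-linked build, the MKL libraries appear in the output:

```python
import numpy

# Print the BLAS/LAPACK libraries this numpy build was compiled against.
# A build linked against MKL will list the MKL libraries here.
numpy.show_config()
```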
To install a library we do not have, you’ll need to create a custom Environment. This is the most common solution for users whose jobs fail with the error that package “X could not be imported”.
A common mistake is to pass over 1 MB of data to cloud.call. To demonstrate the issue, assume the same f function from earlier that sums the values of a numpy array:
# array with 1 million entries (> 1 MB)
>>> A = numpy.arange(1000000)
>>> jid = cloud.call(f, A)
CloudException: Excessive data (8000158 bytes) transmitted.
Snapshot of what you attempted to send:
<PickledObject type='tuple' size='8000155' memo_id='15' numElements='1' containedType='numpy.ndarray'>
  <Element entryNum='0' type='numpy.ndarray' size='8000152' memo_id='5'>
  ...
  </Element>
</PickledObject>
We disallow sending large objects because, in most cases, users are unknowingly passing an object that holds many references to other objects. The serialized output becomes huge, which adds significant overhead when sending the data to PiCloud.
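You can estimate how much data a call will transmit by pickling the arguments yourself before offloading. This is only a rough local check with the standard pickle module (PiCloud's own serializer reports the exact figure in the exception above); the dtype is pinned to int64 here so the size matches the failing example regardless of platform defaults:

```python
import pickle
import numpy

# Roughly the same array as the failing call above: 1 million int64 entries.
A = numpy.arange(1000000, dtype=numpy.int64)

payload_size = len(pickle.dumps(A))
print(payload_size)  # just over 8,000,000 bytes -- well past the default limit
```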
If you decide that you actually need to send this much data, increase max_transmit_data in cloudconf.py (max 16 MB). In general, though, we recommend using your Bucket; see the section on Using an Object in a Job.
Likewise, instead of calling cloud.result on each job individually, you should use Batch Queries.
Users have achieved orders of magnitude speed gains by using batch operations.
Another use case is reducing the results of jobs:
>>> jids = cloud.map(f, datapoints)  # assume it creates 1000 jobs
>>> results = cloud.result(jids)
>>> process_results(results)
The above is inefficient since the results of all your jobs are downloaded to your local machine. Depending on the size of your result sets, this can take a long time.
Instead, you should execute process_results on the cloud, where the results are fetched 10-50x faster.
>>> def process_results_on_cloud(jids):
...     results = cloud.result(jids)
...     process_results(results)
...
>>> jids = cloud.map(f, datapoints)
# runs reduction on cloud
>>> reducer_jid = cloud.call(process_results_on_cloud, jids, _depends_on=jids)
# output of process_results()
>>> cloud.result(reducer_jid)
Note that we use the _depends_on keyword to ensure the reduction step does not begin until all jobs from the map have been completed. This is explained in Dependencies.