Bucket

Your bucket gives you a key-value interface for storing and retrieving data objects up to 5 GB in size. You can interact with your bucket from Python with cloud.bucket, or from the shell with picloud bucket.

Use Cases

In addition to the reasons outlined in Why Store Data on the Cloud?, your bucket is a great fit if:

  1. You want an easy way to store and retrieve objects.
  2. You want a way to share data publicly with a URL.
  3. You want fast store and retrieve performance for single objects.

Storing an Object

The simplest way to put data in your bucket is by uploading a file already on your local file system. In Python, use cloud.bucket.put():

>>> cloud.bucket.put('your_file.txt', 'obj_path')

If obj_path is omitted, the file_path (your_file.txt) is used as the object path (key name). The function blocks until the file has been uploaded, which may take some time for large files.

Note

Uploads are handled atomically (all or nothing). If your machine crashes during upload, the object will not appear in your bucket.

The shell equivalent is:

$ picloud bucket put your_file.txt obj-path

obj-path must be specified when using picloud bucket.

Storing only When Changed

If you have a large local file, you may want to check whether the bucket copy differs before re-uploading it. Use sync_to_cloud:

>>> cloud.bucket.sync_to_cloud('your_file.txt', 'obj_path')
$ picloud bucket sync_to_cloud your_file.txt obj-path

Storing from Memory

In Python, you can upload directly from memory: cloud.bucket.putf() accepts a string or a file-like object (anything supporting a read() method) as the object's contents.

Here’s an example of creating an object with path README with contents specified as a string in Python:

>>> cloud.bucket.putf('This is the content of the file.\n', 'README')

Warning

If the file-like object does not support seek(), it will be buffered in memory to determine its size.
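
For example, a StringIO object supports both read() and seek(), so it is streamed without any extra buffering (the object path data/blob here is purely illustrative):

>>> from StringIO import StringIO
>>> cloud.bucket.putf(StringIO('contents built in memory'), 'data/blob')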

Storing a Python Object

To store a Python object, you'll first need to convert it into a string, a process known as serialization. You can do this using pickle from the Python standard library.

>>> import cPickle as pickle
>>> cloud.bucket.putf(pickle.dumps(obj), 'OBJ_KEY')
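
To restore the object later, one approach is to download it to a file with get (described under Retrieving an Object) and unpickle it; the file name obj.pkl is illustrative:

>>> cloud.bucket.get('OBJ_KEY', 'obj.pkl')
>>> obj = pickle.load(open('obj.pkl', 'rb'))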

Listing and Organizing Objects

To see a list of objects in your bucket in lexical order, use list:

>>> cloud.bucket.list()
['data1', 'data2', 'data3']
$ picloud bucket list
data1
data2
data3

Using Hierarchies

Your bucket is not a traditional filesystem; there is no concept of directories. However, you can use slashes (/) to group and organize your objects, which will make things easier when Searching by Prefix.

For example, you could separate objects into two classes by having all keys begin with classA or classB followed by a /:

classA/blue
classA/green
classA/red
classB/black
classB/white

Namespacing with Prefix

To make it easier to work with ‘/’ separators, almost all functions in cloud.bucket include an optional prefix keyword. If present, the prefix + ‘/’ is prepended to the obj_path to form an effective_obj_path. As an example, to put a file named blue into your bucket with key classA/blue, you can do:

>>> cloud.bucket.put('/path/to/blue', prefix='classA')
$ picloud bucket put --prefix=classA /path/to/blue blue

Prefixes provide a powerful way to namespace your objects. To ensure that object paths in one project never conflict with those of another, use a unique prefix for each project.
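
For example, a hypothetical project1 prefix keeps that project's keys cleanly separated:

>>> cloud.bucket.put('results.csv', prefix='project1')   # stored as project1/results.csv
>>> cloud.bucket.list('project1')
['project1/results.csv']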

Searching by Prefix

Assume the bucket contains the class objects from the previous section. To get a list of only classB objects, use the prefix option:

>>> cloud.bucket.list('classB')
['classB/black', 'classB/white']

In the shell,

$ picloud bucket list --prefix classB
classB/black
classB/white

Note that the prefix searched for does not have to be followed by a slash. Any arbitrary string prefix will work:

>>> cloud.bucket.list('classA/gre')
['classA/green']

Truncation

list will never return more than the requested max_keys (up to 1,000), and will sometimes return even fewer. In Python, if list did not return all results, the resultant list's truncated attribute will be set to True. In the CLI, truncation is indicated by the message "...Results are truncated..." on stderr.

There are two ways to handle truncation:

  1. Use iterlist in lieu of list. iterlist guarantees that all results will be returned. Beware that because iterlist is internally making many serial requests to PiCloud, iterating over a large number of results will take a long time. In Python, cloud.bucket.iterlist() will return an iterator, as the name suggests.
  2. If an iterator is not a natural choice (say, for pagination), make use of markers.

Using Markers

To obtain further keys, make a subsequent call to cloud.bucket.list() with marker set to the last item returned in the previous results. Here’s an example of a query with a marker:

>>> cloud.bucket.list(marker='data2')
['data3']
$ picloud bucket list --marker data2
data3

data3 is the only key returned, since traversal begins after (and does not include) data2. See the source code of cloud.bucket.iterlist() for an example of a complete traversal with list.
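
For reference, here is a minimal sketch of such a complete traversal, essentially what iterlist does internally. It assumes the truncated attribute described above is False once all results have been returned:

>>> def list_all(prefix=None):
...     marker = None
...     while True:
...         results = cloud.bucket.list(prefix, marker=marker)
...         for key in results:
...             yield key            # hand back each key as we page through
...         if not results or not results.truncated:
...             return               # final page reached
...         marker = results[-1]     # continue after the last key seen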

Retrieving an Object

An object can be retrieved from any machine using get. In Python, use cloud.bucket.get():

>>> cloud.bucket.get('obj_path', 'file_path')

The bucket object at obj_path will be downloaded and stored on the filesystem as file_path. If file_path is omitted, obj_path is used as the file_path.

In the shell,

$ picloud bucket get obj_path file_path

file_path must be specified. To use (the basename of) obj_path as the file_path, set file_path to a single period (.).

Retrieving only When Changed

If you have a large bucket object, you may want to check whether your local filesystem copy differs before re-downloading it. Use sync_from_cloud:

>>> cloud.bucket.sync_from_cloud('obj_path', 'file_path')
$ picloud bucket sync_from_cloud obj_path file_path

Retrieving into Memory

In Python only, you can stream data from your bucket to your machine with cloud.bucket.getf():

>>> f = cloud.bucket.getf('obj_path')

f is a cloud.bucket.CloudBucketObject, a file-like object. It supports reading a fixed number of bytes read(n), and can be used as an iterator. Data is streamed from your bucket in chunks only on an as-needed basis.

Warning

If no data is fetched from your bucket for over 60 seconds, the connection will automatically time out. Further calls on the CloudBucketObject will then raise exceptions.
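
For example, to stream a large object to a local file in fixed-size chunks using only the documented read(n) method (the 64 KB chunk size and local_copy file name are arbitrary choices):

>>> f = cloud.bucket.getf('obj_path')
>>> with open('local_copy', 'wb') as out:
...     chunk = f.read(65536)
...     while chunk:
...         out.write(chunk)
...         chunk = f.read(65536)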

Selecting a Byte Range

You can choose to retrieve only a select portion of an object, specified by the start byte position and end byte position.

>>> cloud.bucket.get('obj_path', 'file_path', start_byte=10, end_byte=100)
>>> f = cloud.bucket.getf('obj_path', start_byte=10, end_byte=100)

The above commands retrieve only bytes 10-100 inclusive (a total of 91 bytes), either to a file (get) or to a file-like object (getf).

In the shell,

$ picloud bucket get --start-byte=10 --end-byte=100 obj_path .

Note

Neither sync_from_cloud nor put supports byte ranges.
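
Byte ranges also make it possible to resume an interrupted download by hand. Below is a sketch combining getf with info (described in the next section); remember that start_byte and end_byte are inclusive:

>>> import os
>>> done = os.path.getsize('file_path')            # bytes already on disk
>>> total = cloud.bucket.info('obj_path')['size']  # full object size
>>> if done < total:
...     f = cloud.bucket.getf('obj_path', start_byte=done, end_byte=total - 1)
...     with open('file_path', 'ab') as out:
...         out.write(f.read(total - done))        # append the missing tail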

Obtaining Information about an Object

Use info to obtain metadata about an object, including its size, creation and modification times, and md5 sum:

>>> cloud.bucket.info('obj_path')
{u'cache-control': None,
 u'content-disposition': None,
 u'content-encoding': u'None',
 u'content-type': u'application/octet-stream',
 u'created': u'Fri, 12 Oct 2012 05:02:16 GMT',
 u'last-modified': u'Wed, 05 Dec 2012 09:14:06 GMT',
 u'md5sum': u'9ea800a59d0b15eb7a6dbac0b7d582b5',
 u'public': True,
 u'size': 1267,
 u'url': u'https://s3.amazonaws.com/pi-user-buckets/kljdaslkjdas/obj_path'
 }
$ picloud bucket info obj_path
size: 1267
created: Fri, 12 Oct 2012 05:02:16 GMT
last-modified: Wed, 05 Dec 2012 09:14:06 GMT
md5sum: 9ea800a59d0b15eb7a6dbac0b7d582b5
public: True
url: https://s3.amazonaws.com/pi-user-buckets/kljdaslkjdas/obj_path
content-disposition: None
content-encoding: None
cache-control: None

Many of the fields relate to HTTP headers that are only relevant if the object is public.
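
One use for the md5sum field is a quick integrity check after a download. A minimal sketch (it reads the whole file into memory, so it suits only modestly sized objects):

>>> import hashlib
>>> cloud.bucket.get('obj_path', 'file_path')
>>> local_md5 = hashlib.md5(open('file_path', 'rb').read()).hexdigest()
>>> local_md5 == cloud.bucket.info('obj_path')['md5sum']
True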

Making an Object Publicly Accessible via HTTP

You can make any object publicly retrievable by anyone over HTTP. Just use make_public:

In Python,

>>> cloud.bucket.make_public('obj_path')
'https://s3.amazonaws.com/pi-user-buckets/[random folder]/[key name]'

The returned URL can now be used by anyone to download the contents of the bucket object obj_path.

In bash,

$ picloud bucket make-public obj_path
https://s3.amazonaws.com/pi-user-buckets/[random folder]/[key name]

You are responsible for the bandwidth consumed by users accessing the URL. To revoke public accessibility, use make_private:

>>> cloud.bucket.make_private('obj_path')
$ picloud bucket make-private obj_path

You can see a public object's URL at any time with the info query. Note that the URL is always your public URL folder concatenated with the object path. You can see your public URL folder with the cloud.bucket.public_url_folder() Python function or the picloud bucket public-url-folder shell command.
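
For example, reusing the folder from the sample info output above (the exact return format, such as the absence of a trailing slash, is an assumption here):

>>> cloud.bucket.public_url_folder()
'https://s3.amazonaws.com/pi-user-buckets/kljdaslkjdas'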

Controlling HTTP Headers

You may wish to control the HTTP headers that will be in the response to a request to the public URL. This can be done by specifying headers in the make_public call.

Possible standard HTTP headers are:

  • content-type
  • content-encoding
  • content-disposition
  • cache-control

All other headers are considered custom and will have x-amz-meta- prepended to them in the response.

Example:

>>> cloud.bucket.make_public('foo', headers={'content-type': 'text/x-python', 'purpose': 'basic_script'})
$ picloud bucket make-public -d content-type=text/x-python -d purpose=basic_script foo

The headers in the response to a request for https://s3.amazonaws.com/pi-user-buckets/kljdaslkjdas/foo will include:

  • content-type: text/x-python
  • x-amz-meta-purpose: basic_script

Use the info command to see what headers have been set. Clear all custom headers with the reset_headers parameter.
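
For instance, to have browsers download a public object as a file attachment rather than display it inline (the object name report.csv is illustrative):

>>> cloud.bucket.make_public('report.csv',
...                          headers={'content-type': 'text/csv',
...                                   'content-disposition': 'attachment; filename="report.csv"'})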

Using an Object in a Job

In addition to accessing your bucket the way any PiCloud client does, by getting objects (see Retrieving an Object), a job can read and write bucket objects directly through its filesystem: every object appears under the /bucket mount point.

Here’s a Python program (basic-examples/bucket/thumbnail.py) that creates a thumbnail of a locally-resident image called face.jpg using cloud.call and the bucket feature:

import cloud
import Image   # Python Imaging Library (PIL)
import os

def thumbnail(obj_path):
    """Creates a thumbnail of an image object in the bucket with path *obj_path*.
    Output is a new bucket object with 'thumb_' prepended."""

    thumbnail_filename = 'thumb_' + obj_path

    # open the image directly from your bucket by accessing the /bucket mount
    img = Image.open(os.path.join('/bucket', obj_path))
    img.thumbnail((100, 100), Image.ANTIALIAS)

    # save the image directly to your bucket by writing to a file within /bucket
    img.save(os.path.join('/bucket', thumbnail_filename), 'JPEG')

# put face.jpg into your bucket
cloud.bucket.put('face.jpg')

# run thumbnail() on the cloud
jid = cloud.call(thumbnail, 'face.jpg')

# wait for job to finish
cloud.join(jid)

# download image
cloud.bucket.get('thumb_face.jpg')

When this script is run, everything in the body of thumbnail() executes not locally, but on PiCloud.

Removing an Object

In Python, use cloud.bucket.remove():

>>> cloud.bucket.remove('your_file.txt')

The shell equivalent is:

$ picloud bucket remove your_file.txt

In Python, you can also pass a list of object paths to remove:

>>> cloud.bucket.remove(['your_file.txt', 'your_file2.txt'])

Removing all Objects

With cloud.bucket.remove_prefix(), you can remove all objects whose paths begin with a given prefix (or even every object in your bucket, if prefix is set to the empty string '').

To remove every object whose path begins with my_directory/ (that is, everything contained within my_directory):

>>> cloud.bucket.remove_prefix('my_directory/')

The shell equivalent is:

$ picloud bucket remove-prefix my_directory/

Warning

Your bucket is used by our system (notebook, queues, etc.) to store information. Be wary of wiping out all of your bucket objects!
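
When in doubt, preview exactly what a prefix matches before removing anything, for instance with iterlist:

>>> for key in cloud.bucket.iterlist('my_directory/'):
...     print key        # review the list before calling remove_prefix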

Local vs. Cloud Performance

When using buckets from outside of PiCloud, performance depends heavily on your network connection. That said, the following rates are reasonable expectations:

Read: 0.5 - 1 MB/s
Write: 0.5 - 1 MB/s

Performance is significantly better when using buckets from within PiCloud, or the Amazon us-east-1 datacenter. The following rates should be expected:

Read: 8 MB/s
Write: 4 MB/s

Eventual Consistency

As your bucket is stored in Amazon S3, which is eventually consistent, your objects may take a few seconds to become visible after a put. In particular, if you replace the contents of an existing object with put, a subsequent get may temporarily return old data. For more information on eventual consistency, please see this guide (note that buckets reside in the US-Standard S3 region).
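
If your code must read an object immediately after writing it, one crude workaround is to poll until the new key appears. This is only a sketch; listing is itself eventually consistent, so polling narrows the window but cannot eliminate it:

>>> import time
>>> cloud.bucket.put('your_file.txt')
>>> while 'your_file.txt' not in cloud.bucket.list('your_file.txt'):
...     time.sleep(1)    # wait out the propagation delay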