Advanced usage

This repository is designed for both high performance and a low memory footprint. It therefore provides bulk operations and lets you access objects as streams. We strongly suggest using these methods if you use disk-objectstore as a library, unless you are absolutely sure that your objects always fit in memory and that you never need to access tens of thousands of objects or more.

Bulk access#

We continue from the commands of the basic usage. We can get the content of several objects at once:

container.get_objects_content([hash1, hash2])
# Output: {'6a96df63699b6fdc947177979dfd37a099c705bc509a715060dbfd3b7b605dbe': b'some_content',  'cfb487fe419250aa790bf7189962581651305fc8c42d6c16b72384f96299199d': b'some_other_content'}

For many objects (especially if they are packed), retrieving them in bulk can be orders of magnitude faster than retrieving them one at a time.
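When the list of hash keys is very long, one way to keep the benefit of bulk calls while limiting how much content is held in memory at once is to query in batches. The following is only a sketch: the helper name `iter_contents_in_batches` and the default `batch_size` are hypothetical and not part of the library. It assumes only that `container.get_objects_content` accepts a list of hash keys and returns a dictionary, as shown above.

```python
def iter_contents_in_batches(container, hashkeys, batch_size=1000):
    """Yield (hashkey, content) pairs, issuing one bulk call per batch.

    Hypothetical helper: only the contents of the current batch are
    kept in memory by this function at any given time.
    """
    hashkeys = list(hashkeys)
    for start in range(0, len(hashkeys), batch_size):
        batch = hashkeys[start:start + batch_size]
        # One bulk call for the whole batch instead of one call per object
        yield from container.get_objects_content(batch).items()


# Possible usage (assumes `container`, `hash1` and `hash2` from the basic usage):
# for hashkey, content in iter_contents_in_batches(container, [hash1, hash2]):
#     print(hashkey, len(content))
```

The best batch size depends on the typical object size and needs some benchmarking.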

Using streams#

Interface#

First, let’s look at the interface:

with container.get_object_stream(hash1) as stream:
    print(stream.read())
# Output: b'some_content'

For bulk access, the syntax is a bit more convoluted (the reason is efficiency, as discussed below):

with container.get_objects_stream_and_meta([hash3, hash1, hash2]) as triplets:
    for hashkey, stream, meta in triplets:
        print("Meta for hashkey {}: {}".format(hashkey, meta))
        print("  Content: {}".format(stream.read()))

whose output is:

Meta for hashkey 6a96df63699b6fdc947177979dfd37a099c705bc509a715060dbfd3b7b605dbe: {'type': 'packed', 'size': 12, 'pack_id': 0, 'pack_compressed': False, 'pack_offset': 0, 'pack_length': 12}
  Content: b'some_content'
Meta for hashkey cfb487fe419250aa790bf7189962581651305fc8c42d6c16b72384f96299199d: {'type': 'packed', 'size': 18, 'pack_id': 0, 'pack_compressed': False, 'pack_offset': 12, 'pack_length': 18}
  Content: b'some_other_content'
Meta for hashkey d1e4103ce093e26c63ce25366a9a131d60d3555073b8424d3322accefc36bf08: {'type': 'loose', 'size': 13, 'pack_id': None, 'pack_compressed': None, 'pack_offset': None, 'pack_length': None}
  Content: b'third_content'

Important

As you can see above, the order of the triplets IS NOT the order in which you passed the hash keys to get_objects_stream_and_meta (here hash3 was passed first but is returned last). The reason is efficiency: the library tries to keep each (pack) file open for as long as possible and to read it in order, so as to exploit disk caches efficiently.
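If you need the results in the order in which you requested them, you can collect them by hash key while iterating and reorder afterwards. The sketch below uses a hypothetical helper, `contents_in_request_order`, which is not part of the library. It reads every object fully into memory, so it is only appropriate when all objects are known to be small. Hash keys that do not appear among the triplets are mapped to `None`, which is a choice made for this example only.

```python
def contents_in_request_order(requested_hashkeys, triplets):
    """Return the object contents as a list, in the order of requested_hashkeys.

    Hypothetical helper: `triplets` is any iterable of
    (hashkey, stream, meta) tuples, such as the one shown above.
    """
    by_hashkey = {}
    for hashkey, stream, _meta in triplets:
        # Read right away, while this triplet is the current one
        by_hashkey[hashkey] = stream.read()
    # Hash keys not seen among the triplets map to None (example-only choice)
    return [by_hashkey.get(hashkey) for hashkey in requested_hashkeys]


# Possible usage (assumes `container` and the hash keys from the examples above):
# requested = [hash3, hash1, hash2]
# with container.get_objects_stream_and_meta(requested) as triplets:
#     contents = contents_in_request_order(requested, triplets)
```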

Memory-savvy approach#

If you don’t know the size of an object, you should not simply call stream.read() (in that case you could just as well have called get_object_content()): if the object does not fit in memory, your application will crash. Instead, read the object in chunks and process it chunk by chunk.

A very simple pattern:

# The optimal chunk size depends on your application and needs some benchmarking
CHUNK_SIZE = 100000
with container.get_object_stream(hash1) as stream:
    chunk = stream.read(CHUNK_SIZE)
    while chunk:
        # process chunk here
        # E.g. write to a different file, pass to a method to compress it, ...
        chunk = stream.read(CHUNK_SIZE)

You can find various examples of this pattern in the utility wrapper classes in disk_objectstore.utils.
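As a concrete instance of the chunked pattern above, the following sketch computes a SHA-256 checksum of a stream without ever holding more than one chunk in memory. The function name `sha256_of_stream` is hypothetical; it works with any file-like object that has a `read(size)` method, including the stream returned by `get_object_stream`.

```python
import hashlib

# The optimal chunk size depends on your application and needs some benchmarking
CHUNK_SIZE = 100000


def sha256_of_stream(stream, chunk_size=CHUNK_SIZE):
    """Return the hex SHA-256 digest of a binary stream, read chunk by chunk."""
    hasher = hashlib.sha256()
    chunk = stream.read(chunk_size)
    while chunk:
        # Only the current chunk is in memory at this point
        hasher.update(chunk)
        chunk = stream.read(chunk_size)
    return hasher.hexdigest()


# Possible usage (assumes `container` and `hash1` from the examples above):
# with container.get_object_stream(hash1) as stream:
#     print(sha256_of_stream(stream))
```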

Note also that if you use get_objects_stream_and_meta, meta['size'] tells you the size of the object before you start reading, so you can, for example, simply call .read() when you know the object is small.
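One way to act on the size information is to decide per object whether to read it in one go or in chunks. This is a minimal sketch: the generator `iter_object_chunks` and both thresholds are hypothetical and should be tuned for your application.

```python
def iter_object_chunks(stream, size, small_limit=1000000, chunk_size=100000):
    """Yield the content of `stream` as one or more bytes chunks.

    Objects of at most `small_limit` bytes are read with a single read();
    larger ones are read `chunk_size` bytes at a time.
    """
    if size <= small_limit:
        # Small object: a single read is simple and safe
        yield stream.read()
        return
    chunk = stream.read(chunk_size)
    while chunk:
        yield chunk
        chunk = stream.read(chunk_size)


# Possible usage (assumes `container` and the hash keys from the examples above):
# with container.get_objects_stream_and_meta([hash1, hash2]) as triplets:
#     for hashkey, stream, meta in triplets:
#         for chunk in iter_object_chunks(stream, meta['size']):
#             pass  # process chunk here
```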

Finally, when writing big objects, do not read the whole content into memory. Instead, use container.add_streamed_object(stream) (for loose objects) or container.add_streamed_objects_to_pack(stream_list) (to write directly to packs).
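For instance, to store files from disk without loading them into memory, you can pass open file handles as the streams. The two helper functions below are hypothetical wrappers around the methods named above. They assume that those methods accept binary file objects and return the hash key(s) of the stored objects; check this against the API documentation. The second helper keeps all files open at the same time, so it is meant for a modest number of files.

```python
from contextlib import ExitStack


def add_file_streamed(container, path):
    """Store the file at `path` as a loose object, streaming its content."""
    with open(path, "rb") as handle:
        # The container reads from the handle; we never call handle.read() ourselves
        return container.add_streamed_object(handle)


def add_files_to_pack_streamed(container, paths):
    """Store several files directly into packs, streaming their content."""
    with ExitStack() as stack:
        # ExitStack closes every opened file when the block exits
        streams = [stack.enter_context(open(path, "rb")) for path in paths]
        return container.add_streamed_objects_to_pack(streams)
```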