Advanced usage
This repository is designed for both performance and a low memory footprint.
It therefore provides bulk operations and the ability to access objects as streams.
We strongly recommend using these methods if you use disk-objectstore as a library,
unless you are absolutely sure that objects always fit in memory and that you never have to
access tens of thousands of objects or more.
Bulk access
We continue from the commands in the basic usage section. We can get the content of multiple objects at once:
container.get_objects_content([hash1, hash2])
# Output: {'6a96df63699b6fdc947177979dfd37a099c705bc509a715060dbfd3b7b605dbe': b'some_content', 'cfb487fe419250aa790bf7189962581651305fc8c42d6c16b72384f96299199d': b'some_other_content'}
For many objects (especially if they are packed), retrieving them in bulk can give an orders-of-magnitude speed-up.
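If the list of hash keys is very long, the dictionary returned by a single bulk call may itself become large. The following is a minimal sketch of one way to keep the benefit of bulk calls while bounding memory use, assuming only that the container exposes get_objects_content as shown above. The helper name iter_contents_in_batches and the default batch_size are hypothetical and not part of the library.

```python
from itertools import islice


def iter_contents_in_batches(container, hashkeys, batch_size=1000):
    """Yield (hashkey, content) pairs, fetching up to `batch_size` objects per bulk call.

    Hypothetical helper: `container` is assumed to expose
    get_objects_content(list_of_hashkeys) returning a dict {hashkey: bytes}.
    """
    iterator = iter(hashkeys)
    while True:
        # Take the next `batch_size` hash keys (fewer for the last batch)
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # One bulk call per batch instead of one call per object
        yield from container.get_objects_content(batch).items()
```

Only one batch of contents is held in memory at a time. The best batch size depends on the typical object size and needs some benchmarking.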
Using streams
Interface
First, let’s look at the interface:
with container.get_object_stream(hash1) as stream:
    print(stream.read())
# Output: b'some_content'
For bulk access, the syntax is a bit more convoluted (the reason is efficiency, as discussed below):
with container.get_objects_stream_and_meta([hash3, hash1, hash2]) as triplets:
    for hashkey, stream, meta in triplets:
        print("Meta for hashkey {}: {}".format(hashkey, meta))
        print(" Content: {}".format(stream.read()))
This produces the following output:
Meta for hashkey 6a96df63699b6fdc947177979dfd37a099c705bc509a715060dbfd3b7b605dbe: {'type': 'packed', 'size': 12, 'pack_id': 0, 'pack_compressed': False, 'pack_offset': 0, 'pack_length': 12}
Content: b'some_content'
Meta for hashkey cfb487fe419250aa790bf7189962581651305fc8c42d6c16b72384f96299199d: {'type': 'packed', 'size': 18, 'pack_id': 0, 'pack_compressed': False, 'pack_offset': 12, 'pack_length': 18}
Content: b'some_other_content'
Meta for hashkey d1e4103ce093e26c63ce25366a9a131d60d3555073b8424d3322accefc36bf08: {'type': 'loose', 'size': 13, 'pack_id': None, 'pack_compressed': None, 'pack_offset': None, 'pack_length': None}
Content: b'third_content'
Important
As you can see above, the order of the triplets IS NOT the order in which you passed the hash keys to
get_objects_stream_and_meta. The reason is efficiency: the library tries to keep a (pack) file open for as long as possible and to read it in order, to make efficient use of disk caches.
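Since the triplets arrive in an arbitrary order, code that cares about the requested order should key its results on the hash key rather than on the position. A minimal sketch, assuming only the interface shown above (a context manager yielding (hashkey, stream, meta) triplets, with meta containing a 'size' entry). The helper name sizes_in_requested_order is hypothetical.

```python
def sizes_in_requested_order(container, hashkeys):
    """Return [(hashkey, size), ...] in the order in which `hashkeys` was passed.

    Hypothetical helper: relies only on get_objects_stream_and_meta yielding
    (hashkey, stream, meta) triplets, in whatever order the library prefers.
    """
    sizes = {}
    with container.get_objects_stream_and_meta(hashkeys) as triplets:
        for hashkey, _stream, meta in triplets:
            # Store by hash key: the iteration order is not the requested order
            sizes[hashkey] = meta["size"]
    # Rebuild the requested order; hash keys that were not returned map to None
    return [(hashkey, sizes.get(hashkey)) for hashkey in hashkeys]
```

Only small per-object results (here, the size) are kept after the loop. The streams themselves should be consumed inside the loop, while the library has the underlying file open.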
Memory-savvy approach
If you don’t know the size of an object, do not simply call stream.read(): if the object does not fit in memory, your application will crash (and if reading everything at once were acceptable, you could just have called get_object_content()).
Instead, read the object in chunks and process it chunk by chunk.
A very simple pattern:
# The optimal chunk size depends on your application and needs some benchmarking
CHUNK_SIZE = 100000
with container.get_object_stream(hash1) as stream:
    chunk = stream.read(CHUNK_SIZE)
    while chunk:
        # process chunk here
        # E.g. write to a different file, pass to a method to compress it, ...
        chunk = stream.read(CHUNK_SIZE)
You can find various examples of this pattern in the utility wrapper classes in disk_objectstore.utils.
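As one concrete instance of the chunked pattern, the sketch below computes a SHA-256 checksum of a stream without ever holding more than one chunk in memory. It uses only the standard library. The function name sha256_of_stream is hypothetical, and io.BytesIO stands in for the stream you would obtain from container.get_object_stream.

```python
import hashlib
import io

# Illustrative value only; the optimal chunk size needs benchmarking
CHUNK_SIZE = 100000


def sha256_of_stream(stream, chunk_size=CHUNK_SIZE):
    """Return the hex SHA-256 digest of a binary stream, reading it chunk by chunk."""
    hasher = hashlib.sha256()
    # iter() with a sentinel keeps calling stream.read(chunk_size) until it returns b""
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


# io.BytesIO is a stand-in for a stream returned by the container
print(sha256_of_stream(io.BytesIO(b"some_content")))
```

The same loop structure works for any incremental consumer, such as a compressor object or a file opened for writing.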
Note also that if you use get_objects_stream_and_meta, you can use meta['size'] to know the size
of the object before starting to read, so you can, for example, simply call .read() if you know the object is small.
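A minimal sketch of this size-based decision, assuming meta is the dictionary shown in the output above (with a 'size' entry in bytes). The function name copy_object and the MAX_IN_MEMORY threshold are hypothetical choices for illustration.

```python
MAX_IN_MEMORY = 10 * 1024 * 1024  # hypothetical threshold (10 MiB); tune for your application
CHUNK_SIZE = 100000  # illustrative value; needs benchmarking


def copy_object(stream, meta, destination,
                max_in_memory=MAX_IN_MEMORY, chunk_size=CHUNK_SIZE):
    """Copy `stream` into the binary file-like `destination`; return the bytes written."""
    if meta["size"] <= max_in_memory:
        # Small object: a single read() is simplest
        data = stream.read()
        destination.write(data)
        return len(data)
    # Large object: never hold more than one chunk in memory
    written = 0
    chunk = stream.read(chunk_size)
    while chunk:
        destination.write(chunk)
        written += len(chunk)
        chunk = stream.read(chunk_size)
    return written
```

Inside the loop over the triplets you would call, for example, copy_object(stream, meta, output_file), where output_file is a file you opened in binary write mode.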
Finally, when writing big objects, instead of reading the whole content into memory, you should use
the methods container.add_streamed_object(stream) (loose objects) or add_streamed_objects_to_pack(stream_list)
(directly to packs).
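For instance, a file on disk can be stored by passing an open binary file handle as the stream. This is a sketch under the assumption that add_streamed_object accepts any object with a read() method returning bytes and that it returns the hash key of the new object. The helper name store_file is hypothetical.

```python
def store_file(container, path):
    """Add the file at `path` as a loose object without loading it fully into memory.

    Assumption: container.add_streamed_object(stream) reads from the stream
    and returns the hash key of the stored object.
    """
    # Open in binary mode: the container is expected to read bytes from the stream
    with open(path, "rb") as handle:
        return container.add_streamed_object(handle)
```

The file handle is closed as soon as the object has been added, so the helper can be called in a loop over many files.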