# Advanced usage

This repository is designed both for performance and for a low memory footprint. Therefore, it provides bulk operations and the possibility to access objects as streams. We **strongly suggest** using these methods if you use the `disk-objectstore` as a library, unless you are absolutely sure that objects always fit in memory, and you never have to access tens of thousands of objects or more.

## Bulk access

We continue from the commands of the basic usage. We can get the content of more than one object at once:

```python
container.get_objects_content([hash1, hash2])
# Output:
# {'6a96df63699b6fdc947177979dfd37a099c705bc509a715060dbfd3b7b605dbe': b'some_content',
#  'cfb487fe419250aa790bf7189962581651305fc8c42d6c16b72384f96299199d': b'some_other_content'}
```

For many objects (especially if they are packed), retrieving them in bulk can give a speed-up of orders of magnitude.

## Using streams

### Interface

First, let's look at the interface:

```python
with container.get_object_stream(hash1) as stream:
    print(stream.read())
# Output: b'some_content'
```

For bulk access, the syntax is a bit more convoluted (the reason is efficiency, as discussed below):

```python
with container.get_objects_stream_and_meta([hash3, hash1, hash2]) as triplets:
    for hashkey, stream, meta in triplets:
        print("Meta for hashkey {}: {}".format(hashkey, meta))
        print("  Content: {}".format(stream.read()))
```

whose output is:

```
Meta for hashkey 6a96df63699b6fdc947177979dfd37a099c705bc509a715060dbfd3b7b605dbe: {'type': 'packed', 'size': 12, 'pack_id': 0, 'pack_compressed': False, 'pack_offset': 0, 'pack_length': 12}
  Content: b'some_content'
Meta for hashkey cfb487fe419250aa790bf7189962581651305fc8c42d6c16b72384f96299199d: {'type': 'packed', 'size': 18, 'pack_id': 0, 'pack_compressed': False, 'pack_offset': 12, 'pack_length': 18}
  Content: b'some_other_content'
Meta for hashkey d1e4103ce093e26c63ce25366a9a131d60d3555073b8424d3322accefc36bf08: {'type': 'loose', 'size': 13, 'pack_id': None, 'pack_compressed': None, 'pack_offset': None, 'pack_length': None}
  Content: b'third_content'
```

```{important}
As you can see above, the order of the triplets **IS NOT** the order in which you passed the hash keys to `get_objects_stream_and_meta`. The reason is efficiency: the library will try to keep a (pack) file open as long as possible, and read it in order, to exploit disk caches efficiently.
```

### Memory-savvy approach

If you don't know the size of the object, you should not simply call `stream.read()` (in that case, you could have just called `get_object_content()`!): if the object does not fit in memory, your application will crash. Instead, you need to read and process the object chunk by chunk. A very simple pattern:

```python
# The optimal chunk size depends on your application and needs some benchmarking
CHUNK_SIZE = 100000

with container.get_object_stream(hash1) as stream:
    chunk = stream.read(CHUNK_SIZE)
    while chunk:
        # Process the chunk here, e.g. write it to a different file,
        # pass it to a method to compress it, ...
        chunk = stream.read(CHUNK_SIZE)
```

You can find various examples of this pattern in the utility wrapper classes in `disk_objectstore.utils`. Note also that if you use `get_objects_stream_and_meta`, you can use `meta['size']` to know the size of the object before starting to read, so you can e.g. simply call `.read()` when you know the size is small; this is shown in the sketch below.
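As an example, here is a minimal sketch combining the two ideas above: it streams each object to its own file, reading small objects in one go and large ones chunk by chunk. The output file names and the size threshold are hypothetical choices for illustration:

```python
# A minimal sketch: stream objects to disk without ever holding more than
# CHUNK_SIZE bytes of a large object in memory.
CHUNK_SIZE = 100000
SMALL_SIZE_THRESHOLD = 1000000  # hypothetical cutoff, tune for your application

with container.get_objects_stream_and_meta([hash1, hash2, hash3]) as triplets:
    for hashkey, stream, meta in triplets:
        # Hypothetical destination: one file per object, named after its hash key
        with open("{}.bin".format(hashkey), "wb") as fhandle:
            if meta['size'] <= SMALL_SIZE_THRESHOLD:
                # Small object: a single read() is safe
                fhandle.write(stream.read())
            else:
                # Large object: copy it chunk by chunk
                chunk = stream.read(CHUNK_SIZE)
                while chunk:
                    fhandle.write(chunk)
                    chunk = stream.read(CHUNK_SIZE)
```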
Finally, when writing objects, if they are big, you should avoid reading the whole content into memory: use instead the methods `container.add_streamed_object(stream)` (for loose objects) or `add_streamed_objects_to_pack(stream_list)` (to write directly to packs).
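For instance, here is a minimal sketch of adding objects directly from open file handles (the file names are hypothetical, and we assume the default parameters of the two methods):

```python
# A minimal sketch: add objects from open file handles, without reading
# the full content into memory (file names are hypothetical).
with open("large_file.bin", "rb") as fhandle:
    hashkey = container.add_streamed_object(fhandle)

# Or add several streams directly to a pack in one go
with open("file_a.bin", "rb") as stream_a, open("file_b.bin", "rb") as stream_b:
    hashkeys = container.add_streamed_objects_to_pack([stream_a, stream_b])
```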