Making backups
User instructions
A disk-objectstore container is fully contained in its root folder. If the container is not being modified, a backup can be made by simply copying this folder. The recommended way is to use the rsync tool, as the library was designed to perform well with it and to make use of its incremental copying capabilities.
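As a minimal sketch of such a copy for an idle container, the helper below builds and optionally runs an rsync call from Python. The function name copy_container, the dry_run parameter and the choice of the -a (archive) flag are assumptions made for illustration, not something the library prescribes.

```python
import shutil
import subprocess


def copy_container(container, dest, rsync_exe="rsync", dry_run=False):
    """Copy an idle container folder to dest with rsync; return the argv used."""
    # The trailing slash makes rsync copy the folder's contents rather than
    # the folder itself. '-a' (archive mode) is an assumed flag choice.
    cmd = [rsync_exe, "-a", container.rstrip("/") + "/", dest]
    if dry_run:
        # Only show what would be executed.
        return cmd
    if shutil.which(rsync_exe) is None:
        raise FileNotFoundError(f"rsync executable not found: {rsync_exe}")
    subprocess.run(cmd, check=True)
    return cmd
```

Running the same call again later only transfers what has changed, which is where the incremental copying of rsync pays off.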
However, the preferred way to make a backup, which is also safe while the container is in use (except when repacking or deleting files), is the built-in CLI command:
$ dostore backup --help
Usage: dostore backup [OPTIONS] DEST
Create a backup of the container to destination location DEST, in a
subfolder backup_<timestamp>_<randstr> and point a symlink called `last-
backup` to it.
NOTE: This is safe to run while the container is being used.
NOTE: the symlink `last-backup` is omitted if the filesystem doesn't support
it.
Destination (DEST) can either be a local path, or a remote destination
(reachable via ssh). In the latter case, remote destination needs to have
the following syntax:
[<remote_user>@]<remote_host>:<path>
i.e., contain the remote host name and the remote path, separated by a colon
(and optionally the remote user separated by an @ symbol). You can tune SSH
parameters using the standard options given by OpenSSH, such as adding
configuration options to ~/.ssh/config (e.g. to allow for passwordless login
- recommended, since this script might ask multiple times for the password).
NOTE: 'rsync' and other UNIX-specific commands are called, thus the command
will not work on non-UNIX environments.
Options:
--keep INTEGER Number of previous backups to keep in the destination.
(default: 1)
--rsync-exe TEXT Specify the 'rsync' executable, if not in PATH. Used for
both local and remote destinations.
--verbosity TEXT Set verbosity [silent|info|debug], default is 'info'.
--help Show this message and exit.
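The remote destination syntax from the help text can be illustrated with a small parser. This is only a sketch: the names RemoteDest and parse_dest are hypothetical, and the rule used to tell local paths from remote ones (no colon, or a slash before the first colon, means local) is an assumption, not necessarily how dostore decides.

```python
from typing import NamedTuple, Optional


class RemoteDest(NamedTuple):
    user: Optional[str]
    host: str
    path: str


def parse_dest(dest: str) -> Optional[RemoteDest]:
    """Return a RemoteDest for '[<remote_user>@]<remote_host>:<path>', or None if local."""
    head, sep, path = dest.partition(":")
    # Assumption: no colon, an empty host part, or a '/' before the colon
    # means that DEST is a local path.
    if not sep or not head or "/" in head:
        return None
    user, _, host = head.rpartition("@")
    return RemoteDest(user or None, host, path)
```

For instance, parse_dest("alice@example.org:/data/backups") separates the user, host and path, while parse_dest("/path/to/backup") returns None.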
Example usage:
$ dostore --path /path/to/container backup /path/to/backup
INFO:Last backup is '/path/to/backup/backup_20231207142602_ymqf', using it for rsync --link-dest.
INFO:Transferred /path/to/container/loose to /path/to/backup/live-backup
INFO:Dumped the SQLite database to /tmp/tmpgewwse3f/packs.idx
INFO:Transferred SQLite database to /path/to/backup/live-backup
INFO:Transferred /path/to/container/packs to /path/to/backup/live-backup
INFO:Backup moved from '/path/to/backup/live-backup' to '/path/to/backup/backup_20231207142913_pz7m'.
INFO:Added symlink 'last-backup' to 'backup_20231207142913_pz7m'.
INFO:Deleted old backup: /path/to/backup/backup_20231207131741_zar7
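The same command can be launched from a Python script, for example from a scheduled job. The sketch below only uses the options listed in the help text above. The helper names dostore_backup_argv and run_backup are hypothetical, and it is assumed that the command signals failure through a non-zero exit code.

```python
import subprocess


def dostore_backup_argv(container, dest, keep=1, verbosity="info", rsync_exe=None):
    """Build the argument list for 'dostore --path <container> backup [OPTIONS] DEST'."""
    argv = ["dostore", "--path", container, "backup",
            "--keep", str(keep), "--verbosity", verbosity]
    if rsync_exe is not None:
        argv += ["--rsync-exe", rsync_exe]
    return argv + [dest]


def run_backup(container, dest, **options):
    """Run the backup; raises CalledProcessError if the exit code is non-zero."""
    return subprocess.run(dostore_backup_argv(container, dest, **options), check=True)
```

Keeping the argument construction separate from the execution makes it easy to log or inspect the exact command before running it.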
For more detailed information about how the backup is made, see the next section.
Detailed info/design
The primary purpose of the backup functionality is to copy the content of the container in a specific order that prevents data corruption due to the container being updated. This order is the following:
loose files;
sqlite database that contains the packed file indexes;
packed files.
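The order above can be sketched for a purely local copy as follows. This is not the library's implementation: it swaps rsync for shutil.copytree and uses the sqlite3 backup API of the Python standard library to copy the database consistently. The folder and file names (loose, packs, packs.idx) are taken from the example log above, and the function name ordered_copy is hypothetical.

```python
import shutil
import sqlite3
from contextlib import closing
from pathlib import Path


def ordered_copy(container: Path, dest: Path) -> None:
    """Copy a container in the order: loose files, SQLite index, packed files."""
    dest.mkdir(parents=True, exist_ok=True)

    # 1. Loose files first, so that objects packed later are not missed.
    shutil.copytree(container / "loose", dest / "loose", dirs_exist_ok=True)

    # 2. The SQLite index next, via the sqlite3 backup API, which produces
    #    a consistent copy even if the database is being written to.
    with closing(sqlite3.connect(container / "packs.idx")) as src, \
            closing(sqlite3.connect(dest / "packs.idx")) as dst:
        src.backup(dst)

    # 3. Packed files last: they contain at least everything the copied
    #    index refers to.
    shutil.copytree(container / "packs", dest / "packs", dirs_exist_ok=True)
```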
To understand why, let’s consider ways the backup could become corrupted:
In the case of packing files (optimize) or adding directly packed files, the library first adds data to the pack file and then writes the metadata to the sqlite database. The backup becomes corrupted if the following happens:
data is being added to a pack file;
backup copies the pack file, containing the incomplete section;
pack file is completed & the sqlite database is updated;
backup copies the sqlite database.
This results in the backup containing an index that references an incomplete section in a pack file. To prevent this, always copy the sqlite database first and then the pack files. The pack files in the backup may still contain an incomplete section, but since the copied index does not reference it, it effectively doesn’t exist for the backup.
If loose files are backed up at the end, the following might happen:
backup copies pack files & sqlite db;
user runs optimize & clean_storage, which adds loose files to a pack & deletes the original files;
backup copies loose files.
This results in files missing in the backup. Therefore, loose files should be copied first.
Note: one should not run the backup while repacking and deleting files.
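Both scenarios can be reproduced with a toy model. In the sketch below a pack file is a byte string, the index is a list of (offset, length) entries, and loose and packed objects are plain sets. None of this reflects the real on-disk format, and the function names pack_race and loose_race are hypothetical.

```python
def pack_race(copy_index_first):
    """Simulate a pack write that completes between the two backup copies."""
    pack = bytearray(b"old-object")
    index = [(0, 10)]                  # (offset, length) of complete objects
    new = b"new-object-in-flight"
    pack += new[:5]                    # writer has appended only part of the data

    if copy_index_first:
        index_copy = list(index)
    else:
        pack_copy = bytes(pack)        # copy contains the incomplete section

    pack += new[5:]                    # writer completes the pack section...
    index.append((10, len(new)))       # ...and only then records it in the index

    if copy_index_first:
        pack_copy = bytes(pack)
    else:
        index_copy = list(index)

    # Consistent if every indexed object lies fully inside the copied pack.
    return all(off + length <= len(pack_copy) for off, length in index_copy)


def loose_race(copy_loose_first):
    """Simulate packing and cleaning of loose objects in the middle of a backup."""
    loose, packed = {"a", "b"}, {"c"}
    backup = set()

    def pack_and_clean():
        packed.update(loose)           # loose objects are added to a pack
        loose.clear()                  # and the original loose files are deleted

    if copy_loose_first:
        backup |= loose
        pack_and_clean()
        backup |= packed
    else:
        backup |= packed
        pack_and_clean()
        backup |= loose
    return backup
```

In this model pack_race(True) returns True and pack_race(False) returns False, while loose_race(True) yields all three objects and loose_race(False) yields only {"c"}, matching the reasoning above.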
Implementation details:
The backup command runs operating-system-level commands on the destination machine by using the Python subprocess library. These currently include:
1. running rsync;
2. for remote destinations, checking if the remote is accessible (ssh <remote> exit);
3. checking if the destination path exists ([ -e <path> ]);
4. checking if the destination directory can be made, if it doesn’t exist (mkdir <path>);
5. moving and removing folders.
For 3-5, remote cases just prepend ssh <remote> to the command, while rsync is used via its native interface to access a remote destination. For both of these cases of remote access, the standard configuration options of OpenSSH are used (such as configuration in ~/.ssh/config).
Steps in order:
1. Input validation:
   - is remote accessible?
   - is DEST accessible?
   - is rsync executable found?
2. Check if a backup already exists in DEST
   - if yes, use the most recent one (based on timestamp in name) for rsync --link-dest argument in all rsync calls
3. Create DEST/live-backup folder
4. rsync loose folder to DEST/live-backup
5. dump sqlite database in a safe manner to a tempfile.TemporaryDirectory()
6. rsync the sqlite database to DEST/live-backup
7. rsync the packed files to DEST/live-backup
8. rsync everything else to DEST/live-backup
9. rename DEST/live-backup to DEST/backup_<timestamp>_<randstr>
10. update DEST/last-backup symlink to point to DEST/backup_<timestamp>_<randstr>
11. delete number of previous backups down to --keep argument
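Some of the bookkeeping in these steps can be sketched with small pure functions: picking the most recent backup for rsync --link-dest, prefixing a command with ssh <remote> for remote destinations, and selecting the old backups to delete. All helper names are hypothetical. The 14-digit timestamp pattern is inferred from the example log, and the -a flag is an assumed choice; the actual implementation may differ.

```python
import re

# Assumption: folder names look like backup_<timestamp>_<randstr> with a
# 14-digit timestamp, e.g. backup_20231207142602_ymqf from the example log.
BACKUP_RE = re.compile(r"^backup_(\d{14})_(.+)$")


def sorted_backups(names):
    """Return backup folder names sorted from oldest to newest by timestamp."""
    matched = [(m.group(1), name) for name in names if (m := BACKUP_RE.match(name))]
    return [name for _, name in sorted(matched)]


def rsync_command(src, dest, link_dest=None, rsync_exe="rsync"):
    """Build an rsync argv, optionally hard-linking unchanged files via --link-dest."""
    cmd = [rsync_exe, "-a"]
    if link_dest is not None:
        cmd.append(f"--link-dest={link_dest}")
    return cmd + [src, dest]


def remote_command(cmd, remote=None):
    """Prefix a command with 'ssh <remote>' when the destination is remote."""
    return list(cmd) if remote is None else ["ssh", remote] + list(cmd)


def backups_to_delete(names, keep=1):
    """Return the previous backups to delete, keeping the newest 'keep' of them."""
    previous = sorted_backups(names)[:-1]   # all but the backup just created
    return previous[: max(len(previous) - keep, 0)]
```

With the three backup folders from the example log and keep=1, backups_to_delete returns only backup_20231207131741_zar7, which matches the "Deleted old backup" line shown there.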