Duplicacy: A lock-free deduplication cloud backup tool

Duplicacy is a new generation cross-platform cloud backup tool based on the idea of Lock-Free Deduplication.

Our paper explaining the inner workings of Duplicacy has been accepted by IEEE Transactions on Cloud Computing and will appear in a future issue this year. The final draft version is available here for those who don't have IEEE subscriptions.

This repository hosts source code, design documents, and binary releases of the command line version of Duplicacy. There is also a Web GUI frontend built for Windows, macOS, and Linux, available from https://duplicacy.com.

There is a special edition of Duplicacy developed for VMware vSphere (ESXi) named Vertical Backup, which can back up virtual machine files on ESXi to local drives, network shares, or cloud storage.

Features

There are three core advantages of Duplicacy over any other open-source or commercial backup tool:

  • Duplicacy is the only cloud backup tool that allows multiple computers to back up to the same cloud storage, taking advantage of cross-computer deduplication whenever possible, without direct communication among them. This feature turns any cloud storage server supporting only a basic set of file operations into a sophisticated deduplication-aware server.

  • Unlike other chunk-based backup tools where chunks are grouped into pack files and a chunk database is used to track which chunks are stored inside each pack file, Duplicacy takes a database-less approach where every chunk is saved independently, using its hash as the file name to facilitate quick lookups. Avoiding a centralized chunk database not only produces a simpler and less error-prone implementation, but also makes it easier to develop advanced features, such as Asymmetric Encryption for stronger encryption and Erasure Coding for resilient data protection. (A sketch of the resulting storage layout follows this list.)

  • Duplicacy is fast. While performance wasn't the top-priority design goal, Duplicacy has been shown to outperform other backup tools by a considerable margin, as indicated by the following results from a benchmarking experiment that backed up the Linux code base using Duplicacy and three other open-source backup tools.
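
To make the first two points concrete, here is a rough sketch of what a Duplicacy storage looks like on a backend that nests chunks one level deep (the paths are modeled on the chunk and snapshot paths that appear in the issues further down; the exact nesting level varies by backend):

    config                      # storage-wide parameters such as chunk sizes and encryption settings
    snapshots/
        host1/1                 # revision 1 uploaded under snapshot id "host1"
        host1/2
        host2/1                 # a second computer backing up to the same storage
    chunks/
        f0/41b71f8ae1e436...    # each chunk is a separate file named by its hash
        54/ea3d5845e76878...

Because any client can tell whether a chunk already exists simply by checking for the file, no coordination or chunk database is needed for cross-computer deduplication.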

[Chart: Comparison of Duplicacy, restic, Attic, duplicity]

Getting Started

Storages

Duplicacy currently provides the following storage backends:

  • Local disk
  • SFTP
  • Dropbox
  • Amazon S3
  • Wasabi
  • DigitalOcean Spaces
  • Google Cloud Storage
  • Microsoft Azure
  • Backblaze B2
  • Google Drive
  • Microsoft OneDrive
  • Hubic
  • OpenStack Swift
  • WebDAV (under beta testing)
  • pcloud (via WebDAV)
  • Box.com (via WebDAV)
  • File Fabric by Storage Made Easy

Please consult the wiki page on how to set up Duplicacy to work with each cloud storage.
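
As a minimal sketch (the snapshot ID, user, host, and paths are illustrative; consult the wiki for backend-specific storage URLs), initializing a repository against an SFTP storage and running the first backup looks roughly like this:

    $ cd /path/to/repository
    $ duplicacy init mywork sftp://user@backup.example.com/duplicacy   # bind this directory to the storage under snapshot id "mywork"
    $ duplicacy backup -stats                                          # upload the first revision
    $ duplicacy list                                                   # show the revisions stored so far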

For reference, the following chart shows the running times (in seconds) of backing up the Linux code base to each of those supported storages:

[Chart: Comparison of Cloud Storages]

For complete benchmark results please visit https://github.com/gilbertchen/cloud-storage-comparison.

Comparison with Other Backup Tools

duplicity works by applying the rsync algorithm (or, more specifically, the librsync library) to find the differences from previous backups and then uploading only those differences. It is the only existing backup tool with extensive cloud support -- the long list of storage backends covers almost every cloud provider one can think of. However, duplicity's biggest flaw lies in its incremental model -- a chain of dependent backups starts with a full backup followed by a number of incremental ones, and ends when another full backup is uploaded. Deleting one backup renders all subsequent backups on the same chain useless. Periodic full backups are required in order to make previous backups disposable.

bup also uses librsync to split files into chunks but saves chunks in the git packfile format. It supports neither cloud storage nor deletion of old backups.

Duplicati is one of the first backup tools to adopt the chunk-based approach of splitting files into chunks which are then uploaded to the storage. The chunk-based approach gets the incremental backup model right in the sense that every incremental backup is actually a full snapshot. However, because Duplicati splits files into fixed-size chunks, deletions or insertions of a few bytes will foil the deduplication. Cloud support is extensive, but multiple clients can't back up to the same storage location.

Attic has been acclaimed by some as the Holy Grail of backups. It follows the same incremental backup model as Duplicati but embraces the variable-size chunking algorithm for better performance and higher deduplication efficiency (no longer susceptible to byte insertions and deletions). Deletion of old backups is also supported. However, no cloud backends are implemented. Although concurrent backups from multiple clients to the same storage are in theory possible through locking, this is not recommended by the developer because chunk indices are kept in a local cache. Concurrent access is not only a convenience; it is a necessity for better deduplication. For instance, if multiple machines with the same OS installed back up their entire drives to the same storage, only one copy of the system files needs to be stored, greatly reducing the storage space regardless of the number of machines. Attic still adopts the traditional approach of using a centralized indexing database to manage chunks and relies heavily on caching to improve performance. The need for exclusive locking makes it hard to extend to cloud storages.

restic is a more recent addition. It uses a format similar to the git packfile format. Multiple clients backing up to the same storage are still guarded by locks, and because a chunk database is used, deduplication isn't real-time (different clients sharing the same files will upload different copies of the same chunks). A prune operation will completely block all other clients connected to the storage from doing their regular backups. Moreover, since most cloud storage services do not provide a locking service, the best effort is to use some basic file operations to simulate a lock, but distributed locking is known to be a hard problem and it is unclear how reliable restic's lock implementation is. A faulty implementation may cause a prune operation to accidentally delete data still in use, resulting in unrecoverable data loss. This is the exact problem that we avoided by taking the lock-free approach.

The following table compares the feature lists of all these backup tools:

Feature/Tool         duplicity   bup   Duplicati   Attic             restic              Duplicacy
Incremental Backup   Yes         Yes   Yes         Yes               Yes                 Yes
Full Snapshot        No          Yes   Yes         Yes               Yes                 Yes
Compression          Yes         Yes   Yes         Yes               No                  Yes
Deduplication        Weak        Yes   Weak        Yes               Yes                 Yes
Encryption           Yes         Yes   Yes         Yes               Yes                 Yes
Deletion             No          No    Yes         Yes               No                  Yes
Concurrent Access    No          No    No          Not recommended   Exclusive locking   Lock-free
Cloud Support        Extensive   No    Extensive   No                Limited             Extensive
Snapshot Migration   No          No    No          No                No                  Yes

License

  • Free for personal use or commercial trial
  • Non-trial commercial use requires per-computer CLI licenses available from duplicacy.com at a cost of $50 per year
  • A computer with a valid commercial license for the GUI version may run the CLI version without a CLI license
  • CLI licenses are not required to restore or manage backups; only the backup command requires valid CLI licenses
  • Modification and redistribution are permitted, but commercial use of derivative works is subject to the same requirements of this license

Comments
  • bug: restoring from gdrive on gui: rate limit exceeded with 64 threads

    I was testing the GUI version (2.0.8) in a VM, trying to restore with 64 threads some 20+ GB of stuff, 30k folders, 77k files (linux kernel, photos, videos + duplicates of those -- all the general file sizes), and a few minutes after starting the restore I got an error about rate limit exceeded, and the restore stopped.

    "error failed to find the chuck : googleapi: error 403: user rate limit exceeded, userRateLimitExceeded."

    From what I remember, this error was fixed in 2.0.6, right?

  • Failed to upload the chunk / connection timed out errors using Azure

    While running the initial backup to Azure storage from macOS and Linux computers running 2.0.10, the process eventually fails with a message like this:

    Failed to upload the chunk f041b71f8ae1e436ea2e6ce647c0491426a3b146bd2538308cf186fcea5d3e48: Put https://[my storage account].blob.core.windows.net/[my storage container]/chunks/f0/41b71f8ae1e436ea2e6ce647c0491426a3b146bd2538308cf186fcea5d3e48: read tcp 192.168.1.145:59117->52.239.153.4:443: read: connection timed out Incomplete snapshot saved to /mnt/USBHD/.duplicacy/incomplete

    Whenever this happens I just restart the backup process again. Eventually, it finishes. The problem is that these are the initial backups so they are huge and instead of it just running for a day, it's taking forever because it fails after an hour or two and doesn't get started up again until I notice it.

    I am not sure if this happens on Windows or not because I haven't tried from a Windows PC yet but I assume it does. I am also not sure if this only happens when writing to Azure storage or if it affects other clients too.

  • 'cipher: message authentication failed' when downloading chunks from wasabi

    'duplicacy copy -from wasabi -to local -threads 16' frequently crashes with this message:

    2017-09-20 19:27:10.262 ERROR UPLOAD_CHUNK Failed to decrypt the chunk da7f61e4ba27912563f37d1a1849479dac940cd85e60dc5b8068528df8eb889e: cipher: message authentication failed
    

    Sometimes it crashes after less than 100 chunks, sometimes it crashes after more than 1000, but it will reliably crash. It's a different chunk every time.

    The same behaviour occurs with both the s3 and s3c providers. v2.0.9 is affected. So is HEAD (ae44bf722696e1a33e311ffe6729eebf23fca755). Linux and FreeBSD are both impacted.

  • Ad-hoc style backups

    Currently I'm using https://github.com/restic/restic for backups, and it's just great. But I'm also looking at Duplicacy as a secondary backup tool.

    However, it seems to me that Duplicacy isn't tailored for "ad-hoc" backups, in the sense that I can't just initialize the remote storage and then start backing up any folder I want to it, the same way I do with restic.

    For example, with restic I initialize the remote storage with restic -r sftp://HOST/backup init (for the SFTP backend), and then I can start backing up whatever files and folders I want to using e.g. restic -r sftp://HOST/backup backup ~/work /any/other/folder. Restic will store a cache on my local machine, but that's it - there's no need to initialize anything locally. I can be anywhere in the filesystem and call restic telling it to back up any path I want in the filesystem.

    But with Duplicacy it seems I have to decide on a local folder where I want the "repository" to be stored, and then I can run backups in/for that local folder/repository using the backup command (which doesn't seem to take files/folders as arguments), but without being able to specify which folders I want to back up on each backup run. I suppose that I can use excludes/includes to control which files and folders under this local repository are effectively backed up, but that seems a bit messy to me. I would rather just say what data I want to back up, when I'm backing up.

    So is it possible to be more "ad-hoc" with Duplicacy, specifying data to be backed up when you actually back up? Without includes/excludes, can I specify what folders to back up on each run? Can I skip initializing a local repository entirely, considering I have no need for it? (I'm happy to specify the target/remote repository, as well as what to back up, as well as the password on each backup run - the latter can be dealt with in restic by environment variables or a password file.)

    I suppose I could initialize the local repository at / but that's not a folder my non-privileged user has access to, so it seems rather silly to have to do that. Also, initializing one local repository for each folder I want to back up is pretty silly as well, because I usually just back up the folders I have had some changes in that day (one day it's the Economy folder, the other day it's some development projects, another day some other project, etc.).
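
    For reference, the per-folder workflow I am trying to avoid would look roughly like this (snapshot IDs and the storage URL are illustrative):

    $ cd ~/Economy
    $ duplicacy init economy sftp://HOST/backup        # one local repository per folder...
    $ duplicacy backup
    $ cd ~/projects/foo
    $ duplicacy init projects-foo sftp://HOST/backup   # ...each pointed at the same remote storage
    $ duplicacy backup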

    EDIT: Some may wonder why I don't just back everything up every time, and it's pretty simple; if I only changed a couple of files in one place, it makes most sense to me to back up just that folder.

    Issue #148 seems to be a similar discussion, albeit not entirely the same one.

  • default to single-dir-nesting for local and SFTP storages

    As per https://github.com/gilbertchen/duplicacy/issues/163#issuecomment-331242178

    [...] listing (eventually) 65k dirs over SFTP, with a full roundtrip for each one, would just take forever (it's already annoying enough within the LAN). Single level nesting would result in ~1k files per dir with a 1TB repo with the default chunk size which hardly seems overwhelming - and would cut down dirlist requests to 256, or about 5 minutes, which is far more reasonable.

    Quick and informal testing (with a tiny repo) shows that backup, copy, prune, list and check keep working as expected, i.e. single-nested (new) chunks can coexist alongside preexisting double-nested chunks.
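
    For illustration, the two layouts under discussion look like this (hashes abbreviated; the default double nesting yields up to 256 x 256 = 65,536 directories, single nesting only 256):

    chunks/f0/41/b71f8ae1e436ea2e6ce6...    # double-nested (current default)
    chunks/f0/41b71f8ae1e436ea2e6ce6...     # single-nested (proposed default for local and SFTP)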

  • Install to running duplicacy

    I am not sure if this is the right place, but please help, I can't figure this out. How do I get from

    pkg install wget
    pkg install go
    go get -u github.com/gilbertchen/duplicacy/...
    go build -v /root/go/src/github.com/gilbertchen/duplicacy/duplicacy/duplicacy_main.go
    chmod +x /root/go/bin/duplicacy

    to actually running duplicacy?

    I am running this in a FreeNAS 11.1-U5 jail.

    ----edit----

    OK, got it running by downloading the prebuilt binary. Here are the steps:

    pkg install wget          # needed to download a file on my system; you may have it installed already
    pkg install go            # not sure if I needed it, but I got it anyway
    mkdir /root/go/
    mkdir /root/go/bin/
    cd /root/go/bin/
    wget https://github.com/gilbertchen/duplicacy/releases/download/v2.1.0/duplicacy_freebsd_x64_2.1.0
    mv duplicacy_freebsd_x64_2.1.0 duplicacy   # rename to duplicacy; the .0 at the end of the original file name was preventing it from running, not sure why
    chmod +x duplicacy        # make the file executable
    ./duplicacy               # runs the file from this directory
    /root/go/bin/duplicacy    # runs the file from any other directory

  • Discourse forum

    This has been mentioned a few times on the duplicacy website, so I thought I could add it here as well.

    I would mention the issues, but there's no search in there :oh-darn:.

    In case @gilbertchen is interested, I have a test Discourse instance running on my own PC here, with some interesting plugins, and it only took me 30 minutes to get it up and running along with a Let's Encrypt certificate and HTTPS only.

    My PC is not always on though, so you may or may not need to check the website from time to time 😵; however, the Discourse meta is always available and has a lot of helpful folks (including, of course, the Discourse creators -- they're really nice!).

    I have also seen @tophee hanging around on meta, and I assume he knows Discourse better than I do.

    If @gilbertchen wants, I'd be willing to have a go at configuring a Discourse instance, at least for testing (and eventually try and see how SSO works with duplicacy's own website).

  • B2 efficiency

    Let me preface this by saying I hope I got all my math right. Corrections are welcome. I'm attaching the python script I used for calculations and a table of those results.

    tl;dr: b2_list_file_names is inefficient to check for existing chunks. Instead use b2_download_file_by_name with a header "Range: bytes=0-0".

    Duplicacy calls b2_list_file_names prior to every upload to check if a given chunk has already been uploaded. As referenced in #30, the B2 pricing means around 10 GB can be uploaded for free each day, and then $.001/GB. However, the list query only returns up to 1000 entries each call so multiple calls per chunk must be made once more than 1000 chunks have been uploaded. 1000 chunks works out to around only 4 GB. By the time I reach 1 TB of data (around 263000 chunks) it's going to be 263 calls to b2_list_file_names for each new chunk. Ignoring the free data and transactions (negligible given the total data) it's going to cost almost $140 to make the 35 million calls while uploading. 10 TB would require 2.5 million chunks, 3.5 billion total calls, cost $13,750, and require another 2622 list calls just to upload another new chunk. That's over a penny every 4 MB.

    Now, there is the b2_get_file_info call, but that requires passing the fileId which is normally obtained from b2_list_file_names. Duplicacy could save the fileId in the snapshots files, but that would require an architecture change.

    Alternatively, Duplicacy could call b2_download_file_by_name and look at the response code (200 vs 400) and headers. A GET or HEAD adds the entire payload size to your download quota, but including a Range header only adds the requested range size, so "Range: bytes=0-0" transfers only a single byte. This reduces the cost by an order of magnitude and is also guaranteed to require a single call rather than multiple calls, bringing the cost to upload 10 TB from $13,750 down to $1.05. The additional single byte per call (GET or HEAD) works out to around 2 billion calls for a penny. Conveniently, the response headers include the relevant info also returned by b2_get_file_info; it looks like the only real difference is the lack of the bucket id.
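
    A rough sketch of that existence check against the native B2 API (the auth token and download URL come from b2_authorize_account; the bucket name and chunk hash are illustrative):

    # request a single byte of the chunk: an existing chunk returns a success status,
    # a missing one returns an error status, and only one byte counts against the download quota
    $ curl -s -o /dev/null -w "%{http_code}\n" \
        -H "Authorization: $B2_AUTH_TOKEN" \
        -H "Range: bytes=0-0" \
        "$B2_DOWNLOAD_URL/file/$BUCKET_NAME/chunks/f041b71f8ae1e436ea2e6ce647c0491426a3b146bd2538308cf186fcea5d3e48"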

    On the more hacky side, b2_hide_file and b2_delete_file_version are each unbilled calls. These could be combined to test for the existence of a file. I can see an issue if the process dies before unhiding the chunk and it looks like Duplicacy uses B2 hidden files for fossils. I think this option is more trouble than it's worth.

    More generally, B2 chunks are all stored under the "chunks" prefix. If they were instead broken up into nested prefixes taken from the hash, each b2_list_file_names call could stay cheap by passing the relevant prefix. In other words, chunk/badcoffee12345 would instead be stored at chunk/ba/dc/of/fe/e1/23/45. Each prefix, and therefore each call to b2_list_file_names, would hold no more than 256 files. 10 TB would now cost $10.49. Switching to b2_download_file_by_name would cut that by another factor of ten, but it would offer no additional benefit over using the download call with all chunks in the same directory.

    Changing the file structure would break old backups, but perhaps it could be implemented as new storage target.

    I also sent a message to Backblaze to ask about new API calls that could be used to check for file existence. Specifically:

    • b2_get_file_info_by_name
    • b2_get_file_id_by_name

    I received a response that it's been added as a feature request.

    Attachments: b2-calc.txt, b2-calc.py.txt

  • wasabi backend fails with "Failed to fossilize chunk" when pruning (fixed with pull request)

    Using dfdbfed on Ubuntu 16.04.4.

    When pruning, the Wasabi backend consistently fails with "Failed to fossilize chunk". It reports a 404 Not Found; however, the chunk is present on the storage backend.

    Executing the same command again repeats the process, except it fails while fossilizing a different chunk. Each run fails on a different chunk, with the previous chunks remaining orphaned on the storage.

    After swapping back to the s3 backend, the prune completed without errors.

    Attached is a full log showing repeated consecutive invocations failing on different chunks, plus the final invocation with the s3 backend which was successful on first attempt.

    Here's an excerpt from one of the failed runs:

    # duplicacy -d -log prune -keep 1:1 -keep 5:7 -keep 15:30 -keep 30:180 -keep 0:360 | tee -a /var/log/duplicacy.log
    2018-07-11 21:20:03.673 INFO STORAGE_SET Storage set to wasabi://[email protected]/redacted
    2018-07-11 21:20:03.673 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_WASABI_KEY
    2018-07-11 21:20:03.673 DEBUG PASSWORD_PREFERENCE Reading wasabi_key from preferences
    2018-07-11 21:20:03.673 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_WASABI_SECRET
    2018-07-11 21:20:03.673 DEBUG PASSWORD_PREFERENCE Reading wasabi_secret from preferences
    2018-07-11 21:20:03.673 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_WASABI_KEY
    2018-07-11 21:20:03.674 DEBUG PASSWORD_PREFERENCE Reading wasabi_key from preferences
    2018-07-11 21:20:03.674 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_WASABI_SECRET
    2018-07-11 21:20:03.674 DEBUG PASSWORD_PREFERENCE Reading wasabi_secret from preferences
    2018-07-11 21:20:03.674 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
    2018-07-11 21:20:03.674 DEBUG PASSWORD_PREFERENCE Reading password from preferences
    2018-07-11 21:20:03.982 TRACE CONFIG_ITERATIONS Using 16384 iterations for key derivation
    2018-07-11 21:20:04.053 DEBUG STORAGE_NESTING Chunk read levels: [1], write level: 1
    2018-07-11 21:20:04.067 INFO CONFIG_INFO Compression level: 100
    2018-07-11 21:20:04.067 INFO CONFIG_INFO Average chunk size: 4194304
    2018-07-11 21:20:04.067 INFO CONFIG_INFO Maximum chunk size: 16777216
    2018-07-11 21:20:04.067 INFO CONFIG_INFO Minimum chunk size: 1048576
    2018-07-11 21:20:04.067 INFO CONFIG_INFO Chunk seed: 50a4d4a8aaa118dbbceed243e071d24fd799d32ba31efe8f033a391fe1d2f1c7
    2018-07-11 21:20:04.067 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
    2018-07-11 21:20:04.067 DEBUG PASSWORD_PREFERENCE Reading password from preferences
    2018-07-11 21:20:04.067 DEBUG DELETE_PARAMETERS id: backup, revisions: [], tags: [], retentions: [1:1 5:7 15:30 30:180 0:360], exhaustive: false, exclusive: false, dryrun: false, deleteOnly: false, collectOnly: false
    2018-07-11 21:20:04.068 INFO RETENTION_POLICY Keep 1 snapshot every 1 day(s) if older than 1 day(s)
    2018-07-11 21:20:04.068 TRACE SNAPSHOT_LIST_IDS Listing all snapshot ids
    2018-07-11 21:20:04.113 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot backup
    2018-07-11 21:20:04.198 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/backup/1 from the snapshot cache
    2018-07-11 21:20:04.239 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/backup/2 from the snapshot cache
    2018-07-11 21:20:04.285 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/backup/3 from the snapshot cache
    2018-07-11 21:20:04.326 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/backup/4 from the snapshot cache
    2018-07-11 21:20:04.367 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/backup/5 from the snapshot cache
    2018-07-11 21:20:04.409 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/backup/6 from the snapshot cache
    2018-07-11 21:20:04.410 DEBUG SNAPSHOT_DELETE Snapshot backup at revision 2 to be deleted - older than 1 days, less than 1 days from previous
    2018-07-11 21:20:04.410 INFO SNAPSHOT_DELETE Deleting snapshot backup at revision 2
    2018-07-11 21:20:04.424 DEBUG CHUNK_CACHE Chunk 1b3bdb3229746baa88efe5c48cf6189fbe572e863d63ab56fb2679ab580b579a has been loaded from the snapshot cache
    2018-07-11 21:20:04.444 DEBUG DOWNLOAD_FETCH Fetching chunk 1ce42d419b9ecaef78f4760bd5180443084fa7eca8e8c6eba174ea3a8ba1fe6f
    2018-07-11 21:20:04.445 DEBUG CHUNK_CACHE Chunk 1ce42d419b9ecaef78f4760bd5180443084fa7eca8e8c6eba174ea3a8ba1fe6f has been loaded from the snapshot cache
    2018-07-11 21:20:04.447 DEBUG DOWNLOAD_FETCH Fetching chunk 7c3f4844cd94b75c0b969257775b7bc2fb1a132c9d93446a350ff088f99d2b2c
    2018-07-11 21:20:04.452 DEBUG CHUNK_CACHE Chunk 7c3f4844cd94b75c0b969257775b7bc2fb1a132c9d93446a350ff088f99d2b2c has been loaded from the snapshot cache
    2018-07-11 21:20:04.572 DEBUG DOWNLOAD_FETCH Fetching chunk 54ea3d5845e7687fef56a098199a8db2950c1c7d1796754dcae341b82d83b5f2
    2018-07-11 21:20:04.577 DEBUG CHUNK_CACHE Chunk 54ea3d5845e7687fef56a098199a8db2950c1c7d1796754dcae341b82d83b5f2 has been loaded from the snapshot cache
    2018-07-11 21:20:04.579 DEBUG DOWNLOAD_FETCH Fetching chunk a14efa66b126ee24aa13ea39960eca7a4b710a128d898c9ddffce5d57f3b3a2d
    2018-07-11 21:20:04.580 DEBUG CHUNK_CACHE Chunk a14efa66b126ee24aa13ea39960eca7a4b710a128d898c9ddffce5d57f3b3a2d has been loaded from the snapshot cache
    2018-07-11 21:20:04.684 DEBUG DOWNLOAD_FETCH Fetching chunk 54ea3d5845e7687fef56a098199a8db2950c1c7d1796754dcae341b82d83b5f2
    2018-07-11 21:20:04.689 DEBUG CHUNK_CACHE Chunk 54ea3d5845e7687fef56a098199a8db2950c1c7d1796754dcae341b82d83b5f2 has been loaded from the snapshot cache
    2018-07-11 21:20:04.689 DEBUG DOWNLOAD_FETCH Fetching chunk 2bceac3d48bef096404bfe4a7c4645701603b9738198fdae0c867d6d4ea6a3ff
    2018-07-11 21:20:04.690 DEBUG CHUNK_CACHE Chunk 2bceac3d48bef096404bfe4a7c4645701603b9738198fdae0c867d6d4ea6a3ff has been loaded from the snapshot cache
    2018-07-11 21:20:04.794 DEBUG DOWNLOAD_FETCH Fetching chunk 54ea3d5845e7687fef56a098199a8db2950c1c7d1796754dcae341b82d83b5f2
    2018-07-11 21:20:04.799 DEBUG CHUNK_CACHE Chunk 54ea3d5845e7687fef56a098199a8db2950c1c7d1796754dcae341b82d83b5f2 has been loaded from the snapshot cache
    2018-07-11 21:20:04.800 DEBUG DOWNLOAD_FETCH Fetching chunk ccf9a9c0085d1cf160f6437c71a47f3b6e6560d501ec049d232038a44ff6f6d8
    2018-07-11 21:20:04.800 DEBUG CHUNK_CACHE Chunk ccf9a9c0085d1cf160f6437c71a47f3b6e6560d501ec049d232038a44ff6f6d8 has been loaded from the snapshot cache
    2018-07-11 21:20:05.192 ERROR CHUNK_DELETE Failed to fossilize the chunk 515b226b6f533991a99884759bd428e46a1d36aa66aca71897490e95ded33a70: 404 Not Found
    

    Attachment: prune.log

  • Access denied listing subdirectory on Windows 10

    Running a backup on Windows 10, duplicacy 1.1.6 is giving the following error many times for different paths:

    Failed to list subdirectory: 
    open C:\Users\martin/.duplicacy\shadow\\Users\martin/data/identities/personal/project/mithril-boilerplate/node_modules/watchify/node_modules/browserify/node_modules/insert-module-globals/node_modules/lexical-scope/node_modules/astw/node_modules/esprima-fb/examples: 
    Access is denied.
    

    Is this a long path issue?

    Note: In an Admin shell, backup executed using C:\Users\martin>duplicacy backup -stats -vss

  • support for metadata, Mac extended attributes, etc.

    I was originally just concerned with macOS resource forks, but was reminded of the Backup Bouncer test suite.

    Macs currently split non-data-fork data into AppleDouble-encoded "._filename" files when saving to non-Mac systems.

    These forks can be listed and displayed with the xattr command.

    The easiest way to create a file with a resource fork is dragging a URL or text selection to the desktop to create a "webloc" or "textClipping" file.
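
    For example, once such a file exists, its extended attributes can be inspected like this (file name illustrative):

    # list the names and values of the extended attributes attached to the file
    $ xattr -l "Some Link.webloc"
    # on macOS a resource fork, if present, can also be read via the ..namedfork/rsrc path
    $ ls -l "Some Link.webloc/..namedfork/rsrc"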

  • Rework host authentication for SFTP storage backends

    https://github.com/gilbertchen/duplicacy/blob/0a794e6feae91cf0e6511b77a5bf160999c1838e/src/duplicacy_storage.go#L165-L211

    Just something I stumbled across trying to understand how duplicacy verifies the host it is connecting to.

    1. This code is not fully compatible with the standard definition of the SSH known_hosts file syntax [1]. Perhaps it's not meant to be, but then the name of the file is misleading. While the code appears to account for markers, it does not seem to account for the brackets around the host name when a non-standard port is used. Also, when the standard port 22 is used, the code does not enforce that the port be omitted in the file. (An example of the standard syntax is shown after the reference below.)

    2. I'd argue that duplicacy should not automatically connect to hosts for which it has never seen a fingerprint before without explicit user consent. Right now, a MITM attack during the first use of duplicacy could expose data.

    3. Maybe also consider taking ~/.ssh/known_hosts into account, in addition to .duplicacy/known_hosts?

    [1] https://man.openbsd.org/sshd#SSH_KNOWN_HOSTS_FILE_FORMAT
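
    For reference, standard known_hosts entries look like this (host name and key are illustrative):

    # standard port 22: the host name appears without brackets
    backup.example.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA...
    # non-standard port: the host name is bracketed and the port is appended
    [backup.example.com]:2222 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA...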

  • Security issue in encryption key derivation?

    I was trying to understand the encryption code in duplicacy and its handling of the many keys stored in the config file when I discovered something unexpected here:

    https://github.com/gilbertchen/duplicacy/blob/f2d6de3fff7567740e86cb49801b0698f085cc59/src/duplicacy_chunk.go#L216-L218

    For some reason this code is using the 'derivation key', which for things like snapshots is just a plaintext file path, as the secret key for Blake2b, and then digesting the encryptionKey to get the combined key. I would have expected the opposite, using the encryptionKey as the secret key for Blake2b, and then digesting the path. Indeed the wiki page on encryption here https://github.com/gilbertchen/duplicacy/wiki/Encryption states:

    "The snapshot is encrypted by AES-GCM too, using an encrypt key that is the HMAC-SHA256 of the file path with the File Key as the secret key."

    Which is precisely the opposite of what the code actually does. Worse, it looks like this may have led to issues in the past, specifically this commit https://github.com/gilbertchen/duplicacy/commit/d330f61d251f12c24cdd38b77d143cbb716913da - which would never have been an issue if the construction wasn't backwards.

    I am not sure how exploitable this is, I hope it isn't, but it's a pretty big code smell in the middle of the encryption code.

  • Live backup preview (mount)

    This work is inspired by !628, but it's a more complete implementation.

    I'm using cgofuse so it works on all platforms, including Windows.

    There are two commands:

    • mount <MOUNT_POINT>: must be in a repository directory, uses preferences information to load snapshot info.
    • mount-storage <STORAGE_URL> <MOUNT_POINT>: mounts a storage directly without the need for a repository. note that it may need a repository under some circumstances anyway, so there's a -repository option for that.

    Snapshot/revision information is loaded only when you try to browse the containing folder. By default it'll create a base tree organizing snapshot revisions by date using the format /%YYYY/%MM/%DD/%HH%mm.%REV. For instance, let's say you have this output from duplicacy list:

    $ duplicacy list
    Snapshot Users revision 1 created at 2022-10-19 20:09 -hash
    Snapshot Users revision 2 created at 2022-10-19 21:38
    Snapshot Users revision 3 created at 2022-10-20 10:38
    Snapshot Users revision 4 created at 2022-11-14 13:44
    

    The following base tree will be created:

    2022/
        10/
            19/
                2009.1/
                2138.2/
            20/
                1038.3/
        11/
            14/
                1344.4/
    

    If you want all revisions to be in the same directory, you can use the -flat parameter. Revision and snapshot directories have an extra empty directory to prevent Windows Explorer from prematurely triggering the downloading of data from the storage.
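
    Typical usage with the proposed commands looks roughly like this (mount points and storage URL are illustrative):

    $ duplicacy mount /mnt/duplicacy                                   # run from inside a repository
    $ ls /mnt/duplicacy/2022/10/19/
    2009.1  2138.2
    $ duplicacy mount-storage sftp://user@host/path /mnt/duplicacy     # or mount a storage directly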

    Folders and files will display their saved attributes, with the caveat that everything that's not a folder is shown as a regular file.

    File reading is implemented efficiently by downloading only the chunks needed for the specific OS read request and caching them. The cache is a 2Q LRU cache from golang-lru with size 20, keyed by the chunk hash.

    I've tested using a repository with these characteristics:

    • 10.2 GB
    • 70151 Files
    • 10467 Folders
    • 4 revisions
    • sftp storage on a remote server

    It takes a couple of seconds to create the base tree and a few seconds for every revision dir that is loaded for the first time, but otherwise everything works as expected. Memory use hovered around 700 MB, even when tarring a whole revision dir and piping it to /dev/null.

  • kind duplicacy and other tools benchmark report

    Hello,

    I'm currently doing benchmarks for deduplication backup tools, including duplicacy. I decided to write a script that would:

    • Install the backup programs
    • Prepare the source server
    • Prepare local targets / remote targets
    • Run backup and restore benchmarks
    • Use publicly available data (the Linux kernel sources as a git repo) and check out various git tags to simulate user changes in the dataset

    The idea of the script is to have reproducible results, the only changing factors being the machine specs and the network link between sources and targets.

    So far, I've run two sets of benchmarks, each done locally and remotely. You can find the results at https://github.com/deajan/backup-bench

    I'd love for you to review the recipe I used for duplicacy, and perhaps guide me on what parameters to use to get maximum performance. Any remarks / ideas / PRs are welcome.

    I've also made a comparison table of some features of the backup solutions I'm benchmarking. I'm still missing some information for some of the backup programs. Would you mind having a look at the comparison table and filling in the question marks related to the features of duplicacy? Also, if duplicacy has an interesting feature I didn't list, I'll be happy to extend the comparison.

    PS: I'm trying to be as unbiased as possible when it comes to those benchmarks, please forgive me if I didn't treat your program with the parameters it deserves.

    Also, I've created the same issue in every git repo of the backup tools I'm testing, so every author / team / community member can judge / improve the instructions for better benchmarking.

  • Duplicacy, Btrfs (Streams) & Default config directory

    As discussed on the forum: Duplicacy and btrfs snapshots and question 2 of ... Have 3 Questions.

    I would like to be able to do something akin to btrfs send | duplicacy. That way I can use the deduplication, compression, and encryption mechanisms of duplicacy. This is similar to btrbk, which effectively does btrfs send | zstd | gpg | ssh, but better.
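
    For context, here is a sketch of the btrbk-style pipeline referred to above (snapshot path, recipient, and host are illustrative); the idea would be for duplicacy to take the place of the zstd | gpg | ssh stages:

    # stream a read-only btrfs snapshot, compress it, encrypt it, and ship it over ssh
    $ btrfs send /mnt/data/.snapshots/daily-2022-01-01 \
        | zstd \
        | gpg --encrypt --recipient backup@example.com \
        | ssh user@backup.example.com 'cat > daily-2022-01-01.btrfs.zst.gpg'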

    I would also like to suggest that duplicacy default to $HOME for the .duplicacy folder if there is none in the present directory, and/or add flags like -backup-dir and -snapshot-id to specify them.

  • Sharepoint support

    Added support for SharePoint document libraries to the ODB backend. It uses the same format as Team Drives in GCD, i.e. odb://DRIVEID@path/to/storage, where DRIVEID is in the format "b!xxxxxxx". If @ is not used in the odb:// storage specification, the behaviour is exactly the same as before, so there are no changes to existing setups.
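
    For example, initializing against a SharePoint document library would look roughly like this (drive ID and path are illustrative):

    $ duplicacy init mydocs 'odb://b!AbCdEf0123456789@backup/duplicacy'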

    No or minimal changes are needed for the GUI, as the path spec can go into the existing field (filtering may need to be adjusted).

    This pull request is built on top of the custom_odb_creds PR - not because they depend on each other (they don't), but because they modify the same files, so merging would have been non-trivial.
