forked from markfasheh/duperemove
-
Notifications
You must be signed in to change notification settings - Fork 0
Tools for deduping file systems
License
GPL-2.0, Unknown licenses found
Licenses found
GPL-2.0
LICENSE
Unknown
LICENSE.xxhash
rx80/duperemove
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Duperemove Duperemove is a simple tool for finding duplicated extents and submitting them for deduplication. When given a list of files it will hash their contents on a block by block basis and compare those hashes to each other, finding and categorizing extents that match each other. When given the -d option, duperemove will submit those extents for deduplication using the btrfs-extent-same ioctl. Duperemove has two major modes of operation one of which is a subset of the other. Readonly / Non-deduplicating Mode When run without -d (the default) duperemove will print out one or more tables of matching extents it has determined would be ideal candidates for deduplication. As a result, readonly mode is useful for seeing what duperemove might do when run with '-d'. The output could also be used by some other software to submit the extents for deduplication at a later time. It is important to note that this mode will not print out *all* instances of matching extents, just those it would consider for deduplication. Generally, duperemove does not concern itself with the underlying representation of the extents it processes. Some of them could be compressed, undergoing I/O, or even have already been deduplicated. In dedupe mode, the kernel handles those details and therefore we try not to replicate that work. Deduping Mode This functions similarly to readonly mode with the exception that the duplicated extents found in our "read, hash, and compare" step will actually be submitted for deduplication. An estimate of the total data deduplicated will be printed after the operation is complete. This estimate is calculated by comparing the total amount of shared bytes in each file before and after the dedupe. See the duperemove man page for further details about running duperemove. REQUIREMENTS Kernel: Duperemove needs a kernel version equal to or greater than 2.6.33. Libraries: Duperemove uses libgcrypt for hashing. FAQ * Is there an upper limit to the amount of data duperemove can process? Right now duperemove has been tested on small numbers of VMS or iso files (5-10). I don't believe there should be a major problem scaling that up to 50 or so. * Why does it not print out all duplicate extents? Internally duperemove is classifying extents based on various criteria like length, number of identical extents, etc. The printout we give is based on the results of that classification. * How can I find out my space savings after a dedupe? Duperemove will print out an estimate of the saved space after a dedupe operation for you. You can also do a df before the dedupe operation, then a df about 60 seconds after the operation. It is common for btrfs space reporting to be 'behind' while delayed updates get processed, so an immediate df after deduping might not show any savings. * Why is the total deduped data report an estimate? At the moment duperemove can detect that some underlying extents are shared with other files, but it can not resolve which files those extents are shared with. Imagine duperemove is examing a series of files and it notes a shared data region in one of them. That data could be shared with a file outside of the series. Since duperemove can't resolve that information it will account the shared data against our dedupe operation while in reality, the kernel might deduplicate it further for us. USAGE EXAMPLES TODO
About
Tools for deduping file systems
Resources
License
GPL-2.0, Unknown licenses found
Licenses found
GPL-2.0
LICENSE
Unknown
LICENSE.xxhash
Stars
Watchers
Forks
Packages 0
No packages published
Languages
- C 94.2%
- Roff 4.7%
- Makefile 1.1%