I have a directory with subdirectories, and a lot of duplicate files in them. If I move everything to a single rar archive, will WinRAR detect the duplicate files, or will all of them be archived and add up to the size of the rar archive?
I do not think WinRAR has that kind of intelligence: files in different directory hierarchies can share a name and extension while their contents may or may not be the same, so it would have to check every byte, which is costly and difficult. One test you can do: create the archive as described above and note its size, then rename one file, archive again, and compare. The size should be the same. – Zenwalker Mar 29 '12 at 10:16
As with any archive program, it will archive exactly what you tell it to. If you want to get rid of duplicates, do that before you create the archive. – Ramhound Mar 29 '12 at 12:28
4 Answers
The new version of WinRAR, 5.00, has introduced the new RAR5 archive format and this feature is one of many improvements:
Save identical files as references
If this option is enabled, WinRAR analyzes the file contents before starting archiving. If several identical files larger than 64 KB are found, the first file in the set is saved as usual file and all following files are saved as references to this first file. It allows to reduce the archive size, but applies some restrictions to resulting archive. You must not delete or rename the first identical file in archive after the archive was created, because it will make extraction of following files using it as a reference impossible. If you modify the first file, following files will also have the modified contents after extracting. Extraction command must involve the first file to create following files successfully.
It is recommended to use this option only if you compress a lot of identical files, will not modify an archive later and will extract an archive entirely, without necessity to unpack or skip individual files. If all identical files are small enough to fit into compression dictionary, solid archiving can provide more flexible solution than this option.
Supported for RAR 5.0 archives only.
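To make the restrictions quoted above concrete, here is a minimal Python sketch of the idea. It is not WinRAR's actual implementation: it assumes duplicates are detected with a SHA-256 digest (the help text does not say how WinRAR compares files), and the names `build_archive` and `extract` are made up for illustration.

```python
import hashlib

MIN_SIZE = 64 * 1024  # size threshold quoted in the help text above


def build_archive(files):
    """Map each name to ("data", bytes) or ("ref", name_of_first_copy)."""
    first_seen = {}  # content digest -> name of the first file with that content
    archive = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if len(data) > MIN_SIZE and digest in first_seen:
            # Later identical file: store only a pointer to the first copy.
            archive[name] = ("ref", first_seen[digest])
        else:
            archive[name] = ("data", data)
            if len(data) > MIN_SIZE:
                first_seen[digest] = name
    return archive


def extract(archive, name):
    """Return the contents of one entry, following a reference if needed."""
    kind, payload = archive[name]
    if kind == "data":
        return payload
    # A reference is only usable while the first copy is still in the archive.
    return extract(archive, payload)


big = b"x" * (MIN_SIZE + 1)
archive = build_archive({"a/mod.bin": big, "b/mod.bin": big, "c/small.txt": b"hi"})
print({name: entry[0] for name, entry in archive.items()})
```

In this toy model, deleting `a/mod.bin` from `archive` makes `extract(archive, "b/mod.bin")` fail with a `KeyError`, which mirrors the warning about deleting or renaming the first identical file.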
My quick test on a folder that contains 320,000 files (Baldur's Gate Trilogy with a lot of mods):
RAR4 compression method, compression set to "Store": 26.1 GB (28,053,815,768 bytes)
RAR5 compression method, compression set to "Store" and "Save identical files as references" turned on: 23.9 GB (25,722,664,097 bytes)
So I was able to save over 8% without using any compression at all!
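For anyone who wants to check the percentage against the byte counts above, a short Python calculation follows. The variable names are mine. Measured against the RAR4 archive the saving comes to about 8.3%; it only reaches about 9.1% if measured against the smaller RAR5 archive.

```python
rar4_store = 28_053_815_768  # bytes, RAR4 with "Store"
rar5_refs = 25_722_664_097   # bytes, RAR5 with "Store" plus identical-file references

saved = rar4_store - rar5_refs
print(f"saved {saved:,} bytes")
print(f"{saved / rar4_store:.1%} of the RAR4 archive size")
print(f"{saved / rar5_refs:.1%} of the RAR5 archive size")
```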
"You must not delete or rename the first identical file in archive after the archive was created, because it will make extraction of following files using it as a reference impossible." This is bizarre. Why not keep a reference counter and only delete the file once the counter reaches zero? That's how hard links work on Linux filesystems, and the storage overhead is 2 bytes per file (for a uint16, which can count up to 65535 references), or 1 byte if you're OK with a max of 255 references. – hanshenrik Jan 01 '20 at 16:53
If the files are really duplicates (or near duplicates), compression software can exploit that similarity across files to greatly increase the compression ratio. This is called solid compression. WinRAR and 7-Zip are two popular archivers that support it; 7-Zip uses it by default. I'm not a RAR user, so I can't tell you its default configuration.
Common archivers on Linux/Unix/BSD systems also implicitly do solid compression by concatenating all the files together into a single file (most often via tar) before compressing that single file as a large block.
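The effect is easy to reproduce with Python's standard `lzma` module. This is only a sketch of the principle, not of how RAR or 7-Zip work internally: the three in-memory "files" are hypothetical stand-ins with identical random contents, so compressing each one separately gains nothing, while compressing the concatenation lets the compressor reuse the first copy.

```python
import lzma
import os

# Three "files" with identical, incompressible contents (hypothetical stand-ins).
payload = os.urandom(200_000)
files = [payload, payload, payload]

# Non-solid: each file is compressed on its own, so duplicates cannot help.
per_file = sum(len(lzma.compress(data)) for data in files)

# Solid: compress the concatenation as one stream, as tar plus a compressor does.
solid = len(lzma.compress(b"".join(files)))

print(f"per-file: {per_file:,} bytes, solid: {solid:,} bytes")
```

The solid result should come out close to the size of a single copy, because the second and third copies are encoded as back-references into the first.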
The one giant caveat to all this is that you don't really have any way of knowing exactly which files are similar, or how similar they are. It's not a good way of finding out what duplicate files you have, and extracting the archive is going to restore all that duplication. Which is, normally, exactly what one wants and expects with data compression -- to get back out exactly what was put into it.
If you want to clean up your folders, you need duplicate detection software. For normal collections, there's tons of software out there that ferrets out duplicate files. If you're dealing with media (audio, video, pictures), then you're going to want software that doesn't search for exact duplicates, but can fingerprint your files and find groups of files that are similar. That way, if you've got 2 copies of the same song with different tags or compressed slightly differently (say, a 128 Kb/s MP3 and a 256 Kb/s AAC) they can be identified. Or identifying 2 pictures of the same subject where one has been cropped or edited. Each media type often has specialized software for finding similar files, and there have been questions here before dealing with the particulars of each type. Of course, cleaning up such collections is much more difficult and time consuming because there's no fast and easy rules for deciding which file should be kept.
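For exact duplicates, the core of such a tool can be sketched in a few lines of Python. This is an illustration only, not how any particular product works: `find_duplicates` is a made-up name, the choice of SHA-256 is my assumption, and each candidate file is read fully into memory, so a real tool would hash in chunks and handle unreadable files. It does nothing for the "similar media" case described above.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical."""
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            by_hash[digest].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling `find_duplicates("some/folder")` (a placeholder path) returns one list per set of identical files; deciding which copy to keep is still up to you.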
Thanks for the clear explanation on solid compression! I just tried 7-zip on a directory with a lot of redundancy, and got a factor 10 improvement compared to zip -- although only after significantly increasing dictionary size and word size from their default values (default parameter values yield basically no improvement). – mitchus Apr 04 '13 at 13:10
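The dependence on dictionary size can be shown with Python's `lzma` module. This is a sketch under assumptions, not a model of 7-Zip's settings: the data is two copies of a random 200,000-byte block, and the two dictionary sizes (64 KiB and 1 MiB) are picked purely for illustration. The compressor can only refer back to data that is still inside its dictionary, so the second copy is found only when the dictionary spans the distance between the copies.

```python
import lzma
import os

block = os.urandom(200_000)
data = block + block  # the duplicate starts 200,000 bytes after the original


def xz_size(payload, dict_size):
    """Compressed size of payload using LZMA2 with the given dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
    return len(lzma.compress(payload, format=lzma.FORMAT_XZ, filters=filters))


small = xz_size(data, 64 * 1024)     # dictionary shorter than the gap
large = xz_size(data, 1024 * 1024)   # dictionary covers both copies
print(f"64 KiB dictionary: {small:,} bytes, 1 MiB dictionary: {large:,} bytes")
```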
WinRAR will not do what you want. However, there are other tools that can find duplicated files inside a folder or in a partition. I have needed to do such a thing before, and I used Easy Duplicate Finder software:
Easy Duplicate Finder is a powerful tool to find and resolve duplicate photos, documents, spreadsheets, MP3's, and more! Removing duplicates will also help to speed up indexing and reduces back up size and time. Your computer isn’t fully optimized until you’ve removed all unnecessary duplicate files. Let Easy Duplicate Finder remove the duplicates!
To compress duplicate files with similar/different filenames, use these 2 options in WinRAR 5:
- Create Solid Archive
Solid archive is an archive packed with a special compression method, which treats several or all files within the archive as one continuous data stream. WinRAR supports solid mode only in RAR archiving format, ZIP archives are always non-solid. Solid archiving significantly increases compression when adding a large number of small, similar files.
- Save identical files as references
WinRAR analyzes the file contents before starting archiving. If several identical files are found, the first file in the set is saved as usual file and all following files are saved as references to this first file. It allows to reduce the archive size, but applies some restrictions to resulting archive. You must not delete or rename the first identical file in archive after the archive was created, because it will make extraction of following files using it as a reference impossible. If you modify the first file, following files will also have the modified contents after extracting. Extraction command must involve the first file to create following files successfully.
It is recommended to use Save identical files as references only if you compress a lot of identical files, will not modify an archive later and will extract an archive entirely, without necessity to unpack or skip individual files.
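For scripted use, the same two options can be passed to the console `rar` program. The Python sketch below only builds and prints the command line. The switch names are as I recall them from the RAR console manual (`-s` for a solid archive, `-oi` for saving identical files as references), so treat them as an assumption and check `rar.txt` for your version; the archive and folder names are placeholders, and `rar_dedup_command` is a made-up helper.

```python
def rar_dedup_command(archive, folder):
    """Build a rar command line that enables both options described above."""
    return [
        "rar", "a",  # add files to an archive
        "-s",        # create a solid archive (assumed switch, see rar.txt)
        "-oi",       # save identical files as references (assumed switch)
        archive,
        folder,
    ]


# Placeholder names; nothing is executed here.
print(" ".join(rar_dedup_command("backup.rar", "my_folder")))
```

To actually run it, the list can be handed to `subprocess.run(..., check=True)` on a machine where `rar` is on the PATH.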
When creating a new .rar archive, the settings are in the following location:
Welcome to Super User! Please quote the essential parts of the answer from the reference link(s), as the answer can become invalid if the linked page(s) change. – DavidPostill Aug 27 '20 at 07:50