17

My end goal is to refactor code written by my coworkers. Is there a tool that can find files that differ in only a few words?

(Edit: this is for a Mac, but others might like non-Mac answers too.)

James Mertz
tig
  • @harry, given [the revision history](http://superuser.com/posts/154699/revisions), I guess you posted a Windows answer, which was downvoted because only then the Mac requirement emerged? I'd rather have the Mac requirement dropped and see your answer (if it was a good non-Mac answer) as well! – Arjan Sep 18 '10 at 13:00
  • @Arjan: Done. – harrymc Sep 18 '10 at 13:48
  • For a Mac, I wondered if Spotlight could be used. I doubt it, but if you know of a way to do this in Spotlight, then the `mdfind` command might help you write a script to automate things. However, I think it only ever uses metadata, so a search for similar files could filter on file type but not on file contents. No cigar. – Arjan Sep 18 '10 at 15:21

2 Answers

5

Simian does this for the source code of some languages. It is best at finding blatant copy-and-paste coding. Its development seems to have stalled, but it works well enough.

Benjamin Bannier
  • Did not help very much: in a Rails app with a lot of very similar partials, it only said that I have similar lines in development.log – tig Jun 21 '10 at 11:49
  • Did you give it the right files to analyze? You probably care about your sources, not `development.log`. For Rails, have a look at flay: http://rubyforge.org/frs/shownotes.php?group_id=1513&release_id=38004 – Benjamin Bannier Jun 21 '10 at 13:16
  • Yes, I gave it all the files in the Rails app dir – tig Jun 22 '10 at 00:50
2

(For Windows)

The product Anti-Twin (free for private use) claims to be able to do this:

> If you want Anti-Twin not only to search for full duplicates but also to similar files, you can reduce the desired minimum match from the default value of 100% to up to 60%. This function has been particularly designed for the search of almost identical files where only a tiny detail was changed. Anti-Twin uses the similarity search as soon as you enter a value below 100%. The similarity comparison takes much longer than the 100% full duplicate search!
>
> Unfortunately, the similarity search as part of the byte-by-byte comparison only makes sense for a few file types, because a similarity can only be detected if the files are uncompressed and unencrypted. Uncompressed files are e.g. unformatted texts (.TXT) and HTML.
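
For plain-text sources, the same "minimum match" idea can be roughly approximated with a short script. The sketch below is not how Anti-Twin (or Simian) works internally; it only uses Python's standard `difflib` to compare files word by word. The function names `word_similarity` and `find_similar_files`, the `*.erb` pattern and the 0.9 threshold are all assumptions made for illustration.

```python
import difflib
import itertools
import sys
from pathlib import Path


def word_similarity(text_a: str, text_b: str) -> float:
    """Return a 0.0-1.0 similarity ratio over whitespace-separated words."""
    matcher = difflib.SequenceMatcher(
        None, text_a.split(), text_b.split(), autojunk=False
    )
    return matcher.ratio()


def find_similar_files(root: str, pattern: str = "*.erb", min_match: float = 0.9):
    """Yield (ratio, path_a, path_b) for each pair of files at or above min_match.

    `pattern` and `min_match` are hypothetical defaults; adjust them to the
    file types and similarity level you care about.
    """
    texts = {}
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            # Undecodable bytes are replaced so odd files do not abort the scan.
            texts[path] = path.read_text(encoding="utf-8", errors="replace")

    # Compare every pair once; this is quadratic in the number of files.
    for (path_a, text_a), (path_b, text_b) in itertools.combinations(texts.items(), 2):
        ratio = word_similarity(text_a, text_b)
        if ratio >= min_match:
            yield ratio, path_a, path_b


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    hits = sorted(find_similar_files(root), key=lambda hit: hit[0], reverse=True)
    for ratio, path_a, path_b in hits:
        print(f"{ratio:.0%}  {path_a}  {path_b}")
```

Saved under a name of your choice and given a directory as its only argument, it prints the most similar pairs first. Because every file is compared with every other file, it gets slow on large trees, which fits the remark in the quote that the similarity comparison takes much longer than a full-duplicate search.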

harrymc