Ocaml clone detector
Date: | 2009-09-03 (23:13) |
From: | John Clements <aoeucaml@b...> |
Subject: | Re: [Caml-list] Ocaml clone detector |
On Sep 3, 2009, at 8:06 AM, Nicolas barnier wrote:

> An amazing and simple technology to detect plagiarism is
> compression-based similarity distance. It is a side-effect
> of state-of-the-art compression algorithms that can be used
> to compute a distance for many kinds of documents (it seems
> to work at least for program sources, books, music, DNA etc):
> take any two files A and B, compress A, compress B, and compress
> the concatenation of A and B, i.e. AB; take the size of these
> compressed files c(A), c(B) and c(AB); the similarity distance
> is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)).
> Indeed, if documents A and B share information, the compression
> of AB will be much shorter than c(A) + c(B).

Also see Alex Aiken's "MOSS" (measure of software similarity). It's
online, language-specific, and works for a variety of languages. I don't
know how its algorithm compares to the one here. I suspect it's different
insofar as the one you describe is language-independent.

John Clements
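
For anyone who wants to try the distance quoted above, here is a minimal sketch in Python. It only illustrates the formula d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max(c(A), c(B)). The choice of zlib as the compressor and the helper names `compressed_size` and `similarity_distance` are my assumptions, not something from the thread, and it is not how MOSS works.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return c(X): the size in bytes of X after compression.

    zlib at level 9 is an assumption made for this sketch; the quoted
    message does not name a specific compressor.
    """
    return len(zlib.compress(data, 9))


def similarity_distance(a: bytes, b: bytes) -> float:
    """Compute d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max(c(A), c(B)).

    Values near 0 suggest the inputs share most of their information;
    values near 1 suggest they share little.
    """
    c_a = compressed_size(a)
    c_b = compressed_size(b)
    c_ab = compressed_size(a + b)  # compress the concatenation AB
    return 1.0 - (c_a + c_b - c_ab) / max(c_a, c_b)
```

To compare two source files, read them in binary mode and pass the bytes to `similarity_distance`. One caveat with this particular compressor: deflate only looks back 32 KB, so for larger files the second half of AB cannot refer back to most of A and the distance will be overestimated. A compressor with a larger window would presumably behave better there.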