Ocaml clone detector
Date: 2009-09-03 (23:13)
From: John Clements <aoeucaml@b...>
Subject: Re: [Caml-list] Ocaml clone detector

On Sep 3, 2009, at 8:06 AM, Nicolas barnier wrote:

> An amazing and simple technology to detect plagiarism is
> compression-based similarity distance. It is a side-effect
> of state-of-the-art compression algorithms that can be used
> to compute a distance for many kinds of documents (it seems
> to work at least for program sources, books, music, DNA, etc.):
> take any two files A and B, compress A, compress B, and compress
> the concatenation of A and B, i.e. AB; take the size of these
> compressed files c(A), c(B) and c(AB); the similarity distance
> is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)).
> Indeed, if documents A and B share information, the compression
> of AB will be much shorter than c(A) + c(B).
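
The distance quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: zlib (from the Python standard library) stands in for the "state-of-the-art" compressor, which the message does not name, and the function names compressed_size and similarity_distance are made up for this example. Note that zlib's DEFLATE window is 32 KiB, so shared content in larger files may go undetected; a compressor with a larger window would probably do better there.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return c(X): the size in bytes of the zlib-compressed input."""
    # Level 9 is zlib's strongest setting; the choice of zlib is an assumption.
    return len(zlib.compress(data, 9))


def similarity_distance(a: bytes, b: bytes) -> float:
    """d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max(c(A), c(B))."""
    c_a = compressed_size(a)
    c_b = compressed_size(b)
    c_ab = compressed_size(a + b)  # compress the concatenation AB
    return 1 - (c_a + c_b - c_ab) / max(c_a, c_b)


if __name__ == "__main__":
    # Hypothetical inputs: two near-identical OCaml snippets and an unrelated one.
    original = b"let rec fact n = if n <= 1 then 1 else n * fact (n - 1)\n" * 20
    renamed = original.replace(b"fact", b"factorial")
    unrelated = b"The quick brown fox jumps over the lazy dog.\n" * 20

    print("original vs renamed  :", similarity_distance(original, renamed))
    print("original vs unrelated:", similarity_distance(original, unrelated))
```

Values near 0 suggest that the two inputs share most of their information, and values near 1 suggest that they share little. On very short inputs the compressor's fixed overhead dominates, so the numbers are only meaningful for files of reasonable size.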

Also see Alex Aiken's "MOSS" (Measure Of Software Similarity). It's
an online service, language-specific, and works for a variety of
languages. I don't know how its algorithm compares to the one here.
I suspect it's different, insofar as the one you describe is
language-independent.

John Clements