English version
Accueil     À propos     Téléchargement     Ressources     Contactez-nous    

Ce site est rarement mis à jour. Pour les informations les plus récentes, rendez-vous sur le nouveau site OCaml à l'adresse ocaml.org.

Browse thread
Ocaml clone detector
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: 2009-09-03 (23:13)
From: John Clements <aoeucaml@b...>
Subject: Re: [Caml-list] Ocaml clone detector

On Sep 3, 2009, at 8:06 AM, Nicolas barnier wrote:

> An amazing and simple technology to detect plagiarism is
> compression-based similarity distance. It is a side-effect
> of state-of-the-art compression algorithms that can be used
> to compute a distance for many kind of documents (it seems
> to work at least for program sources, books, music, DNA etc):
> take any two files A and B, compress A, compress B, and compress
> the concatenation of A and B, i.e. AB; take the size of these
> compressed files c(A), c(B) and c(AB); the similarity distance
> is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)).
> Indeed, if documents A and B share information, the compression
> of AB will be much shorter than c(A) + c(B).

Also see Alex Aiken's "MOSS" (measure of software similarity).  It's  
online, language-specific, works for a variety of languages.  Don't  
know how its algorithm compares to the one here. I suspect it's  
different insofar the one you describe is language-independent.

John Clements