Ocaml clone detector
Date: 2009-09-03 (15:06)
From: Nicolas barnier <barnier@r...>
Subject: Re: [Caml-list] Ocaml clone detector
A remarkably simple technique for detecting plagiarism is the
compression-based similarity distance. It is a side effect of
state-of-the-art compression algorithms and can be used to compute
a distance for many kinds of documents (it seems to work at least
for program sources, books, music, DNA, etc.): take any two files
A and B; compress A, compress B, and compress the concatenation of
A and B, i.e. AB; take the sizes of these compressed files, c(A),
c(B) and c(AB); the similarity distance is simply

d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)).

Indeed, if documents A and B share information, c(AB) will be much
smaller than c(A) + c(B), so d(A,B) will be close to 0; if they
share nothing, c(AB) is about c(A) + c(B) and d(A,B) is close to 1.
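
To make the formula concrete, here is a minimal sketch in Python
(not the OCaml code mentioned in this message). The function names
compressed_size and similarity_distance are made up for the
illustration, and c(X) is assumed to be the byte length of the
output of the standard bz2 or gzip modules:

```python
import bz2
import gzip


def compressed_size(data: bytes, algo: str = "bz2") -> int:
    """Return c(X): the size in bytes of `data` once compressed."""
    if algo == "bz2":
        return len(bz2.compress(data))
    if algo == "gzip":
        return len(gzip.compress(data))
    raise ValueError(f"unknown algorithm: {algo}")


def similarity_distance(a: bytes, b: bytes, algo: str = "bz2") -> float:
    """d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max(c(A), c(B))."""
    ca = compressed_size(a, algo)
    cb = compressed_size(b, algo)
    cab = compressed_size(a + b, algo)  # AB is the plain concatenation
    return 1.0 - (ca + cb - cab) / max(ca, cb)
```

Calling similarity_distance(x, x) on some file contents x should
give a value well below what two unrelated inputs give; the exact
numbers depend on the compressor and on the size of the inputs.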

A good article (in French unfortunately) can be found at:


where a link points to "Baldr", a free Java application written
by a French CS professor that computes this distance pairwise for a
set of source files and sorts the results. In practice, the
distance between two documents is much smaller in case of
plagiarism than for any other pair (even though good students tend
to produce similar code for the same exam).

I wrote a small OCaml program (baldml) to perform the same task
(but without GUI):

./baldml.opt -algo bz2 -regexp ".+ml$" -n 3 dir

where you can choose the compression algorithm (bz2 or gzip),
specify a Str-style regexp to filter the files (OCaml files in the
example, but I use it for C exams as well), and set the number of
lines of sorted output you want among the n(n-1)/2 unique pairs of
the n matching files found recursively in the directory "dir".
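
The workflow described above can be sketched as follows, again in
Python and not as the actual baldml source. The names
matching_files and closest_pairs are hypothetical, the regexp uses
Python's re syntax rather than Str's, it is matched against the
file name only (an assumption), and bz2 is the only compressor
wired in:

```python
import bz2
import itertools
import os
import re


def matching_files(root: str, pattern: str):
    """Yield the paths under `root` (recursively) whose file name matches."""
    rx = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if rx.search(name):
                yield os.path.join(dirpath, name)


def closest_pairs(root: str, pattern: str, n: int):
    """Return the n closest pairs as (distance, path1, path2) tuples."""
    contents = {}
    for path in matching_files(root, pattern):
        with open(path, "rb") as f:
            contents[path] = f.read()
    # Compress each file once; only the concatenations differ per pair.
    sizes = {p: len(bz2.compress(data)) for p, data in contents.items()}

    pairs = []
    for p, q in itertools.combinations(sorted(contents), 2):
        cab = len(bz2.compress(contents[p] + contents[q]))
        d = 1.0 - (sizes[p] + sizes[q] - cab) / max(sizes[p], sizes[q])
        pairs.append((d, p, q))
    pairs.sort()  # smallest distance (most suspicious pair) first
    return pairs[:n]
```

With this sketch, closest_pairs("dir", r".+ml$", 3) plays the role
of the command line shown above.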

I can provide the code (about 100 lines) if anyone is interested.

Hope this helps,

-- Nicolas