Correct way of programming a CGI script
Date: -- (:)
From: Julien Moutinho <julien.moutinho@g...>
Subject: Warning on home-made functions dealing with UTF-8.
On Fri, Oct 12, 2007 at 12:48:16AM +1000, skaller wrote:
> On Thu, 2007-10-11 at 16:21 +0200, Vincent Hanquez wrote:
> > On Thu, Oct 11, 2007 at 11:54:24PM +1000, skaller wrote:
> > > You can't: Camomile is massive for a reason.. the problem it
> > > aims to solve is complex and hard to do efficiently without
> > > a large set of specialised functions.
> > 
> > You are assuming that i want efficiency where i want to print few
> > unicode string in an ui here and there. I *DON'T* want to be exposed to
> > full unicode, i need something like 1/100 of camomile library.
> 
> In that case, you can use an int Array.t for Unicode provided 
> it is only 31 bit OR you have a 64 bit machine. These routines 
> should help converting to and from UTF-8:
> [...]

In case anyone wants to use this parse_utf8: be aware that,
depending on how much you trust your input,
doing so may be strongly discouraged.
The code does not check comprehensively for invalid byte sequences.

E.g., it accepts characters encoded in an overlong form [1]:

# let mk = List.fold_left
    (fun acc c -> acc ^ String.make 1 (Char.chr c)) "";;
val mk : int list -> string = <fun>
# let p l = parse_utf8 (mk l) 0;;
val p : int list -> int * int = <fun>

(* Unicode 0 encoded in an overlong UTF-8 form *)
# p [0b11_000000; 0b10_000000];;
- : int * int = (0, 2)

Nor does it check for invalid trailing bytes:

(* Unicode 64 (@) with an invalid trailing byte,
 * which happens to be a zero *)
# p [0b11_000001; 0b00_000000];;
- : int * int = (64, 2)
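
Both inputs above can be rejected by a stricter decoder. What follows is
only a minimal sketch, not the quoted parse_utf8: the name
decode_utf8_strict and the option-returning interface are assumptions made
for illustration. It checks that every trailing byte has the form 10xxxxxx,
refuses overlong forms by comparing the decoded value with the smallest
code point that needs that many bytes, and refuses surrogates and values
above 0x10FFFF.

```ocaml
(* Hypothetical strict decoder: returns Some (code_point, next_index),
   or None if the bytes starting at index i are not well-formed UTF-8. *)
let decode_utf8_strict (s : string) (i : int) : (int * int) option =
  let len = String.length s in
  (* Payload bits of a continuation byte 10xxxxxx at index k, if any. *)
  let cont k =
    if k < len then
      let b = Char.code s.[k] in
      if b land 0xC0 = 0x80 then Some (b land 0x3F) else None
    else None
  in
  (* Read n continuation bytes starting at index k, accumulating into acc. *)
  let rec tail acc k n =
    if n = 0 then Some acc
    else
      match cont k with
      | None -> None
      | Some bits -> tail ((acc lsl 6) lor bits) (k + 1) (n - 1)
  in
  if i < 0 || i >= len then None
  else
    let b0 = Char.code s.[i] in
    (* (n, lowest, init): number of continuation bytes, smallest code point
       that legitimately needs n + 1 bytes, payload bits of the lead byte. *)
    let shape =
      if b0 < 0x80 then Some (0, 0x0, b0)
      else if b0 land 0xE0 = 0xC0 then Some (1, 0x80, b0 land 0x1F)
      else if b0 land 0xF0 = 0xE0 then Some (2, 0x800, b0 land 0x0F)
      else if b0 land 0xF8 = 0xF0 then Some (3, 0x10000, b0 land 0x07)
      else None (* stray continuation byte, or a 5- or 6-byte lead *)
    in
    match shape with
    | None -> None
    | Some (n, lowest, init) ->
      match tail init (i + 1) n with
      | None -> None
      | Some cp ->
        if cp < lowest then None                       (* overlong form *)
        else if cp > 0x10FFFF then None                (* beyond Unicode *)
        else if cp >= 0xD800 && cp <= 0xDFFF then None (* surrogate *)
        else Some (cp, i + 1 + n)
```

With this sketch, the two byte sequences shown above both yield None,
while a plain "@" yields Some (64, 1).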

Besides, a Unicode value "now" needs only 21 bits,
and "therefore" a UTF-8 character fits in at most 4 bytes,
not the 6 that the code handles.
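
To make the 4-byte bound concrete, here is a small encoder sketch. The name
encode_utf8 and its option result are assumptions for illustration, not
part of the quoted code. The largest code point, 0x10FFFF, is below 2^21,
and the 4-byte form carries 3 + 6 + 6 + 6 = 21 payload bits, so no longer
form is ever needed.

```ocaml
(* Hypothetical encoder: the UTF-8 bytes of a Unicode scalar value, or None
   for surrogates and for values outside 0 .. 0x10FFFF. *)
let encode_utf8 (cp : int) : string option =
  let byte n = Char.chr n in
  (* Continuation byte 10xxxxxx carrying the 6 bits of cp found at [shift]. *)
  let cont shift = byte (0x80 lor ((cp lsr shift) land 0x3F)) in
  let of_chars l = String.concat "" (List.map (String.make 1) l) in
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then None
  else if cp < 0x80 then
    Some (of_chars [byte cp])
  else if cp < 0x800 then
    Some (of_chars [byte (0xC0 lor (cp lsr 6)); cont 0])
  else if cp < 0x10000 then
    Some (of_chars [byte (0xE0 lor (cp lsr 12)); cont 6; cont 0])
  else
    Some (of_chars [byte (0xF0 lor (cp lsr 18)); cont 12; cont 6; cont 0])
```

Every accepted value, up to and including 0x10FFFF, comes out as a string
of 1 to 4 bytes.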

[1] http://en.wikipedia.org/wiki/UTF-8#Overlong_forms.2C_invalid_input.2C_and_security_considerations