Date: -- (:)
From: skaller <skaller@u...>
Subject: Re: [Caml-list] Re: Warning on home-made functions dealing with UTF-8.

On Tue, 2007-10-16 at 20:46 +0200, Julien Moutinho wrote:
> Here, I have reused some old code of mine to secure and extend J.Skaller's:
>   unicode_of_utf8 ~ parse_utf8
>   utf8_of_unicode ~ utf8_of_int
> May it help, and may it not be too bugged.

The UTF-8 to UCS4 string conversion probably needs an option NOT
to throw exceptions. Most uses are poorly served by throwing
exceptions.

* Exceptions are a very bad idea in the first place :)

* An alternative strategy using a replacement code may be
worth considering, possibly using an argument, and possibly
by another technique such as a wrapper function which
continues on after inserting the replacement.
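
As a rough sketch of the replacement-code strategy (not Julien's code and
not mine from earlier in the thread): the hypothetical function decode_utf8
below takes the replacement as an optional argument, defaulting to U+FFFD,
and never raises. It assumes the modern 1 to 4 byte forms only and does not
check for overlong sequences or surrogates.

```ocaml
(* Hypothetical sketch: decode UTF-8 into a list of code points, putting
   [replacement] in place of each malformed byte instead of raising. *)
let decode_utf8 ?(replacement = 0xFFFD) (s : string) : int list =
  let n = String.length s in
  (* Fold [count] continuation bytes starting at index [i] into [acc].
     Returns None if the input ends early or a byte is not 10xxxxxx. *)
  let rec continuation i count acc =
    if count = 0 then Some acc
    else if i >= n then None
    else
      let b = Char.code s.[i] in
      if b land 0xC0 <> 0x80 then None
      else continuation (i + 1) (count - 1) ((acc lsl 6) lor (b land 0x3F))
  in
  let rec loop i out =
    if i >= n then List.rev out
    else
      let b = Char.code s.[i] in
      (* Number of continuation bytes and payload bits of the lead byte;
         a count of -1 marks a byte that cannot start a sequence. *)
      let count, init =
        if b < 0x80 then (0, b)
        else if b land 0xE0 = 0xC0 then (1, b land 0x1F)
        else if b land 0xF0 = 0xE0 then (2, b land 0x0F)
        else if b land 0xF8 = 0xF0 then (3, b land 0x07)
        else (-1, 0)
      in
      if count < 0 then loop (i + 1) (replacement :: out)
      else
        match continuation (i + 1) count init with
        | Some cp -> loop (i + 1 + count) (cp :: out)
        (* Malformed sequence: emit the replacement, resume at the next byte. *)
        | None -> loop (i + 1) (replacement :: out)
  in
  loop 0 []
```

A caller who prefers '?' can write decode_utf8 ~replacement:0x3F s. The
wrapper-function variant would look much the same, with the decision about
what to insert moved out to the caller.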

I think you will find that covering all the real uses isn't
so easy, which is one reason a whole framework like Camomile exists.

In particular, most Standards will say things like "behaviour
is undefined if such and such", for example on an invalid code.

This does NOT mean using such a code is an error, it means
the Standard leaves the behaviour open in such cases.
In particular, open to vendor extensions.

It's perfectly legitimate, for example, for an application
to encode colour and font information in the 31-21=10 remaining
bits**, but your codec will throw an exception here, instead
of just translating the codes. What the application is doing
isn't portable -- but that doesn't make it incorrect if the
application has control over the context.
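
To make the 31-bit point concrete, here is a minimal sketch of an encoder
that just translates any value from 0 to 0x7FFFFFFF using the original
UTF-8 scheme of up to 6 bytes (the form described in RFC 2279, before the
range was cut back to 21 bits). The name utf8_of_int31 is hypothetical, the
result is an option rather than an exception, and the range check assumes a
64-bit OCaml where int is wide enough to hold 0x7FFFFFFF.

```ocaml
(* Hypothetical sketch: encode a value of up to 31 bits in the original
   1 to 6 byte UTF-8 scheme. Returns None for values outside 0..0x7FFFFFFF.
   Assumes 64-bit OCaml ints. *)
let utf8_of_int31 (c : int) : string option =
  if c < 0 || c > 0x7FFFFFFF then None
  else if c < 0x80 then Some (String.make 1 (Char.chr c))
  else begin
    (* Number of continuation bytes needed for this value. *)
    let count =
      if c < 0x800 then 1
      else if c < 0x10000 then 2
      else if c < 0x200000 then 3
      else if c < 0x4000000 then 4
      else 5
    in
    let b = Bytes.create (count + 1) in
    (* Lead byte: (count + 1) one bits, a zero bit, then the top payload bits. *)
    let prefix = (0xFF lsl (7 - count)) land 0xFF in
    Bytes.set b 0 (Char.chr (prefix lor (c lsr (6 * count))));
    (* Continuation bytes: 10xxxxxx, six payload bits each, high bits first. *)
    for k = 1 to count do
      let shift = 6 * (count - k) in
      Bytes.set b k (Char.chr (0x80 lor ((c lsr shift) land 0x3F)))
    done;
    Some (Bytes.to_string b)
  end
```

Values above 0x10FFFF come out as 4, 5 or 6 byte sequences that a strict
codec will reject, which is exactly the portability trade-off described
above.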

In particular, the application can even define an extension
to the Standard and send such codes over file systems and
networks. Other applications may decide to support the
extensions. 

In fact this is how Standards are made: the ISO mantra is that
standards should encode existing practice, and that specifically
implies ALLOWING practice outside the existing Standards.

So my routine wasn't quite so 'home grown' .. I've actually
been a participant in ISO Standardisation processes with a
special National Body interest in I18n issues (since Australia
is highly multi-cultural and has people speaking many languages).
I18n is a real quagmire of complexity... for example there are
no known commercial text rendering routines that actually
comply with the Standard -- bidirectional rendering is
extremely difficult to do efficiently and it isn't clear that
the Standard requirements are all that useful if you happen
to be mixing English, Arabic, and Chinese in the same document.

** there is a real use of the extra bits by some Egyptologists
encoding hieroglyphics.

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net