[Caml-list] Bug with really_input under cygwin
Date: 2004-03-10 (03:03)
From: skaller <skaller@u...>
Subject: Re: [Caml-list] Bug with really_input under cygwin
On Wed, 2004-03-10 at 09:30, Eric Dahlman wrote:
> Howdy all,
> I have some code which reads in a whole file and returns it as a 
> string.  

The only correct way to do this is to read a block at a time
until you get a partial block.

This is so EVEN in 'binary' mode, which is just another
ill-conceived Unix hack :-)
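
A minimal sketch of the block-at-a-time approach in OCaml, offered as an
illustration rather than a definitive implementation. The names read_all
and block_size are made up for this example, and Fun.protect assumes
OCaml 4.08 or later. One caveat: OCaml's input may return fewer bytes than
requested before the end of the file, so this loop stops on a read of zero
bytes rather than on the first short block.

```ocaml
(* Illustrative block size; any positive value works. *)
let block_size = 4096

(* Read the whole file at [path] as raw bytes, one block at a time.
   [read_all] is a hypothetical helper, not a standard library function. *)
let read_all (path : string) : string =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in_noerr ic)
    (fun () ->
      let chunk = Bytes.create block_size in
      let acc = Buffer.create block_size in
      let rec loop () =
        (* [input] returns 0 only at end of file; a short but
           non-zero read does not by itself mean the end. *)
        let n = input ic chunk 0 block_size in
        if n > 0 then begin
          Buffer.add_subbytes acc chunk 0 n;
          loop ()
        end
      in
      loop ();
      Buffer.contents acc)
```

Nothing here asks the system how long the file is; the length is simply
whatever number of bytes was actually delivered.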

Generally speaking, every output method should specify
a retrieval method or two, and you will only get well
defined results if you use the specified retrieval method.

It is unfortunate that C and Unix do not provide a coherent
abstraction in this area. Even binary I/O is ill-conceived:
who says the bytes get written in order and read in the
same order? What if one channel is opened in 16 bit word
mode, and the other 8 bit mode?

C has been plagued by extremely ill-considered functions.
Even the basic I/O operation is not correctly defined.
In particular the function putc(int) is an invalid specification.
What happens if int and char are the same size and you have
1's complement encoding?

The bottom line is: if you wrote the file yourself,
there should be no problem. Just use BASIC I/O operations.
Functions like 'in_channel_length' are not properly defined
in the OCaml manual and therefore should not be used.

There is no such thing as 'the number of characters
in a file'. Perhaps there is a number of bytes in a file.
Perhaps, using some decoding technique there is a well
defined number of Unicode/ISO-10646 code points.

In CP/M, whose conventions MS-DOS inherited, files *always*
consist of a whole number of 128 byte records. It is impossible
to have a file whose size is not a multiple of 128 bytes. Of course,
text files use an encoding with a Ctrl-Z at the end. So the length
of the file 'in bytes' is not the same as the length
of the file 'in Latin-1'. The number of lines in the
file isn't well defined: CR/LF marks end of line,
but what happens if the CR and LF are scattered randomly?

Under Linux, the standard text encoding is UTF-8.
So 'characters' <> bytes unless the text is in the ASCII
subset. Even that is not clear, since if you get a 
code point 0 (NUL) some C functions will give
a misleading result, for example fgets().
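
To make the bytes versus code points distinction concrete, here is a small
OCaml sketch. It assumes the input is already valid UTF-8 and does no
validation; count_code_points is a name invented for this example.

```ocaml
(* Count UTF-8 code points by counting every byte that is not a
   continuation byte (bit pattern 10xxxxxx). Assumes valid UTF-8;
   [count_code_points] is a hypothetical helper. *)
let count_code_points (s : string) : int =
  let n = ref 0 in
  String.iter
    (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n)
    s;
  !n
```

For the string "h\xc3\xa9llo" (the word "héllo" encoded in UTF-8),
String.length reports 6 bytes while count_code_points reports 5.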

I personally believe the easiest way to work around this
quagmire of malspecification is to 

(a) ONLY use 8 bit binary I/O
(b) ALWAYS read and write bytes

even if you're processing text. Never depend on the
language or OS conversion functions, it's very unlikely
they'll be right. Do all the conversions needed yourself.
At least when you find a problem you're not handling
correctly you can fix it.
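
As one example of doing a conversion yourself on bytes read in binary mode,
the sketch below maps CR/LF pairs and lone CRs to a single LF. The policy
(treating a lone CR as a line end) and the name normalize_newlines are
assumptions made for illustration; pick whatever rule matches the files you
actually write.

```ocaml
(* Convert CR/LF and lone CR to LF in a string of raw bytes.
   [normalize_newlines] is a hypothetical helper; the choice to treat
   a lone CR as a line end is an assumption of this sketch. *)
let normalize_newlines (s : string) : string =
  let len = String.length s in
  let out = Buffer.create len in
  let rec go i =
    if i < len then
      match s.[i] with
      | '\r' ->
        Buffer.add_char out '\n';
        (* Skip the LF of a CR/LF pair so it is not emitted twice. *)
        if i + 1 < len && s.[i + 1] = '\n' then go (i + 2) else go (i + 1)
      | c ->
        Buffer.add_char out c;
        go (i + 1)
  in
  go 0;
  Buffer.contents out
```

Because the rule lives in your own code, a file that breaks it can be
handled by changing one match arm.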

John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net
