[Caml-list] [ANN] The Missing Library
From: Yamagata Yoriyuki <yoriyuki@m...>
Subject: Re: [Caml-list] Re: Common IO structure
From: John Goerzen <jgoerzen@complete.org>
Subject: Re: [Caml-list] Re: Common IO structure
Date: Thu, 29 Apr 2004 09:02:40 -0500

> On Thu, Apr 29, 2004 at 10:40:36PM +0900, Yamagata Yoriyuki wrote:
> > > > > > OK, but then you can leave out readline(), readlines() and xreadlines(), 
> > > > > > because they don't make any sense unless you've already dealt with 
> > > > > > character encodings.
> > > > > 
> > > > > No, they can simply be implemented in terms of read().
> > > > 
> > > > It will break when UTF-16/UTF-32 are used.  The line separator should
> > > > be handled after code conversion.  At least that is the idea of
> > > > the Unicode standard.  (But since the Unicode standard is challenged by
> > > > reality in every aspect, maybe nobody cares.)
> > > 
> > > You are missing the point.  read() could handle the code conversion.
> > 
> > No, what I wanted to say is that the line separator should be handled
> > at the Unicode level, not at the byte-character level.  Your design
> > assumes read() always returns newline characters as in ASCII.  This
> > would not hold when read() returns UTF-16/UTF-32.
> 
> I don't see why that is the case.  If read() returns UTF-16 data,
> readlines() works with it, and would of course be scanning it for a
> UTF-16 EOL character or string.  I don't see where that's the problem.

An encoding can be stateful, in which case there is no single byte
representation of EOL. (*)  OK, this is a very unlikely case at present,
but I think there is an interesting encoding for Unicode which is fully
stateful.  So readlines() needs to be fully aware of the encoding.
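
To illustrate why line splitting has to happen after decoding, here is a
minimal OCaml sketch.  All function names are hypothetical, and it only
handles the BMP (surrogate pairs are not interpreted).  It compares a
byte-level scan for 0x0A with a split on decoded UTF-16 code units.

```ocaml
(* Hypothetical helpers for illustration only; BMP only. *)

(* Count raw 0x0A bytes, as a byte-level readlines() would see them. *)
let count_lf_bytes (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if c = '\n' then incr n) s;
  !n

(* Decode a UTF-16 byte string into 16-bit code units. *)
let code_units ~big_endian (s : string) : int list =
  List.init (String.length s / 2) (fun i ->
    let a = Char.code s.[2 * i] and b = Char.code s.[2 * i + 1] in
    if big_endian then (a lsl 8) lor b else (b lsl 8) lor a)

(* Split code units into lines at U+000A, i.e. after decoding. *)
let split_lines (units : int list) : int list list =
  let rec go cur acc = function
    | [] -> List.rev (List.rev cur :: acc)
    | 0x0A :: rest -> go [] (List.rev cur :: acc) rest
    | u :: rest -> go (u :: cur) acc rest
  in
  go [] [] units

(* "A", LINE FEED, U+010A, encoded as UTF-16BE. *)
let sample = "\x00\x41\x00\x0A\x01\x0A"
```

The byte-level scan sees two 0x0A bytes in sample, because the low byte of
U+010A is also 0x0A, while the decoded text contains exactly one line
separator.  The same newline is the byte pair 00 0A in UTF-16BE and 0A 00
in UTF-16LE, so the splitter cannot work without knowing the encoding.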

My proposal is mainly about sharing common channel types among
libraries, so that a user can pass a channel from one library to
another without writing glue code.  Since parsing line endings or
loading the whole file into a string mainly happens at the endpoint
of IO, I do not think standardizing them is necessary for this
purpose.
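
As a rough sketch of what sharing a channel type could look like in OCaml:
the class type and names below are hypothetical and are not the interface
actually proposed in this thread.  One "library" produces a channel, and
another consumes it knowing only the shared class type.

```ocaml
(* Hypothetical shared interface; an assumption for illustration only. *)
class type input_channel = object
  (* input buf ofs len reads up to len bytes into buf at ofs and returns
     the number read; raises End_of_file when nothing is left. *)
  method input : Bytes.t -> int -> int -> int
  method close_in : unit -> unit
end

(* "Library A": a channel that reads from an in-memory string. *)
class string_in (s : string) : input_channel = object
  val mutable pos = 0
  method input buf ofs len =
    let avail = String.length s - pos in
    if avail = 0 && len > 0 then raise End_of_file;
    let n = min len avail in
    Bytes.blit_string s pos buf ofs n;
    pos <- pos + n;
    n
  method close_in () = ()
end

(* "Library B": an endpoint that only depends on the class type. *)
let read_all (ch : input_channel) : string =
  let buf = Bytes.create 4 in
  let out = Buffer.create 16 in
  (try
     while true do
       let n = ch#input buf 0 (Bytes.length buf) in
       Buffer.add_subbytes out buf 0 n
     done
   with End_of_file -> ());
  ch#close_in ();
  Buffer.contents out
```

Here read_all plays the role of an endpoint function: it lives in one
library and works with any channel satisfying the class type, so it does
not itself need to be part of the shared interface.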

I do not think standardizing the endpoint API is important, because I
think that in the end, we will use only one library as the endpoint of
IO.

(*) IIRC, the RFC specifies that the endianness of UTF-16 is swapped in
the middle of the stream when the "BOM" 0xfffe appears.
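
A minimal sketch of BOM handling, assuming only the well-established case
of a BOM at the very start of the stream (it does not model a swap in the
middle of the stream).  The type and function names are hypothetical.

```ocaml
(* Hypothetical helper: inspect the first two bytes for a UTF-16 BOM. *)
type endian = Big | Little | Unknown

(* Returns the detected byte order and the number of bytes to skip.
   U+FEFF is encoded as FE FF in big-endian and FF FE in little-endian. *)
let detect_bom (s : string) : endian * int =
  if String.length s >= 2 then
    match s.[0], s.[1] with
    | '\xFE', '\xFF' -> (Big, 2)
    | '\xFF', '\xFE' -> (Little, 2)
    | _ -> (Unknown, 0)
  else (Unknown, 0)
```

A decoder that reads U+FEFF with the wrong byte order sees the value 0xFFFE,
which is how a reader can tell that its assumed endianness is wrong.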

--
Yamagata Yoriyuki
