Version française
Home     About     Download     Resources     Contact us    
Browse thread
Subsequence references or substrings in OCaml
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Christian Lindig <lindig@i...>
Subject: Re: Subsequence references or substrings in OCaml
> Its simple enough to implement subsequence references as a user defined
> type in OCaml, as I've done, but I am curious about whether anyone else
> who has used similar libraries would find a built-in substring or 
> subsequence ref library useful. 

[sorry, no french version]

String processing includes scanning and splitting strings and is often
done using regular expressions.  Inspired by parser combinators I have
written a light weight library of `lexer combinators' which uses an
efficient internal representation of substrings:  a substring is a
(int * int) pair.  The first element denotes the index of the first
character and the second the length of the substring.  The interface
of the module is attached - mail me for the code.

-- Christian

-- 
Christian Lindig    Technische Universitaet Braunschweig, Germany
                    http://www.cs.tu-bs.de/softech/people/lindig
                    mailto:lindig@ips.cs.tu-bs.de

(*  ------------------------------------------------------------------ 

    $Id: lc.mli,v 1.3 1998/11/24 19:33:17 lindig Exp $
    
    This module provides lexer combinators (LCs). They can be used for
    similar purposes as regular expressions: verifying that a string
    obeys a specified syntax and extracting parts from it.  LCs are
    not as powerful as regular expression (REs) but do not require
    preprocessing like REs.
  
    The implementation aims to be reasonable efficient by using
    indices into the scanned string during the whole scan. However,
    scanning of long strings (> 1000 chars) should be avoided.       
    ------------------------------------------------------------------ *)
   
exception Error of string
(* [Error] reports the failure of a lexer with a message which is not
   very descriptive but may help finding unexpected failures *)

type region = int * int
(* While a string is scanned so-called regions (substrings) from it
   can be saved to extract them later.  A region consists of the index
   of its first letter and its length; the index of a string's first
   character is 0:

        The region (5,3) of "Objective Caml" is "tiv".
                             012345
                                  123 
                                  
   A region is always relative to a string *)

type lexer

(* A lexer scans a string to check that it matches a syntactic
   structure. It fails when the string does not match the
   expected syntactic structure *)

val succeed : lexer
(* [succeed] is a lexer that always succeeds. *) 

val fail : string -> 'a
(* [fail] always fails with a descriptve error message *)

val any : lexer
(* [any] consumes just one character when availabe and fails when the
   end of string is reached *)

val eof : lexer
(* [eof] succeeds when the end of input (i.e.  string) is reached.  It
   does not consume any character *)

val satisfy : (char -> bool) -> lexer
(* [satisfy f] consumes the next characters successfully when it
   satisfies predicate f *)

val chr : char -> lexer
(* [chr c] consumes the next character when it equals [c]; fails
   otherwise *)

val str : string -> lexer
(* [str s] succeeds when the input starts with string [s] - in this case
   it consumes as many characters as [s] is long. *)

val ( *** ) : lexer -> lexer -> lexer
(* [x *** y] is a lexer that first uses lexer [x] and then scans the
   remaining input using lexer [y]. It fails when any of [x] and [y] fail
   *)

val ( ||| ) : lexer -> lexer -> lexer
(* [x ||| y] scans the input using lexer [x]; if [x] fails it the
   input again using [y]. The lexer fails when [y] fails. *)

val many : lexer -> lexer
(* [many l] scans the input using lexer [l] as many (including zero) times as
   possible; always succeeds. Beware: [many] is greedy; i.e. [many
   any *** chr 'x'] will always fail because the `x' will be consumed
   by [many any]. *)

val some : lexer -> lexer
(* [some l] scans the input using lexer one or more times. [some] is
   greedy -- see also [many] *)

val opt : lexer -> lexer
(* [opt l] scans the input using lexer [l] when possible; when [l]
   fails the lexer does not consume input and succeeds. *)

val save : lexer -> lexer
(* [save l] scans the input using lexer l and stores the region l
   successfully scanned into the region list. It can later be used to
   extract the substring lexer [l] has sucessfully scanned. The lexer fails
   whenever [l] fails. The resulting region list is sorted by the start
   points of its regions: 

   Given a string where the following 4 regions are saved (regions can
   be nested):
        
       (1  (2   2)  (3  (4 4) 3) 1)

   In the resulting region list the order of regions will be: 
   
        1 2 3 4

   In the example above regions are marked by parentheses. Of couse
   they are still represented as (index, lengh) pairs as described
   above. 

*)

val extract : region list -> string -> string list
(* [extract l str] extracts all substrings from the region list [l] from
   string [str]. *)

val scan: string -> lexer -> (int * region list)
(* [scan str l] scans string [str] using lexer [l]. The result is a
   pair: the number of characters successfully consumed and the list of
   stored regions. [scan] raises [Error] when scanning fails. The regions
   from the region list can be passed to [extract] for extracting all
   regions as substrings from [str] *)

val scanAndExtract : string -> lexer -> string list
(* [scanAndExtract] scans using [lexer]. On successfull scan the list
   of saved substrings (see [save]) is returned. [Error] is raised when
   the string does not match *)