Re: Subsequence references or substrings in OCaml

From: Christian Lindig (lindig@ips.cs.tu-bs.de)
Date: Mon Dec 21 1998 - 11:39:17 MET


Date: Mon, 21 Dec 1998 11:39:17 +0100
From: Christian Lindig <lindig@ips.cs.tu-bs.de>
To: Brian Rogoff <bpr@best.com>
Subject: Re: Subsequence references or substrings in OCaml
In-Reply-To: <Pine.BSF.4.05.9812181044130.27244-100000@shell5.ba.best.com>; from Brian Rogoff on Fri, Dec 18, 1998 at 11:27:57AM -0800

> Its simple enough to implement subsequence references as a user defined
> type in OCaml, as I've done, but I am curious about whether anyone else
> who has used similar libraries would find a built-in substring or
> subsequence ref library useful.

[sorry, no french version]

String processing includes scanning and splitting strings and is often
done using regular expressions. Inspired by parser combinators I have
written a light weight library of `lexer combinators' which uses an
efficient internal representation of substrings: a substring is a
(int * int) pair. The first element denotes the index of the first
character and the second the length of the substring. The interface
of the module is attached - mail me for the code.

-- Christian

-- 
Christian Lindig    Technische Universitaet Braunschweig, Germany
                    http://www.cs.tu-bs.de/softech/people/lindig
                    mailto:lindig@ips.cs.tu-bs.de

(* ------------------------------------------------------------------

$Id: lc.mli,v 1.3 1998/11/24 19:33:17 lindig Exp $ This module provides lexer combinators (LCs). They can be used for similar purposes as regular expressions: verifying that a string obeys a specified syntax and extracting parts from it. LCs are not as powerful as regular expression (REs) but do not require preprocessing like REs. The implementation aims to be reasonable efficient by using indices into the scanned string during the whole scan. However, scanning of long strings (> 1000 chars) should be avoided. ------------------------------------------------------------------ *) exception Error of string (* [Error] reports the failure of a lexer with a message which is not very descriptive but may help finding unexpected failures *)

type region = int * int (* While a string is scanned so-called regions (substrings) from it can be saved to extract them later. A region consists of the index of its first letter and its length; the index of a string's first character is 0:

The region (5,3) of "Objective Caml" is "tiv". 012345 123 A region is always relative to a string *)

type lexer

(* A lexer scans a string to check that it matches a syntactic structure. It fails when the string does not match the expected syntactic structure *)

val succeed : lexer (* [succeed] is a lexer that always succeeds. *)

val fail : string -> 'a (* [fail] always fails with a descriptve error message *)

val any : lexer (* [any] consumes just one character when availabe and fails when the end of string is reached *)

val eof : lexer (* [eof] succeeds when the end of input (i.e. string) is reached. It does not consume any character *)

val satisfy : (char -> bool) -> lexer (* [satisfy f] consumes the next characters successfully when it satisfies predicate f *)

val chr : char -> lexer (* [chr c] consumes the next character when it equals [c]; fails otherwise *)

val str : string -> lexer (* [str s] succeeds when the input starts with string [s] - in this case it consumes as many characters as [s] is long. *)

val ( *** ) : lexer -> lexer -> lexer (* [x *** y] is a lexer that first uses lexer [x] and then scans the remaining input using lexer [y]. It fails when any of [x] and [y] fail *)

val ( ||| ) : lexer -> lexer -> lexer (* [x ||| y] scans the input using lexer [x]; if [x] fails it the input again using [y]. The lexer fails when [y] fails. *)

val many : lexer -> lexer (* [many l] scans the input using lexer [l] as many (including zero) times as possible; always succeeds. Beware: [many] is greedy; i.e. [many any *** chr 'x'] will always fail because the `x' will be consumed by [many any]. *)

val some : lexer -> lexer (* [some l] scans the input using lexer one or more times. [some] is greedy -- see also [many] *)

val opt : lexer -> lexer (* [opt l] scans the input using lexer [l] when possible; when [l] fails the lexer does not consume input and succeeds. *)

val save : lexer -> lexer (* [save l] scans the input using lexer l and stores the region l successfully scanned into the region list. It can later be used to extract the substring lexer [l] has sucessfully scanned. The lexer fails whenever [l] fails. The resulting region list is sorted by the start points of its regions:

Given a string where the following 4 regions are saved (regions can be nested): (1 (2 2) (3 (4 4) 3) 1)

In the resulting region list the order of regions will be: 1 2 3 4

In the example above regions are marked by parentheses. Of couse they are still represented as (index, lengh) pairs as described above.

*)

val extract : region list -> string -> string list (* [extract l str] extracts all substrings from the region list [l] from string [str]. *)

val scan: string -> lexer -> (int * region list) (* [scan str l] scans string [str] using lexer [l]. The result is a pair: the number of characters successfully consumed and the list of stored regions. [scan] raises [Error] when scanning fails. The regions from the region list can be passed to [extract] for extracting all regions as substrings from [str] *)

val scanAndExtract : string -> lexer -> string list (* [scanAndExtract] scans using [lexer]. On successfull scan the list of saved substrings (see [save]) is returned. [Error] is raised when the string does not match *)



This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:17 MET