Re: [Caml-list] Ocamlyacc vs stream parser
 Diego Olivier Fernandez Pons
Date: 2003-06-05 (14:03)
From: Diego Olivier Fernandez Pons <Diego.FERNANDEZ_PONS@e...>
Subject: Re: [Caml-list] Ocamlyacc vs stream parser
Bonjour,

Michal Moskal wrote:

> Oh, so by proof-by-book you're right :) But in practice LALR(1)
> seems more useful for parsing, at least for parsing programming
> languages.

It is easier to be right when one uses up-to-date books. Even if it is an excellent book, the Dragon Book is almost 20 years old, which in computer science is quite a lot. Anyway, that is not the main point. The question is: are LR parsers really more powerful / more difficult to implement / to hack / to understand... than LL parsers? And why?

There are three fundamental parts in a parser:
- the control automaton
- the disambiguation technique
- the non-determinism handling system

LL parsers are traditionally compiled to pushdown automata, whereas LR parsers usually use LR(0) automata. In the first case the stack is explicit and transitions have the form (q, E, t, E', q'), which means q -> q' is possible if (E, t) is the current stack * string; if the transition is taken, (E, t) is removed and (E', epsilon) is pushed. In the second case the stack is implicit and contains the list of the previously visited states: when a transition (q, t, q') is taken, q is pushed on the stack, and when a reduction by A -> abc takes place, 3 symbols are popped from the stack.

This may seem a big difference, but both stack automata are equivalent. You can transform one into the other, and thus write an LL parser with an LR(0) control automaton, and conversely a pushdown-automaton LR parser. In fact, the Dragon Book (chapter 4.4 of the French edition at least) shows an LL parser controlled by an LR(0) automaton. Wim Pijls wrote in 1993 a paper titled 'Unifying LL and LR parsing' in which he shows that 'traversing the LR(0) automaton in a special way is equivalent to LL(1) parsing'. (See also the more recent paper 'LR, LC and LL parsing, some new points of view'. All papers are available on CiteSeer.)

So LR and LL have the same (theoretical) power, namely the ability to parse algebraic (i.e. context-free) grammars. Of course... but conflicts?
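To make the explicit-stack formulation concrete, here is a minimal sketch (my own OCaml, not code from any of the cited papers) of an LL-style parse for the toy grammar S -> a S b | epsilon, where each step rewrites the top of an explicit symbol stack against the current input token:

```ocaml
(* Explicit-stack view of LL parsing, for the toy grammar
   S -> a S b | epsilon  (the language a^n b^n).  Each step rewrites
   the top of the symbol stack against the lookahead token. *)

type symbol = T of char      (* terminal *)
            | S              (* the single non-terminal *)

let parse (input : string) : bool =
  let n = String.length input in
  let rec step stack i =
    match stack with
    | [] -> i = n                                      (* accept iff input consumed *)
    | T c :: rest -> i < n && input.[i] = c && step rest (i + 1)
    | S :: rest ->
        if i < n && input.[i] = 'a'
        then step (T 'a' :: S :: T 'b' :: rest) i      (* expand S -> a S b *)
        else step rest i                               (* expand S -> epsilon *)
  in
  step [S] 0

let () =
  assert (parse "");
  assert (parse "aaabbb");
  assert (not (parse "aab"));
  print_endline "ok"
```

The implicit-stack LR formulation differs only in what the stack holds (visited states instead of pending symbols); as noted above, each can simulate the other.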
When you clearly separate the control automaton from the disambiguation mechanism, you notice that all the usual algorithms (LL(k), LR(k), LALR(k), SLR(k)) use the same principle: approximating algebraic languages by rational ones.

Let's take an LL conflict example. Sentence to parse: 'abaa'.

S -> aX | aY | cZ
X -> b...
Y -> c...
Z -> d...

Which one to choose, X or Y? For an LR algorithm, a shift/reduce conflict:

X -> a
Y -> aa

and a reduce/reduce conflict:

X -> a
Y -> a

Every non-terminal X in a grammar generates a context-free language L(X), which can be approximated (upper bound) by a rational language R(X). The idea is that checking whether a sentence can be derived in a rational language is linear, while it is cubic in the algebraic case.

(LL example)
L(X) is included in bE*
L(Y) is included in cE*

When your input is ab..., if it can be derived by the grammar at all, the only way is through X (S -> aX -> ab...).

Both LL(k) and LR(k) use a k-prefix rational approximation technique. That is why their tables occupy so much space (the cartesian product of the control automaton and a k-trie). LALR and SLR follow the same idea but merge some states according to given rules. So there is no reason for LR, LALR and SLR to be conceptually more difficult than LL.

Notice that TAG (Tree Adjoining Grammars) and HPSG (Head-driven Phrase Structure Grammar) try to compute more discriminating approximations using the properties of some particular terminals (named 'foot' in the former and 'head' in the latter).

The difference in power can be explained by a few simple arguments:
- LL(1) should be compared to LR(0), not to LR(1)
- top-down algorithms prune, bottom-up algorithms memorize; but the approximation is essentially a pruning technique, so LR takes more advantage of this orthogonality. The same power could be achieved for LL parsers if memoization were added (LL + CPS, ...).

Finally, the non-determinism handling question...
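The k-prefix approximation can be illustrated for k = 1. Here is a small sketch (my own code; the grammar encoding and names are mine) that computes FIRST sets by fixpoint, i.e. the rational upper bounds of the form cE* that let an LL(1) parser choose between alternatives on one token of lookahead:

```ocaml
(* FIRST(X) = the set of terminals that can begin a word of L(X),
   i.e. the k = 1 rational approximation of L(X).  Toy grammar from
   the example above: S -> aX | aY | cZ, X -> b, Y -> c, Z -> d
   (the '...' tails are elided). *)

module CSet = Set.Make (Char)

type sym = T of char | N of string     (* terminal / non-terminal *)

let grammar =
  [ ("S", [T 'a'; N "X"]); ("S", [T 'a'; N "Y"]); ("S", [T 'c'; N "Z"]);
    ("X", [T 'b']); ("Y", [T 'c']); ("Z", [T 'd']) ]

(* Fixpoint: no epsilon productions here, so FIRST of a production is
   just FIRST of its leading symbol. *)
let first_sets () =
  let first = Hashtbl.create 8 in
  let get n = try Hashtbl.find first n with Not_found -> CSet.empty in
  let changed = ref true in
  while !changed do
    changed := false;
    List.iter
      (fun (lhs, rhs) ->
        let f = match rhs with
          | T c :: _ -> CSet.singleton c
          | N n :: _ -> get n
          | [] -> CSet.empty
        in
        let merged = CSet.union (get lhs) f in
        if not (CSet.equal (get lhs) merged) then begin
          Hashtbl.replace first lhs merged; changed := true
        end)
      grammar
  done;
  get

let () =
  let first = first_sets () in
  assert (CSet.elements (first "S") = ['a'; 'c']);
  assert (CSet.elements (first "X") = ['b']);
  print_endline "ok"
```

On an input whose next token is 'c', FIRST tells the parser that only the S -> cZ alternative can possibly apply; the S -> aX vs aY conflict is exactly the case where the k = 1 approximation is not discriminating enough.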
There are some conflicts you just cannot solve (otherwise you would have proven rational = algebraic). There are two techniques: breadth-first search (GLR) and depth-first search (backtracking). Both require memoization so as not to compute the same thing several times. Of course, Masaru Tomita named his 1986 parsing technique (using ideas of Bernard Lang from 1974, themselves taken from Jay Earley, 1970) Generalized LR. But it could have been Generalized LL too, since graph traversals are generic algorithms.

Brian Rogoff wrote:

> I happen to think that recursive descent is the best way to write
> parsers, but note that recursive descent parsers are capable of
> parsing non-LL(1) grammars, even without the fairly obvious hacks.

Michal Moskal wrote:

> But keyword parser build with camlp4 libraries can be modified at
> runtime

The ease of implementation is another classical discussion. The parsing algorithm (ascending or descending) is orthogonal both to its implementation with recursive functions or with tables, and to the static or dynamic data structures used. Most people think 'huge tables' as soon as they hear LR(1). But this is only the case if you make the two following design choices:
- collapsing the control automaton and the disambiguation automata together
- representing graphs by tables

You can of course implement an LR(0) automaton with recursive functions (recursive ascent), using function calls and exceptions (or any equivalent technique...). LR(0) has in general quite a moderate number of states. You can represent all your graphs (stack or disambiguation automata) with purely functional data structures, leading to a parser which can be modified at runtime. Moreover, correctly separating the three parts of your parser allows you to use a different representation/optimization technique for each element: non-deterministic bit automata for disambiguation automata of fewer than 31 states, matrix representation for dense graphs, ... instead of one monolithic 'table compression' technique.
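Here is a sketch of recursive ascent (my own code, using the standard encoding; it is not taken from ocamlyacc or camlp4) for the LR(0) grammar E -> a | ( E ). Each LR(0) state is a function, the OCaml call stack plays the role of the LR stack, and a reduction by a production of length n simply returns through n stack frames:

```ocaml
(* Recursive ascent for  E -> a | ( E ).  A state function returns
   (p, rest): p > 0 means a reduction is still unwinding p more frames;
   p = 0 means "the reduced E lands here", so the receiver performs
   its goto on E. *)

exception Reject

let rec state0 input =            (* items: S' -> .E ; E -> .a ; E -> .(E) *)
  match shift input with
  | (0, []) -> true               (* goto on E, then S' -> E. with no input left: accept *)
  | _ -> raise Reject

and shift = function              (* shift actions shared by state0 and state_lp *)
  | 'a' :: rest -> state_a rest
  | '(' :: rest -> state_lp rest
  | _ -> raise Reject

and state_a rest = (0, rest)      (* E -> a. : reduce, popping this single frame *)

and state_lp input =              (* items: E -> (.E) ; E -> .a ; E -> .(E) *)
  match shift input with
  | (p, rest) when p > 0 ->
      (p - 1, rest)               (* general scheme: keep unwinding past this frame
                                     (not exercised by this tiny grammar) *)
  | (_, rest) ->                  (* goto on E *)
      let (p, rest') = state_lpE rest in
      (p - 1, rest')

and state_lpE = function          (* item: E -> (E.) *)
  | ')' :: rest -> let (p, r) = state_rp rest in (p - 1, r)
  | _ -> raise Reject

and state_rp rest = (2, rest)     (* E -> (E). : reduce, popping 3 frames in total *)

let parse s =
  try state0 (List.init (String.length s) (String.get s))
  with Reject -> false

let () =
  assert (parse "a");
  assert (parse "((a))");
  assert (not (parse "(a"));
  print_endline "ok"
```

The number of state functions is exactly the (moderate) number of LR(0) states, and nothing here is a table; the same grammar could instead carry its states in a purely functional graph and be modified at runtime, as argued above.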
And all these algorithms can be hidden behind a library, not to be seen by the low-level students, casual users, common programmers...

Last point: as Luc Maranget said, exhaustiveness can be computed on pushdown or LR(0) automata. Since it has nothing to do with conflict resolution, there is no need to work on the disambiguation automata.

Diego Olivier

-------------------
To unsubscribe, mail caml-list-request@inria.fr
Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs
FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners