Segfault in ocamllex-generated code using 'shortest' #7760

vicuna · 2018-04-02T17:49:43Z

Original bug ID: 7760
Reporter: @stedolan
Assigned to: @maranget
Status: resolved (set by @maranget on 2018-04-10T09:05:11Z)
Resolution: open
Priority: normal
Severity: minor
Version: 4.06.0
Fixed in version: 4.07.0+dev/beta2/rc1/rc2
Category: tools (ocaml{lex,yacc,dep,debug,...})
Monitored by: @nojb @gasche @maranget

Bug description

On my machine (amd64 Debian), the following program usually segfaults:

rule read = shortest
  | ("aa" | "bbb") (_ as x) _?   { x }
  | _ as y                       { y }

{ let _ = read (Lexing.from_string "asdf") }

when compiled and run as:

ocamllex -q -o lexer.ml lexer.mll && ocamlopt -o lexer lexer.ml && ./lexer

This example is reduced from a larger lexer. The segfault only seems to occur when using 'shortest' instead of 'parse', but I'm not sure exactly which combination of features triggers the bug. The problem is reproducible using OCaml versions back to at least 3.11.2.

The text was updated successfully, but these errors were encountered:

vicuna · 2018-04-03T08:09:56Z

Comment author: @alainfrisch

FWIW, using "ocamllex -ml" seems to work (at least, no segfault).

vicuna · 2018-04-03T09:02:34Z

Comment author: @xclerc

The latest trunk (db6891f) does not segfault on this code.

vicuna · 2018-04-03T09:09:54Z

Comment author: @nojb

db6891f segfaults for me, looks like some bad indexing on the Lexing.lex_buffer field.

vicuna · 2018-04-06T07:25:55Z

Comment author: @let-def

I started investigating this issue.

The problem triggers when one branch capture sub-values (the (_ as x) in Stephen's example), while another branch catch all cases (the _ as y; note that this is not a sub-capture, the whole lexeme is returned, this doesn't stress the same code path).

The automaton produced is correct (though not minimal :)), that's why the -ml output works. However the bytecode generated is wrong: there will be out of bounds write to the lexbuf.Lexing.lex_mem fields (it is a vector that is used to store the locations of capture groups).

If you don't capture sub-values, the lexer will use the caml_lex_engine primitive for interpretation which is correct as far as I can tell.

However, if one of the branch capture sub-values, caml_new_lex_engine is used, which can do arbitrary writes (via the run_tag C function).

Btw, this is not an initialization issue (one could think that the position vector is too short), it is because of the wrong interpretation of a tag which consumes garbage values and writes at some arbitrary offset of lex_mem.

My next step will be to instrument bytecode generation to understand what goes wrong, but I progress slowly as I found few resources on that part :).

vicuna · 2018-04-06T07:28:41Z

Comment author: @let-def

xclerc: sometimes the random write corrupts the heap, sometimes it doesn't. You will have to test in different memory conditions (and for good measures, put an assertion in run_tag to check for the bounds).

vicuna · 2018-04-06T17:59:38Z

Comment author: @xavierleroy

Maybe @maranget could look into this issue as well.

vicuna · 2018-04-09T12:25:24Z

Comment author: @maranget

I am having a look.

vicuna · 2018-04-09T14:40:50Z

Comment author: @maranget

I think I have found the bug, but I am lacking time to submit
a pull request now.

Basically, the problem originates from the table compaction function being
able to optimize away memory instructions [in compact.ml]; while the
main output function does not notice it (and thus emits a call to caml_new_lex_engine, while a call to caml_lex_engin would be approriate).

vicuna · 2018-04-10T08:45:50Z

Comment author: @maranget

There is now a pull request that corrects this PR,
see #1713

vicuna closed this as completed Apr 10, 2018

vicuna added the tools label Mar 14, 2019

vicuna assigned maranget Mar 14, 2019

vicuna added the bug label Mar 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segfault in ocamllex-generated code using 'shortest' #7760

Segfault in ocamllex-generated code using 'shortest' #7760

vicuna commented Apr 2, 2018

vicuna commented Apr 3, 2018

vicuna commented Apr 3, 2018

vicuna commented Apr 3, 2018

vicuna commented Apr 6, 2018

vicuna commented Apr 6, 2018

vicuna commented Apr 6, 2018

vicuna commented Apr 9, 2018

vicuna commented Apr 9, 2018

vicuna commented Apr 10, 2018

Segfault in ocamllex-generated code using 'shortest' #7760

Segfault in ocamllex-generated code using 'shortest' #7760

Comments

vicuna commented Apr 2, 2018

Bug description

vicuna commented Apr 3, 2018

vicuna commented Apr 3, 2018

vicuna commented Apr 3, 2018

vicuna commented Apr 6, 2018

vicuna commented Apr 6, 2018

vicuna commented Apr 6, 2018

vicuna commented Apr 9, 2018

vicuna commented Apr 9, 2018

vicuna commented Apr 10, 2018