parsing - How do you write a lexer parser where identifiers may begin with keywords? -


suppose have language identifiers might begin keywords. example, suppose "case" keyword, "caser" valid identifier. suppose lexer rules can handle regular expressions. seems can't place keyword rules ahead of identifier rule in lexer, because parse "caser" "case" followed "r". can't place keyword lexing rules after identifier rule, since identifier rule match keywords, , keyword rules never trigger.

so, instead, make keyword_or_identifier rule in lexer, , have parser determine if keyword_or_identifier keyword or identifier. done?

i realize "use different lexer has lookahead" answer (kind of), i'm interested in how done in traditional dfa-based lexer, since current lexer seems work way.

most lexers, starting original lex, match alternatives follows:

  1. use longest match.

  2. if there 2 or more alternatives tie longest match, use first 1 in lexer definition.

this allows following style:

"case"                  { return case; } [[:alpha:]][[:alnum:]]* { return id; } 

if input pattern caser, second alternative used because it's longest match. if input pattern case r, first alternative used because both of them match case, , rule (2) above, first 1 wins.

although may seem bit arbitrary, it's consistent dfa approach, mostly. first of all, dfa doesn't stop first time reaches accepting state. if did, patterns [[:alpha:]][[:alnum:]]* useless, because enter accepting state on first character (assuming alphabetic). instead, dfa-based lexers continue until there no possible transitions current state, , until last accepting state. (see below.)

a given dfa state may accepting because of 2 different rules, that's not problem either; first accepting rule recorded.

to fair, different mathematical model of dfa, has transition every symbol in every state (although many of them may transitions "sink" state), , matches complete input depending on whether or not automaton in accepting state when last symbol of input read. lexer model different, can formalized well.

the difficulty in theoretical model "back last accepting state". in practice, done remembering state , input position every time accepting state reached. mean may necessary rewind input stream, possibly arbitrary amount.

most languages not require backup often, , few require indefinite backup. lexer generators can generate faster code if there no backup states. (flex if use -cf or -cf.)

one common case leads indefinite backup failing provide appropriate error return string literals:

["][^"\n]*["]    { return string; }  /* ... */ .                { return invalid; } 

here, first pattern match string literal starting " if there matching " on same line. (i omitted \-escapes simplicity.) if string literal unterminated, last pattern match, input need rewound ". in cases, it's pointless trying continue lexical analysis ignoring unmatched "; make more sense ignore entire remainder of line. not backing inefficient, lead explosion of false error messages. better solution might be:

["][^"\n]*["]    { return string; }  ["][^"\n]*       { return invalid_string; } 

here, second alternative can succeed if string unterminated, because if string terminated, first alternative match 1 more character. consequently, doesn't matter in order these alternatives appear, although know put them in same order did.


Comments

Popular posts from this blog

php - Why I am getting the Error "Commands out of sync; you can't run this command now" -

linux - Does gcc have any options to add version info in ELF binary file? -

java - Are there any classes that implement javax.persistence.Parameter<T>? -