Haskell Alex - regex matches wrong string?
I'm trying to write a lexer for an indentation-based grammar, and I'm having trouble matching indentation.
Here's my code:
    {
    module Lexer ( main ) where

    import System.IO.Unsafe
    }

    %wrapper "monadUserState"

    $whitespace = [\ \t\b]
    $digit      = 0-9                   -- digits
    $alpha      = [A-Za-z]
    $letter     = [a-zA-Z]              -- alphabetic characters
    $ident      = [$letter $digit _]    -- identifier character
    $indent     = [\ \t]

    @number     = [$digit]+
    @identifier = $alpha($alpha|_|$digit)*

    error:-

    @identifier              { mkL LVarId }

    \n $whitespace* \n       { skip }
    \n $whitespace*          { setIndent }
    $whitespace+             { skip }

    {

    data Lexeme = Lexeme AlexPosn LexemeClass (Maybe String)

    instance Show Lexeme where
        show (Lexeme _ LEOF _)  = " Lexeme EOF"
        show (Lexeme p cl mbs)  = " Lexeme class=" ++ show cl ++ showap p ++ showst mbs
          where
            showap pp = " posn=" ++ showPosn pp
            showst Nothing  = ""
            showst (Just s) = " string=" ++ show s

    instance Eq Lexeme where
        (Lexeme _ cls1 _) == (Lexeme _ cls2 _) = cls1 == cls2

    showPosn :: AlexPosn -> String
    showPosn (AlexPn _ line col) = show line ++ ':' : show col

    tokPosn :: Lexeme -> AlexPosn
    tokPosn (Lexeme p _ _) = p

    data LexemeClass
        = LVarId
        | LTIndent Int
        | LTDedent Int
        | LIndent
        | LDedent
        | LEOF
        deriving (Show, Eq)

    mkL :: LexemeClass -> AlexInput -> Int -> Alex Lexeme
    mkL c (p, _, _, str) len = return (Lexeme p c (Just (take len str)))

    data AlexUserState = AlexUserState { indent :: Int }

    alexInitUserState :: AlexUserState
    alexInitUserState = AlexUserState 0

    type Action = AlexInput -> Int -> Alex Lexeme

    getLexerIndentLevel :: Alex Int
    getLexerIndentLevel = Alex $ \s@AlexState{alex_ust=ust} -> Right (s, indent ust)

    setLexerIndentLevel :: Int -> Alex ()
    setLexerIndentLevel i = Alex $ \s@AlexState{alex_ust=ust} -> Right (s{alex_ust=(AlexUserState i)}, ())

    setIndent :: Action
    setIndent input@(p, _, _, str) i = do
        --let !x = unsafePerformIO $ putStrLn $ "|matched string: " ++ str ++ "|"
        lastIndent <- getLexerIndentLevel
        currIndent <- countIndent (drop 1 str) 0 -- the first char is always \n
        if (lastIndent < currIndent)
            then do
                setLexerIndentLevel currIndent
                mkL (LTIndent (currIndent - lastIndent)) input i
            else if (lastIndent > currIndent)
                then do
                    setLexerIndentLevel currIndent
                    mkL (LTDedent (lastIndent - currIndent)) input i
                else alexMonadScan
      where
        countIndent str total
            | take 1 str == "\t"   = do skip input 1
                                        countIndent (drop 1 str) (total+1)
            | take 4 str == "    " = do skip input 4
                                        countIndent (drop 4 str) (total+1)
            | otherwise            = return total

    alexEOF :: Alex Lexeme
    alexEOF = return (Lexeme undefined LEOF Nothing)

    scanner :: String -> Either String [Lexeme]
    scanner str =
        let loop = do
                tok@(Lexeme _ cl _) <- alexMonadScan
                if (cl == LEOF)
                    then return [tok]
                    else do
                        toks <- loop
                        return (tok:toks)
        in runAlex str loop

    addIndentations :: [Lexeme] -> [Lexeme]
    addIndentations (lex@(Lexeme pos (LTIndent c) _):ls) =
        concat [iter lex c, addIndentations ls]
      where
        iter lex c = if c == 0 then []
                     else (Lexeme pos LIndent Nothing):(iter lex (c-1))
    addIndentations (lex@(Lexeme pos (LTDedent c) _):ls) =
        concat [iter lex c, addIndentations ls]
      where
        iter lex c = if c == 0 then []
                     else (Lexeme pos LDedent Nothing):(iter lex (c-1))
    addIndentations (l:ls) = l:(addIndentations ls)
    addIndentations [] = []

    main = do
        s <- getContents
        return ()
        print $ fmap addIndentations (scanner s)

    }
The problem is in the line

    \n $whitespace*          { setIndent }

The regex matches the wrong string and calls setIndent with that wrong string. For debugging purposes, I added unsafePerformIO to the setIndent function; here's an example run of the program:
    begin first indent
    |matched string: first indent second indent second indent dedent dedent |
    |matched string: second indent dedent |
    |matched string: dedent |
    |matched string: |
    Right [ Lexeme class=LVarId posn=1:1 string="begin",
            Lexeme class=LIndent posn=1:6,
            Lexeme class=LVarId posn=2:15 string="indent",
            Lexeme class=LIndent posn=2:21,
            Lexeme class=LDedent posn=3:30,
            Lexeme class=LDedent posn=3:30,
            Lexeme class=LVarId posn=4:1 string="dedent",
            Lexeme EOF]
So setIndent is called with more than just whitespace. And after it returns the lexeme for the indentation, the other part of the string is omitted.
Is this a bug in Alex? Or what am I doing wrong?
So I haven't analysed your code in detail, but I did notice this:
    setIndent :: Action
    setIndent input@(p, _, _, str) i = do
        --let !x = unsafePerformIO $ putStrLn $ "|matched string: " ++ str ++ "|"
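Note that str is the rest of the input, not just the current token. To get the current token, you want to take only the matched prefix of str, that is, take len str where len is the action's Int argument (as mkL already does). Perhaps this is giving you the impression that the token is matching more of the input than it really is.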
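To make the distinction concrete, here is a small stand-alone sketch that does not use Alex at all. The Input type synonym and the helpers currentToken and restAfterToken are hypothetical names; Input only mimics the shape of the four-tuple that the actions in the question pattern-match on, and len stands for the action's second argument, the length of the match.

```haskell
-- Stand-alone illustration; no Alex required.
-- 'Input' is a hypothetical stand-in for the four-tuple the question's
-- actions pattern-match on: (position, previous char, pending bytes, rest of input).
type Input = ((Int, Int), Char, [Int], String)

-- The text actually matched by the rule: the first 'len' characters.
currentToken :: Input -> Int -> String
currentToken (_, _, _, str) len = take len str

-- Everything the lexer has not consumed yet once this token is accepted.
restAfterToken :: Input -> Int -> String
restAfterToken (_, _, _, str) len = drop len str

-- Example input positioned just after "begin": a newline, four spaces,
-- then more source text. A rule like  \n $whitespace*  matches 5 characters.
exampleInput :: Input
exampleInput = ((1, 6), 'n', [], "\n    first indent\ndedent\n")

exampleLen :: Int
exampleLen = 5
```

Here currentToken exampleInput exampleLen is only the newline plus four spaces, whereas printing the fourth tuple component directly shows the whole remaining input, which is what the debug output in the question displays.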
We do of course handle indentation in GHC's own lexer, so you might want to look there for ideas (although, as you might expect, it's rather large and complicated).
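As a much smaller illustration than GHC's lexer, the level comparison from the question can be sketched as a pure function applied only to the matched text (take len str) rather than to the whole remaining input. This is not how GHC does it and not a drop-in Alex action; IndentChange, indentLevel and indentChange are hypothetical names, and the one-tab-or-four-spaces rule is simply copied from the question's countIndent.

```haskell
-- Hypothetical names throughout; a pure sketch, not the Alex action itself.
data IndentChange = Indented Int | Dedented Int | Unchanged
  deriving (Show, Eq)

-- Count indentation levels in the matched text only: one tab or four
-- spaces per level, mirroring the question's countIndent.
indentLevel :: String -> Int
indentLevel ('\t' : rest)                  = 1 + indentLevel rest
indentLevel (' ' : ' ' : ' ' : ' ' : rest) = 1 + indentLevel rest
indentLevel _                              = 0

-- Compare the previous level with the level of the matched token.
-- 'matched' is expected to be  take len str,  leading '\n' included.
-- Returns the new level together with the change to report.
indentChange :: Int -> String -> (Int, IndentChange)
indentChange lastLevel matched
  | current > lastLevel = (current, Indented (current - lastLevel))
  | current < lastLevel = (current, Dedented (lastLevel - current))
  | otherwise           = (current, Unchanged)
  where
    current = indentLevel (drop 1 matched) -- skip the leading newline
```

Inside an action, the first component would be stored with setLexerIndentLevel and the second turned into an LTIndent or LTDedent lexeme; because the function never looks past the matched prefix, text after the indentation is left for the following rules.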