Haskell Alex - regex matches wrong string?
I'm trying to write a lexer for an indentation-based grammar, and I'm having trouble matching indentation.
Here's my code:
    {
    module Lexer
      ( main
      ) where

    import System.IO.Unsafe
    }

    %wrapper "monadUserState"

    $whitespace = [\ \t\b]
    $digit      = 0-9                   -- digits
    $alpha      = [A-Za-z]
    $letter     = [a-zA-Z]              -- alphabetic characters
    $ident      = [$letter $digit _]    -- identifier character
    $indent     = [\ \t]

    @number     = [$digit]+
    @identifier = $alpha($alpha|_|$digit)*

    error:-

    @identifier          { mkL LVarId }

    \n $whitespace* \n   { skip }
    \n $whitespace*      { setIndent }
    $whitespace+         { skip }

    {

    data Lexeme = Lexeme AlexPosn LexemeClass (Maybe String)

    instance Show Lexeme where
        show (Lexeme _ LEOF _)  = " Lexeme EOF"
        show (Lexeme p cl mbs)  = " Lexeme class=" ++ show cl ++ showap p ++ showst mbs
          where
            showap pp = " posn=" ++ showPosn pp
            showst Nothing  = ""
            showst (Just s) = " string=" ++ show s

    instance Eq Lexeme where
        (Lexeme _ cls1 _) == (Lexeme _ cls2 _) = cls1 == cls2

    showPosn :: AlexPosn -> String
    showPosn (AlexPn _ line col) = show line ++ ':': show col

    tokPosn :: Lexeme -> AlexPosn
    tokPosn (Lexeme p _ _) = p

    data LexemeClass
        = LVarId
        | LTIndent Int
        | LTDedent Int
        | LIndent
        | LDedent
        | LEOF
        deriving (Show, Eq)

    mkL :: LexemeClass -> AlexInput -> Int -> Alex Lexeme
    mkL c (p, _, _, str) len = return (Lexeme p c (Just (take len str)))

    data AlexUserState = AlexUserState { indent :: Int }

    alexInitUserState :: AlexUserState
    alexInitUserState = AlexUserState 0

    type Action = AlexInput -> Int -> Alex Lexeme

    getLexerIndentLevel :: Alex Int
    getLexerIndentLevel = Alex $ \s@AlexState{alex_ust=ust} -> Right (s, indent ust)

    setLexerIndentLevel :: Int -> Alex ()
    setLexerIndentLevel i = Alex $ \s@AlexState{alex_ust=ust} -> Right (s{alex_ust=(AlexUserState i)}, ())

    setIndent :: Action
    setIndent input@(p, _, _, str) i = do
        --let !x = unsafePerformIO $ putStrLn $ "|matched string: " ++ str ++ "|"
        lastIndent <- getLexerIndentLevel
        currIndent <- countIndent (drop 1 str) 0 -- first char is always \n
        if (lastIndent < currIndent) then
            do setLexerIndentLevel currIndent
               mkL (LTIndent (currIndent - lastIndent)) input i
        else if (lastIndent > currIndent) then
            do setLexerIndentLevel currIndent
               mkL (LTDedent (lastIndent - currIndent)) input i
        else alexMonadScan
      where
        countIndent str total
            | take 1 str == "\t" = do
                skip input 1
                countIndent (drop 1 str) (total+1)
            | take 4 str == "    " = do
                skip input 4
                countIndent (drop 4 str) (total+1)
            | otherwise = return total

    alexEOF :: Alex Lexeme
    alexEOF = return (Lexeme undefined LEOF Nothing)

    scanner :: String -> Either String [Lexeme]
    scanner str =
        let loop = do
                tok@(Lexeme _ cl _) <- alexMonadScan
                if (cl == LEOF)
                    then return [tok]
                    else do toks <- loop
                            return (tok:toks)
        in runAlex str loop

    addIndentations :: [Lexeme] -> [Lexeme]
    addIndentations (lex@(Lexeme pos (LTIndent c) _):ls) =
        concat [iter lex c, addIndentations ls]
      where iter lex c = if c == 0 then []
                         else (Lexeme pos LIndent Nothing):(iter lex (c-1))
    addIndentations (lex@(Lexeme pos (LTDedent c) _):ls) =
        concat [iter lex c, addIndentations ls]
      where iter lex c = if c == 0 then []
                         else (Lexeme pos LDedent Nothing):(iter lex (c-1))
    addIndentations (l:ls) = l:(addIndentations ls)
    addIndentations [] = []

    main = do
        s <- getContents
        return ()
        print $ fmap addIndentations (scanner s)

    }

The problem is in the line \n $whitespace* { setIndent }: the regex matches the wrong string and calls setIndent with this wrong string. For debugging purposes, I added unsafePerformIO in the setIndent function; here's an example run of the program:
    begin first indent
    |matched string: first indent second indent second indent dedent dedent |
    |matched string: second indent dedent |
    |matched string: dedent |
    |matched string: |
    Right [ Lexeme class=LVarId posn=1:1 string="begin",
            Lexeme class=LIndent posn=1:6,
            Lexeme class=LVarId posn=2:15 string="indent",
            Lexeme class=LIndent posn=2:21,
            Lexeme class=LDedent posn=3:30,
            Lexeme class=LDedent posn=3:30,
            Lexeme class=LVarId posn=4:1 string="dedent",
            Lexeme EOF]

So setIndent is called with more than just whitespace. And after it returns the lexeme for the indentation, the other part of the string is omitted.
Is this a bug in Alex? Or what am I doing wrong?
So I haven't analysed your code in detail, but I did notice this:

    setIndent :: Action
    setIndent input@(p, _, _, str) i = do
        --let !x = unsafePerformIO $ putStrLn $ "|matched string: " ++ str ++ "|"

Note that str is the rest of the input, not just the current token. To get the current token, you want take i str, where i is the match length that Alex passes as the action's second argument. Perhaps this is giving the impression that the token is matching more of the input than it actually is.
We do handle indentation in GHC's own lexer of course, so you might want to look there for ideas (although as you might expect it's rather large and complicated).
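To make the difference between the remaining input and the matched token concrete, here is a minimal sketch in plain Haskell, without Alex. The Input type is a simplified stand-in I made up for illustration (the real monadUserState wrapper passes a 4-tuple whose last component is the remaining input), and matchedText and indentLevel are hypothetical helper names, not part of Alex. The indentation rule (one tab or four spaces per level) is taken from the question's countIndent.

```haskell
-- Hypothetical, simplified stand-in for Alex's AlexInput:
-- (character offset, previous character, remaining input).
type Input = (Int, Char, String)

-- The text the rule actually matched: only the first len characters
-- of the remaining input, not the whole remaining string.
matchedText :: Input -> Int -> String
matchedText (_, _, str) len = take len str

-- Indentation level of a match of the form: newline followed by whitespace.
-- One tab or four spaces count as one level, as in the question.
indentLevel :: String -> Int
indentLevel = go . drop 1          -- drop the leading '\n'
  where
    go ('\t' : rest)                 = 1 + go rest
    go (' ' : ' ' : ' ' : ' ' : rest) = 1 + go rest
    go _                             = 0
```

With a remaining input of "\n    first indent\n..." and a match length of 5, matchedText returns only the newline and the four spaces, and indentLevel of that result is 1. Printing str directly, as the debug line does, would show everything up to the end of the input instead.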