Safe Haskell	Safe
Language	Haskell2010

Control.Monad.Tokenizer.Char8.Strict

Contents

Description

A monad for writing pure tokenizers in an imperative-looking way.

Main idea: You walk through the input string like a turtle, and everytime you find a token boundary, you call emit. If some specific kinds of tokens should be suppressed, you can discard them instead (or filter afterwards).

This module supports is specialized for strict ByteStrings. The module Control.Monad.Tokenizer provides more general types, but does not export a Tokenizable instance for ByteStrings, as its implementation depends on the encoding. This module assumes ASCII encoding (you have been warned!).

Example for a simple tokenizer, that splits words by whitespace and discards stop symbols:

tokenizeWords :: ByteString -> [ByteString] tokenizeWords = runTokenizer $ untilEOT $ do c <- pop if isStopSym c then discard else if c `elem` (" \t\r\n" :: [Char]) then discard else do walkWhile (\c -> (c=='_') || not (isSpace c || isPunctuation' c)) emit

Synopsis

type Tokenizer = Tokenizer ByteString
runTokenizer :: Tokenizer () -> ByteString -> [ByteString]
runTokenizerCS :: Tokenizer () -> ByteString -> [ByteString]
untilEOT :: Tokenizer () -> Tokenizer ()
peek :: Tokenizer Char
isEOT :: Tokenizer Bool
lookAhead :: [Char] -> Tokenizer Bool
walk :: Tokenizer ()
walkBack :: Tokenizer ()
pop :: Tokenizer Char
walkWhile :: (Char -> Bool) -> Tokenizer ()
walkFold :: a -> (Char -> a -> Maybe a) -> Tokenizer ()
emit :: Tokenizer ()
discard :: Tokenizer ()
restore :: Tokenizer ()

Monad

type Tokenizer = Tokenizer ByteString Source #

Tokenizer monad. Use runTokenizer or runTokenizerCS to run it

runTokenizer :: Tokenizer () -> ByteString -> [ByteString] Source #

Split a string into tokens using the given tokenizer

runTokenizerCS :: Tokenizer () -> ByteString -> [ByteString] Source #

Split a string into tokens using the given tokenizer, case sensitive version

untilEOT :: Tokenizer () -> Tokenizer () Source #

Repeat a given tokenizer as long as the end of text is not reached

Tests

peek :: Tokenizer Char Source #

Peek the current character

isEOT :: Tokenizer Bool Source #

Have I reached the end of the input text?

lookAhead :: [Char] -> Tokenizer Bool Source #

Check if the next input chars agree with the given string

Movement

walk :: Tokenizer () Source #

Proceed to the next character

walkBack :: Tokenizer () Source #

Walk back to the previous character, unless it was discarded/emitted.

pop :: Tokenizer Char Source #

Peek the current character and proceed

walkWhile :: (Char -> Bool) -> Tokenizer () Source #

Proceed as long as a given function succeeds

walkFold :: a -> (Char -> a -> Maybe a) -> Tokenizer () Source #

Proceed as long as a given fold returns Just (generalization of walkWhile)

Transactions

emit :: Tokenizer () Source #

Break at the current position and emit the scanned token

discard :: Tokenizer () Source #

Break at the current position and discard the scanned token

restore :: Tokenizer () Source #

Restore the state after the last emit/discard.

Orphan instances

Tokenizable ByteString Source #	Assuming ASCII encoding
Instance details Methods tnull :: ByteString -> Bool Source # thead :: ByteString -> Char Source # ttail :: ByteString -> ByteString Source # ttake :: Int -> ByteString -> ByteString Source # tdrop :: Int -> ByteString -> ByteString Source # tlower :: ByteString -> ByteString Source #