| Safe Haskell | Safe |
|---|---|
| Language | Haskell2010 |
Control.Monad.Tokenizer.Char8.Strict
Description
A monad for writing pure tokenizers in an imperative-looking way.
Main idea: You walk through the input string like a turtle, and everytime you find a token boundary, you call emit. If some specific kinds of tokens should be suppressed, you can discard them instead (or filter afterwards).
This module supports is specialized for strict ByteStrings. The module Control.Monad.Tokenizer provides more general types, but does not export a Tokenizable instance for ByteStrings, as its implementation depends on the encoding. This module assumes ASCII encoding (you have been warned!).
Example for a simple tokenizer, that splits words by whitespace and discards stop symbols:
tokenizeWords :: ByteString -> [ByteString] tokenizeWords = runTokenizer $ untilEOT $ do c <- pop if isStopSym c then discard else if c `elem` (" \t\r\n" :: [Char]) then discard else do walkWhile (\c -> (c=='_') || not (isSpace c || isPunctuation' c)) emitSynopsis
- type Tokenizer = Tokenizer ByteString
- runTokenizer :: Tokenizer () -> ByteString -> [ByteString]
- runTokenizerCS :: Tokenizer () -> ByteString -> [ByteString]
- untilEOT :: Tokenizer () -> Tokenizer ()
- peek :: Tokenizer Char
- isEOT :: Tokenizer Bool
- lookAhead :: [Char] -> Tokenizer Bool
- walk :: Tokenizer ()
- walkBack :: Tokenizer ()
- pop :: Tokenizer Char
- walkWhile :: (Char -> Bool) -> Tokenizer ()
- walkFold :: a -> (Char -> a -> Maybe a) -> Tokenizer ()
- emit :: Tokenizer ()
- discard :: Tokenizer ()
- restore :: Tokenizer ()
Monad
type Tokenizer = Tokenizer ByteString Source #
Tokenizer monad. Use runTokenizer or runTokenizerCS to run it
runTokenizer :: Tokenizer () -> ByteString -> [ByteString] Source #
Split a string into tokens using the given tokenizer
runTokenizerCS :: Tokenizer () -> ByteString -> [ByteString] Source #
Split a string into tokens using the given tokenizer, case sensitive version
untilEOT :: Tokenizer () -> Tokenizer () Source #
Repeat a given tokenizer as long as the end of text is not reached
Tests
lookAhead :: [Char] -> Tokenizer Bool Source #
Check if the next input chars agree with the given string
Movement
walkBack :: Tokenizer () Source #
Walk back to the previous character, unless it was discarded/emitted.
walkFold :: a -> (Char -> a -> Maybe a) -> Tokenizer () Source #
Proceed as long as a given fold returns Just (generalization of walkWhile)
Transactions
Orphan instances
| Tokenizable ByteString Source # | Assuming ASCII encoding |
Methods tnull :: ByteString -> Bool Source # thead :: ByteString -> Char Source # ttail :: ByteString -> ByteString Source # ttake :: Int -> ByteString -> ByteString Source # tdrop :: Int -> ByteString -> ByteString Source # tlower :: ByteString -> ByteString Source # | |