Token, Patterns, and Lexemes
Last Updated : 23 Jul, 2025
To understand how a compiler reads a program, it helps to know the basic elements it works with: tokens, patterns, and lexemes. These three concepts are central to how source code is scanned and parsed.
A compiler is system software that translates a source program written in a high-level language into a low-level language. Compilation is divided into several phases to simplify design and development. The phases run in sequence, with the output of each phase serving as the input to the next. This article focuses on the first phase, lexical analysis.
Lexical Analysis Phase
In this phase, the input is the source program, read from left to right, and the output is a sequence of tokens that is analyzed by the next phase, syntax analysis. While scanning, the lexical analyzer (or scanner) discards whitespace characters (blank spaces, tabs, line feeds, carriage returns) and comments. In C, preprocessor directives and macros are normally handled by the preprocessor before the scanner sees the code. The scanner also helps in error detection: it reports lexical errors such as malformed constants or characters that cannot begin any token. A misspelled keyword is usually not caught here, because it still matches the identifier pattern; it is reported by a later phase. Regular expressions are the standard notation for specifying the tokens of a programming language.
What is a Token?
In programming, a token is the smallest meaningful unit of a program; it may be an identifier, keyword, operator, or symbol. A token represents a sequence of characters that the compiler treats as a single unit and does not decompose further. In a language such as C, examples of tokens include:
- Keywords: reserved words in C such as `int`, `char`, `float`, `const`, `goto`, etc.
- Identifiers: names of variables and user-defined functions.
- Operators: `+`, `-`, `*`, `/`, etc.
- Delimiters/Punctuators: symbols such as commas `,`, semicolons `;`, and braces `{}`.
Broadly, tokens may be divided into three categories:
- Terminal Symbols (TRM): keywords and operators.
- Literals (LIT): values such as numbers and strings.
- Identifiers (IDN): names defined by the user.
Let's now see how to count the tokens in a piece of source code (C language):
Example 1:
int a = 10; //Input Source code
Tokens
int (keyword), a (identifier), = (operator), 10 (constant), and ; (punctuation: semicolon)
Answer - Total number of tokens = 5
Example 2:
int main() {
// printf() sends the string inside quotation to
// the standard output (the display)
printf("Welcome to GeeksforGeeks!");
return 0;
}
Tokens
'int', 'main', '(', ')', '{', 'printf', '(', '"Welcome to GeeksforGeeks!"',
')', ';', 'return', '0', ';', '}'
Answer - Total number of tokens = 14
What is a Lexeme?
A lexeme is a sequence of characters in the source code that matches the pattern of a token and is therefore recognized as an instance of that token. For example, in the expression `x + 5`, the lexemes are `x` (an identifier), `+` (an operator), and `5` (a constant). A lexeme must follow the rules of the language to be recognized as a valid token.
Example:
main is a lexeme of type identifier (token)
(, ), {, } are lexemes of type punctuation (token)
What is a Pattern?
A pattern is a rule that describes which sequences of characters form a valid token of a given type. It tells the scanner how to recognize the lexemes of that token in the source code.
Example of Programming Language (C, C++)
For a keyword to be identified as a valid token, the pattern is the exact sequence of characters that makes up the keyword.
For an identifier to be identified as a valid token, the pattern is the rule that it must start with a letter or an underscore, followed by any number of letters, digits, or underscores.
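The two patterns above can be checked with a few lines of C. This is a minimal sketch under stated assumptions: the function names are hypothetical, the keyword table lists only a handful of C's reserved words, and the identifier rule is the usual C one (a letter or underscore first, then letters, digits, or underscores).

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Identifier pattern: a letter or underscore, followed by any number
   of letters, digits, or underscores. Hypothetical helper name. */
bool matches_identifier_pattern(const char *lexeme)
{
    if (!(isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_'))
        return false;                 /* also rejects the empty string */
    for (size_t i = 1; lexeme[i] != '\0'; i++) {
        if (!(isalnum((unsigned char)lexeme[i]) || lexeme[i] == '_'))
            return false;
    }
    return true;
}

/* Keyword pattern: the lexeme must equal a reserved word exactly.
   The table is deliberately incomplete; it is only an illustration. */
bool matches_keyword_pattern(const char *lexeme)
{
    static const char *const keywords[] = {
        "int", "char", "float", "const", "goto", "return"
    };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strcmp(lexeme, keywords[i]) == 0)
            return true;
    }
    return false;
}
```

Note that every keyword also satisfies the identifier pattern, so a scanner built this way would typically test the keyword table first and fall back to the identifier token only when no reserved word matches.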
Difference Between Token, Lexeme, and Pattern
| Criteria | Token | Lexeme | Pattern |
|---|---|---|---|
| Definition | A sequence of characters treated as a single unit because it cannot be broken down further. | A sequence of characters in the source code that matches the pattern of a token. | The set of rules a scanner follows to recognize the lexemes of a token. |
| Interpretation of type Keyword | All the reserved words of the language. | int, goto | The exact sequence of characters that makes up the keyword. |
| Interpretation of type Identifier | Name of a variable, function, etc. | main, a | It must start with a letter or underscore, followed by letters, digits, or underscores. |
| Interpretation of type Operator | All the operators are considered tokens. | +, = | +, = |
| Interpretation of type Punctuation | Each kind of punctuation is considered a token, e.g. semicolon, bracket, comma, etc. | (, ), {, } | (, ), {, } |
| Interpretation of type Literal | A constant value such as a string, number, or boolean literal. | "Welcome to GeeksforGeeks!" | Any sequence of characters (except ") between " and " |
Output of Lexical Analysis Phase
The output of the lexical analyzer is passed to the syntax analyzer as a sequence of tokens rather than a series of lexemes. During syntax analysis, the exact text of each lexeme matters less than the category or class to which it belongs.
Example:
z = x + y;
To the syntax analyzer, this statement has the following form:
<id> = <id> + <id>; //<id>- identifier (token)
The lexical analyzer not only provides a series of tokens but also helps build the symbol table, which records the identifiers found in the source code (and, in some compilers, constants) together with their attributes. Whitespace and comments are never entered in it.
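Returning to the `z = x + y;` example, the token-class view handed to the syntax analyzer can be sketched as follows. This is an illustration under simplifying assumptions: `token_classes` and `append_text` are hypothetical names, every name is mapped to `<id>` (keywords are not distinguished), decimal integers are mapped to `<num>`, and any other non-space character is passed through as its own one-character token.

```c
#include <ctype.h>
#include <string.h>

/* Append text to the NUL-terminated buffer out without exceeding cap. */
static void append_text(char *out, size_t cap, const char *text)
{
    size_t used = strlen(out);
    if (used + 1 < cap)
        strncat(out, text, cap - used - 1);
}

/* Write the token classes of src into out, separated by single spaces.
   Hypothetical helper: names become <id>, integers become <num>, and
   every other non-space character is copied as a one-character token. */
void token_classes(const char *src, char *out, size_t cap)
{
    const char *p = src;

    if (cap == 0)
        return;
    out[0] = '\0';

    while (*p != '\0') {
        unsigned char c = (unsigned char)*p;

        if (isspace(c)) {             /* whitespace yields no token */
            p++;
            continue;
        }
        if (out[0] != '\0')
            append_text(out, cap, " ");

        if (isalpha(c) || c == '_') {
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
            append_text(out, cap, "<id>");
        } else if (isdigit(c)) {
            while (isdigit((unsigned char)*p))
                p++;
            append_text(out, cap, "<num>");
        } else {
            char op[2] = { *p, '\0' };
            append_text(out, cap, op);
            p++;
        }
    }
}
```

With a buffer of 64 bytes, `token_classes("z = x + y;", buf, sizeof buf)` leaves `<id> = <id> + <id> ;` in `buf`. The same output results from `a = b + c;`, which is the point: the parser sees token classes, not the individual lexemes.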
Conclusion
Tokens, patterns, and lexemes are the basic elements a compiler uses to break source code into meaningful pieces. Tokens are the categories of meaningful units, patterns define how those units are recognized, and lexemes are the actual character sequences that match the patterns. Understanding these concepts is essential for understanding how code is scanned and analyzed.