
Tokens, Patterns, and Lexemes

Last Updated : 23 Jul, 2025

In computer science, it is important for programmers to understand the basic elements that make up programming languages. These include tokens, patterns, and lexemes, which are essential in parsing and interpreting code.

In lexical analysis, tokens, patterns, and lexemes are key concepts. A token is a category, like a keyword or identifier, representing units of meaning. A pattern defines the structure that matches a token, such as a regular expression for an identifier. A lexeme is the actual sequence of characters in the source code that matches a token’s pattern.

A compiler is system software that translates a source program written in a high-level language into a low-level language. The compilation process is divided into several phases to ease development and design. The phases work in sequence: the output of each phase serves as the input to the next. The various phases are as follows:

Lexical Analysis Phase

In this phase, the input is the source program, read from left to right, and the output is a sequence of tokens to be analyzed by the next phase, Syntax Analysis. While scanning the source code, white space characters, comments, carriage returns, line feeds, tabs, and blank spaces are removed, and preprocessor directives and macros are handled. The lexical analyzer (or scanner) also helps in error detection; for example, invalid constants or misspelled keywords are caught during this phase. Regular expressions are used as the standard notation for specifying the tokens of a programming language.
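The discarding of comments and whitespace can be illustrated with a small C sketch. The function `strip_ws_and_comments` is hypothetical and greatly simplified: a real scanner interleaves this with token recognition and also handles block comments, string literals, and preprocessor directives.

```c
/* Illustrative pre-scan: drop //-style comments and collapse runs of
   whitespace to a single space, mimicking what a lexer discards.
   A sketch only -- not how production scanners are structured. */
void strip_ws_and_comments(const char *src, char *dst) {
    char *out = dst;
    while (*src) {
        if (src[0] == '/' && src[1] == '/') {
            /* comment: skip to end of line */
            while (*src && *src != '\n') src++;
        } else if (*src == ' ' || *src == '\t' ||
                   *src == '\n' || *src == '\r') {
            /* whitespace: keep at most one space as a separator */
            if (out != dst && out[-1] != ' ') *out++ = ' ';
            src++;
        } else {
            *out++ = *src++;   /* ordinary character: keep */
        }
    }
    if (out != dst && out[-1] == ' ') out--;   /* trim trailing space */
    *out = '\0';
}
```

For example, the input `"int a = 10; // init\n"` is reduced to `"int a = 10;"` before tokens are recognized.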

What is a Token?

In programming, a token is the smallest unit of meaningful data; it may be an identifier, keyword, operator, or symbol. A token represents a series or sequence of characters that cannot be decomposed further. In languages such as C, some examples of tokens would include:

  • Keywords: Reserved words in C such as `int`, `char`, `float`, `const`, `goto`, etc.
  • Identifiers: Names of variables and user-defined functions.
  • Operators: `+`, `-`, `*`, `/`, etc.
  • Delimiters/Punctuators: Symbols such as commas `,`, semicolons `;`, and braces `{}`.

By and large, tokens may be divided into three categories:

  • Terminal Symbols (TRM) : Keywords and operators.
  • Literals (LIT) : Values like numbers and strings.
  • Identifiers (IDN) : Names defined by the user.

Let's understand now how to calculate tokens in a source code (C language):

Example 1:

int a = 10; //Input Source code 

Tokens
int (keyword), a (identifier), = (operator), 10 (constant) and ; (punctuation, semicolon)

Answer - Total number of tokens = 5
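The counting rule above can be sketched as a tiny C scanner. This `count_tokens` function is a hypothetical, simplified tokenizer for illustration only; a real lexer also handles multi-character operators, string literals, comments, and more.

```c
#include <ctype.h>

/* Count tokens in a tiny C-like statement: a run of letters/underscores
   is one keyword or identifier, a run of digits is one constant, and
   every other non-space character is one operator or punctuator. */
int count_tokens(const char *s) {
    int n = 0;
    while (*s) {
        if (isspace((unsigned char)*s)) { s++; continue; }
        if (isalpha((unsigned char)*s) || *s == '_') {
            /* keyword or identifier */
            while (isalnum((unsigned char)*s) || *s == '_') s++;
        } else if (isdigit((unsigned char)*s)) {
            /* numeric constant */
            while (isdigit((unsigned char)*s)) s++;
        } else {
            /* single-character operator or punctuation */
            s++;
        }
        n++;
    }
    return n;
}
```

Applied to the statement from Example 1, `count_tokens("int a = 10;")` yields 5, matching the count above.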

Example 2:

int main() {
    // printf() sends the string inside the quotation marks to
    // the standard output (the display)
    printf("Welcome to GeeksforGeeks!");
    return 0;
}
Tokens
'int', 'main', '(', ')', '{', 'printf', '(', ' "Welcome to GeeksforGeeks!" ',
')', ';', 'return', '0', ';', '}'

Answer - Total number of tokens = 14

What is a Lexeme?

A lexeme is a sequence of source code that matches one of the predefined patterns and thereby forms a valid token. For example, in the expression `x + 5`, both `x` and `5` are lexemes that correspond to certain tokens. These lexemes follow the rules of the language in order for them to be recognized as valid tokens.

Example:

main is lexeme of type identifier(token)

(,),{,} are lexemes of type punctuation(token)

What is a Pattern?

A pattern is a rule or syntax that designates how tokens are identified in a programming language. It specifies the sequences of characters or symbols that make up valid tokens, giving the scanner the guidelines it needs to recognize them correctly.

Example of Programming Language (C, C++)

For a keyword to be recognized as a valid token, the pattern is the exact sequence of characters that forms the keyword.

For an identifier to be recognized as a valid token, the pattern is the predefined rule that it must start with a letter, followed by letters or digits.
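The identifier pattern described above can be written as a small predicate in C. This `matches_identifier` function is an illustrative sketch of the pattern; it also accepts underscores, which C permits in identifiers.

```c
#include <ctype.h>

/* Return 1 if s matches the identifier pattern:
   a letter (or underscore) followed by letters, digits, or underscores. */
int matches_identifier(const char *s) {
    if (!isalpha((unsigned char)*s) && *s != '_')
        return 0;                    /* must not start with a digit */
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s) && *s != '_')
            return 0;                /* illegal character inside name */
    return 1;
}
```

For example, `main` and `a` match the pattern, while `9lives` (starts with a digit) and `x+y` (contains an operator) do not.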

Difference Between Token, Lexeme, and Pattern

| Criteria | Token | Lexeme | Pattern |
|---|---|---|---|
| Definition | A sequence of characters treated as a single unit because it cannot be broken down further. | A sequence of characters in the source code that matches the language's predefined rules and is thereby recognized as a valid token. | A set of rules that the scanner follows to recognize a token. |
| Keyword | All the reserved words of the language (`int`, `goto`, `return`, etc.). | `int`, `goto` | The exact sequence of characters that forms the keyword. |
| Identifier | The name of a variable, function, etc. | `main`, `a` | Must start with a letter, followed by letters or digits. |
| Operator | Each operator is a token. | `+`, `=` | `+`, `=` |
| Punctuation | Each punctuation symbol is a token (semicolon, bracket, comma, etc.). | `(`, `)`, `{`, `}` | `(`, `)`, `{`, `}` |
| Literal | A constant value such as a number or string. | `"Welcome to GeeksforGeeks!"` | Any string of characters (except `"`) between `"` and `"`. |

Output of Lexical Analysis Phase

The output of the lexical analyzer serves as the input to the syntax analyzer. It is a sequence of tokens, not a series of lexemes, because during syntax analysis the individual lexeme is not important; what matters is the category or class to which it belongs.

Example:

z = x + y;
This statement has the below form for syntax analyzer
<id> = <id> + <id>; //<id>- identifier (token)

The lexical analyzer not only produces a sequence of tokens but also builds a symbol table containing the identifiers and other tokens found in the source code, excluding whitespace and comments.
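A minimal symbol table in C might be sketched as follows. The `add_symbol` function and its fixed-size array layout are hypothetical simplifications; production compilers typically use hash tables and record far more attributes per entry (type, scope, storage, etc.).

```c
#include <stdio.h>
#include <string.h>

/* A toy symbol table: each entry pairs a lexeme with its token class. */
#define MAX_SYMS 64

struct symbol { char lexeme[32]; char type[16]; };

static struct symbol table[MAX_SYMS];
static int nsyms = 0;

/* Insert a lexeme if it is not already present; return its index. */
int add_symbol(const char *lexeme, const char *type) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].lexeme, lexeme) == 0)
            return i;                       /* already recorded */
    if (nsyms >= MAX_SYMS)
        return -1;                          /* table full */
    snprintf(table[nsyms].lexeme, sizeof table[nsyms].lexeme, "%s", lexeme);
    snprintf(table[nsyms].type, sizeof table[nsyms].type, "%s", type);
    return nsyms++;
}
```

Scanning `z = x + y;` would record `z`, `x`, and `y` once each as identifiers, with repeated occurrences mapping back to the existing entry.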

Conclusion

Tokens, patterns, and lexemes are basic elements of any programming language, helping to break down and make sense of code. Tokens are the smallest meaningful units; patterns define how such units are identified; lexemes are the actual character sequences that match those patterns. Understanding these concepts is indispensable for programming and analyzing code efficiently.

