Read this post if you don't know what Bison is.
Today I'll parse JSON into an AST and compare my parser with PHP's native json_decode() function.
To test our parser I will use this JSON file:
test.json
    {
        "fieldString": "string",
        "fieldNumber": 99,
        "fieldBoolTrue": true,
        "fieldBoolFalse": false,
        "fieldNull": null,
        "fieldEmptyArray": [],
        "fieldEmptyObject": {},
        "fieldArray": [
            "string",
            99,
            true,
            false,
            null,
            {},
            []
        ]
    }
First, we need to install PHP dependencies.
    composer require --dev mrsuh/php-bison-skeleton
    composer require mrsuh/tree-printer
    composer require doctrine/lexer
- mrsuh/php-bison-skeleton - to build a PHP parser with Bison
- mrsuh/tree-printer - to print the AST
- doctrine/lexer - to split text into tokens
We will store our files like this:
    .
    ├── /ast-parser
        ├── /bin
        │   └── parse.php # entry point to parse JSON
        ├── /lib
        │   └── parser.php # generated file
        ├── /src
        │   ├── Lexer.php
        │   └── Node.php # AST node
        └── grammar.y
The Node class must implement Mrsuh\Tree\NodeInterface so that the AST can be printed.
src/Node.php
    <?php

    namespace App;

    use Mrsuh\Tree\NodeInterface;

    class Node implements NodeInterface
    {
        private string $name;
        private string $value;
        /** @var Node[] */
        private array $children;

        public function __construct(string $name, string $value, array $children = [])
        {
            $this->name     = $name;
            $this->value    = $value;
            $this->children = $children;
        }

        public function getChildren(): array
        {
            return $this->children;
        }

        public function __toString(): string
        {
            // Compare with '' instead of empty(): empty('0') is true,
            // so a node holding the value "0" would lose its value.
            if ($this->value !== '') {
                return sprintf("%s: '%s'", $this->name, $this->value);
            }

            return $this->name;
        }
    }
I'll use the Doctrine lexer library. It helps to parse complex text.
src/Lexer.php
    <?php

    namespace App;

    use Doctrine\Common\Lexer\AbstractLexer;

    class Lexer extends AbstractLexer implements LexerInterface
    {
        ...

        // Patterns for the words we want to keep as tokens
        protected function getCatchablePatterns(): array
        {
            return [
                '\:', '\{', '\}', '\[', '\]', '\,',
                "\"[^\"]+\"",
                'true', 'false', 'null',
            ];
        }

        // Patterns for the text we want to skip
        protected function getNonCatchablePatterns(): array
        {
            return [
                ' ', '\n'
            ];
        }

        protected function getType(&$value): int
        {
            // Punctuation is returned as its ASCII code
            if (in_array($value, [':', '{', '}', '[', ']', ','], true)) {
                return ord($value);
            }

            if (is_numeric($value)) {
                return LexerInterface::T_NUMBER;
            }

            switch (strtolower($value)) {
                case 'true':
                case 'false':
                    return LexerInterface::T_BOOL;
                case 'null':
                    return LexerInterface::T_NULL;
            }

            return LexerInterface::T_STRING;
        }

        ...
    }
For example, the Lexer will translate this JSON

    {
        "array": [
            "string",
            99,
            true,
            false,
            null
        ]
    }

into these tokens:
word | token |
---|---|
{ | ASCII (123) |
"array" | LexerInterface::T_STRING (258) |
: | ASCII (58) |
[ | ASCII (91) |
"string" | LexerInterface::T_STRING (258) |
, | ASCII (44) |
99 | LexerInterface::T_NUMBER (259) |
, | ASCII (44) |
true | LexerInterface::T_BOOL (260) |
, | ASCII (44) |
false | LexerInterface::T_BOOL (260) |
, | ASCII (44) |
null | LexerInterface::T_NULL (261) |
] | ASCII (93) |
} | ASCII (125) |
(end of input) | LexerInterface::YYEOF (0) |
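To make the table easier to follow, here is a minimal stand-alone sketch of the same idea without the Doctrine library. It assumes the approach of splitting the text with one regular expression (catchable patterns in a capturing group, skipped text outside of it) and then classifying every word. The `tokenize()` function and the constants are hypothetical names for this sketch only; they are not part of the real Lexer class.

```php
<?php

// Token ids copied from the table above; the constant names are local to this sketch.
const T_STRING = 258;
const T_NUMBER = 259;
const T_BOOL   = 260;
const T_NULL   = 261;

/**
 * Split JSON text into [word, token id] pairs.
 *
 * @return array<int, array{0: string, 1: int}>
 */
function tokenize(string $json): array
{
    // The capturing group holds the words we keep; whitespace is matched but dropped.
    // Text between two matches (for example a number) is returned as a word too.
    $regex = '/(\:|\{|\}|\[|\]|\,|"[^"]+"|true|false|null)|\s+/';
    $words = preg_split($regex, $json, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

    $tokens = [];
    foreach ($words as $word) {
        if (in_array($word, [':', '{', '}', '[', ']', ','], true)) {
            $type = ord($word); // punctuation is its own ASCII code
        } elseif (is_numeric($word)) {
            $type = T_NUMBER;
        } elseif ($word === 'true' || $word === 'false') {
            $type = T_BOOL;
        } elseif ($word === 'null') {
            $type = T_NULL;
        } else {
            $type = T_STRING;
        }
        $tokens[] = [$word, $type];
    }

    return $tokens;
}

// Usage: print one "word => token id" pair per line.
foreach (tokenize('{ "array": [ "string", 99, true, false, null ] }') as [$word, $type]) {
    printf("%s => %d\n", $word, $type);
}
```

Like the pattern list in src/Lexer.php, this sketch does not handle escaped quotes inside strings.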
Time to create the grammar.y file and build lib/parser.php.
PHP's native json_decode() function itself uses a C parser generated by Bison, so we can take the ready-made Bison grammar file from the php-src repository and adapt it.
The grammar file is very small because the JSON standard is very simple.
We will use a %code parser block to define the property and methods that store the AST in the Parser class.
grammar.y
    %define api.parser.class {Parser}
    %define api.namespace {App}

    %code parser {
        private Node $ast;
        public function setAst(Node $ast): void { $this->ast = $ast; }
        public function getAst(): Node { return $this->ast; }
    }

    %token T_STRING
    %token T_NUMBER
    %token T_BOOL
    %token T_NULL

    %%
    start:
      value                 { self::setAst($1); }
    ;

    object:
      '{' members '}'       { $$ = $2; }
    ;

    members:
      %empty                { $$ = []; }
    | member                { $$ = [$1]; }
    | members ',' member    { $$ = $1; $$[] = $3; }
    ;

    member:
      T_STRING ':' value    { $$ = new Node('T_STRING', $1, [$3]); }
    ;

    array:
      '[' elements ']'      { $$ = $2; }
    ;

    elements:
      %empty                { $$ = []; }
    | value                 { $$ = [$1]; }
    | elements ',' value    { $$ = $1; $$[] = $3; }
    ;

    value:
      object                { $$ = new Node('T_OBJECT', '', $1); }
    | array                 { $$ = new Node('T_ARRAY', '', $1); }
    | T_STRING              { $$ = new Node('T_STRING', $1); }
    | T_NUMBER              { $$ = new Node('T_NUMBER', $1); }
    | T_BOOL                { $$ = new Node('T_BOOL', $1); }
    | T_NULL                { $$ = new Node('T_NULL', $1); }
    ;

    %%
bison -S vendor/mrsuh/php-bison-skeleton/src/php-skel.m4 -o lib/parser.php grammar.y
Command options:
- -S vendor/mrsuh/php-bison-skeleton/src/php-skel.m4 - path to the skeleton file
- -o lib/parser.php - output parser file
- grammar.y - our grammar file
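If the actions in grammar.y look cryptic: in Bison, `$$` is the value of the rule being reduced, and `$1`, `$2`, `$3` are the values of the symbols on its right-hand side. As a rough illustration, the three alternatives of the `elements` rule can be written as plain PHP functions. The function names below are hypothetical and exist only in this sketch; the generated parser does not contain them.

```php
<?php

// elements: %empty { $$ = []; }
function elementsEmpty(): array
{
    return [];
}

// elements: value { $$ = [$1]; }
function elementsFirst($value): array
{
    return [$value];
}

// elements: elements ',' value { $$ = $1; $$[] = $3; }
function elementsAppend(array $elements, $value): array
{
    $result   = $elements; // $$ = $1 (the list collected so far)
    $result[] = $value;    // $$[] = $3 (the value after the comma)

    return $result;
}

// The reductions for the input: [ "string", 99, true ]
$list = elementsFirst('string');
$list = elementsAppend($list, '99');
$list = elementsAppend($list, 'true');

var_dump($list);
```

Because the rule is left-recursive, the list grows one element per reduction, in the same order as the values appear in the JSON text. The `members` rule works the same way.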
The final PHP file is the entry point for the parser.
bin/parse.php
    <?php

    require_once __DIR__ . '/../vendor/autoload.php';

    use App\Parser;
    use App\Lexer;
    use Mrsuh\Tree\Printer;

    $lexer  = new Lexer(fopen($argv[1], 'r'));
    $parser = new Parser($lexer);
    if (!$parser->parse()) {
        exit(1);
    }

    $printer = new Printer();
    $printer->print($parser->getAst());
The generated lib/parser.php file is loaded through the files section of the Composer autoload configuration.
composer.json
    {
        "autoload": {
            "psr-4": {
                "App\\": "src/"
            },
            "files": ["lib/parser.php"]
        },
        ...
    }
Finally, we can test our parser.
    php bin/parse.php test.json
    .
    ├── T_OBJECT
        ├── T_STRING: 'fieldString'
        │   └── T_STRING: 'string'
        ├── T_STRING: 'fieldNumber'
        │   └── T_NUMBER: '99'
        ├── T_STRING: 'fieldBoolTrue'
        │   └── T_BOOL: 'true'
        ├── T_STRING: 'fieldBoolFalse'
        │   └── T_BOOL: 'false'
        ├── T_STRING: 'fieldNull'
        │   └── T_NULL: 'null'
        ├── T_STRING: 'fieldEmptyArray'
        │   └── T_ARRAY
        ├── T_STRING: 'fieldEmptyObject'
        │   └── T_OBJECT
        └── T_STRING: 'fieldArray'
            └── T_ARRAY
                ├── T_STRING: 'string'
                ├── T_NUMBER: '99'
                ├── T_BOOL: 'true'
                ├── T_BOOL: 'false'
                ├── T_NULL: 'null'
                ├── T_OBJECT
                └── T_ARRAY
It works!
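For a quick sanity check, the same kind of document can be passed to json_decode() and walked to produce a similar list of node names. This is only a sketch: `nodeName()` and `describe()` are hypothetical helpers written for this comparison, and the output format is simplified (plain indentation instead of the tree printer).

```php
<?php

/** Map a value returned by json_decode() to the node names used in our AST. */
function nodeName($value): string
{
    if (is_object($value)) {
        return 'T_OBJECT';
    }
    if (is_array($value)) {
        return 'T_ARRAY';
    }
    if (is_string($value)) {
        return 'T_STRING';
    }
    if (is_bool($value)) {
        return 'T_BOOL';
    }
    if ($value === null) {
        return 'T_NULL';
    }

    return 'T_NUMBER'; // int or float
}

/**
 * Describe the children of a decoded object or array, one line per node.
 *
 * @return string[]
 */
function describe($value, int $depth = 0): array
{
    $children = [];
    if (is_object($value)) {
        $children = get_object_vars($value);
    } elseif (is_array($value)) {
        $children = $value;
    }

    $lines = [];
    foreach ($children as $key => $child) {
        // Object members are labelled with their key, array elements are not.
        $label   = is_object($value) ? "T_STRING: '$key' -> " : '';
        $lines[] = str_repeat('  ', $depth) . $label . nodeName($child);
        $lines   = array_merge($lines, describe($child, $depth + 1));
    }

    return $lines;
}

// Without the second argument, JSON objects become stdClass and JSON arrays become PHP arrays.
$decoded = json_decode('{"fieldString": "string", "fieldNumber": 99, "fieldArray": ["string", true, null, {}, []]}');

echo nodeName($decoded), "\n", implode("\n", describe($decoded, 1)), "\n";
```

Note one difference: json_decode() converts numbers and booleans to native PHP types, while our AST keeps every value as the original string.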
It would be interesting to compare the native json_decode() function with our parser.
First, I need a JSON file for the benchmarks. I took the JSON data about the Pokémon Bulbasaur from the https://pokeapi.co API.
curl 'https://pokeapi.co/api/v2/pokemon/bulbasaur' > bench.json
The file size is 215 KB.
We need to modify our grammar.y file so that it does not create Node objects.
grammar-bench.y
    ...
    value:
      object      { $$ = $1; }
    | array       { $$ = $1; }
    | T_STRING    { $$ = $1; }
    | T_NUMBER    { $$ = $1; }
    | T_BOOL      { $$ = $1; }
    | T_NULL      { $$ = $1; }
    ...
bison -S ../../src/php-skel.m4 -o lib/parser.php grammar-bench.y
We are ready to start the comparison.
PHP 8.2

    php vendor/bin/phpbench run tests --report=my-report
    +-------------+----------+----------+--------+
    | subject     | mem_peak | mode     | rstdev |
    +-------------+----------+----------+--------+
    | benchNative | 2.539mb  | 1.570ms  | ±0.89% |
    | benchBison  | 12.443mb | 84.283ms | ±1.08% |
    +-------------+----------+----------+--------+

PHP 8.1

    php vendor/bin/phpbench run tests --report=my-report
    +-------------+----------+----------+--------+
    | subject     | mem_peak | mode     | rstdev |
    +-------------+----------+----------+--------+
    | benchNative | 2.593mb  | 1.595ms  | ±0.68% |
    | benchBison  | 18.471mb | 87.471ms | ±0.68% |
    +-------------+----------+----------+--------+

PHP 8.0

    php vendor/bin/phpbench run tests --report=my-report
    +-------------+----------+----------+--------+
    | subject     | mem_peak | mode     | rstdev |
    +-------------+----------+----------+--------+
    | benchNative | 2.700mb  | 1.586ms  | ±0.90% |
    | benchBison  | 18.578mb | 87.533ms | ±0.83% |
    +-------------+----------+----------+--------+

PHP 7.4

    php vendor/bin/phpbench run tests --report=my-report
    +-------------+----------+-----------+--------+
    | subject     | mem_peak | mode      | rstdev |
    +-------------+----------+-----------+--------+
    | benchNative | 2.857mb  | 1.725ms   | ±1.00% |
    | benchBison  | 18.735mb | 105.099ms | ±0.91% |
    +-------------+----------+-----------+--------+
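The PHP Bison parser shows its best result with PHP 8.2: 84.283 ms and 12.443 MB of peak memory.
Even there it is about 54 times slower than the native json_decode() function (84.283 ms against 1.570 ms) and uses about 5 times more memory.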
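The slowdown factor for each PHP version can be recomputed from the mode columns of the reports. The small script below does only that arithmetic; the numbers are copied from the tables above, and `slowdown()` is a hypothetical helper name.

```php
<?php

// Mode times in milliseconds, copied from the phpbench reports.
$results = [
    '8.2' => ['native' => 1.570, 'bison' => 84.283],
    '8.1' => ['native' => 1.595, 'bison' => 87.471],
    '8.0' => ['native' => 1.586, 'bison' => 87.533],
    '7.4' => ['native' => 1.725, 'bison' => 105.099],
];

/** How many times slower the Bison parser is than json_decode(), rounded to one decimal. */
function slowdown(array $row): float
{
    return round($row['bison'] / $row['native'], 1);
}

foreach ($results as $version => $row) {
    printf("PHP %s: %.1f times slower\n", $version, slowdown($row));
}
```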
I hope you found this interesting!
You can get the parser source code here and test it yourself.