Performance boost for
Written the .
A lot of work has been done to boost performance, mainly by
reducing the amount of memory required to lexically and syntactically
analyse a datum. Up to 80% of the memory can be saved with a large
datum. Also, a new feature allows compiling the in-memory parser into
a PHP class. Up to 45% of the CPU time can be saved during parser
initialisation.
Constant memory usage in the lexer
First of all, one of the biggest improvements lands in the lexer (aka the lexical analyser). The lexer is responsible for cutting a datum into a sequence of tokens (aka lexemes). Until today, it read the entire datum and produced the entire sequence at once. (Learn more by reading its documentation.)
The issue with this approach is that if the datum is 1 MB long, then the memory will contain 1 MB (the original datum) plus at least 1 MB (the sequence). Moreover, the sequence is represented as a PHP array, so it has an overhead of some bytes per token, in addition to the metadata attached to each token (such as the name, the namespace, the length, the offset etc.). The resulting overhead depends on the PHP version in use, but in any case this is not optimal.
Thus, a straightforward optimisation is to transform the lexer into an iterator. Actually, it has been transformed into a generator. So each call to the lexer will produce the next token, without stacking all the tokens in memory.
With this new approach, the API stays identical and the peak memory usage is drastically reduced.
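The idea of a lexer written as a generator can be sketched as follows. This is a minimal illustration in plain PHP, not the library's actual code: the `TOKENS` table, the `lex()` function and the shape of the yielded arrays are assumptions made for the example.

```php
<?php

// Hypothetical token definitions (name => PCRE pattern); not Hoa's own format.
const TOKENS = [
    'number' => '\d+',
    'plus'   => '\+',
    'space'  => '\s+',
];

/**
 * Lazily cuts $datum into tokens. Each token is yielded as soon as it is
 * recognised, so the whole sequence is never held in memory at once.
 */
function lex(string $datum): Generator
{
    $offset = 0;
    $length = strlen($datum);

    while ($offset < $length) {
        foreach (TOKENS as $name => $pattern) {
            // \G anchors the match at the current offset.
            $regex = '/\G(?:' . $pattern . ')/';

            if (1 === preg_match($regex, $datum, $match, 0, $offset)) {
                $start   = $offset;
                $offset += strlen($match[0]);

                // Whitespace is skipped, every other token is produced.
                if ('space' !== $name) {
                    yield ['token' => $name, 'value' => $match[0], 'offset' => $start];
                }

                continue 2; // Resume the outer loop at the new offset.
            }
        }

        throw new UnexpectedValueException("Unrecognised input at offset $offset.");
    }
}
```

A consumer simply iterates, for instance `foreach (lex('1 + 23') as $token) { … }`, and only one token exists at a time unless the consumer decides to keep more.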
Save memory and CPU in the parser, and pragmas
Now, the parser benefits from the lexer being an iterator. Method calls have been reduced, so many CPU cycles have been saved. Indirections have been reduced too: when a datum points to another datum that points to the final one, we count 2 indirections to get the final datum. Each indirection has a cost, and this cost has been reduced.
Also, the lexer is wrapped inside a buffer iterator. This brings a new
feature. One may remember that the parser is
Hoa\Compiler\Llk\Parser, so it is supposed to be LL(k). But
in practice it was LL(*). Now it is a real LL(k) parser and we can set
the value of k, thanks to a new feature introduced in the
grammar description language: pragmas. Indeed, by writing:
%pragma parser.lookahead 0
we obtain an LL(0) parser, for free. If the parser needs to go beyond the value of k, an exception will be thrown.
And with a buffer iterator wrapping the lexer, it is still possible to move forward and backward in the lexer without lexing tokens several times.
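To make the combination of buffering and bounded lookahead more concrete, here is a minimal sketch in plain PHP. The `LookaheadBuffer` class and its `peek()` and `consume()` methods are hypothetical names chosen for this example, not the library's iterator. It only illustrates two points: tokens already read are kept so that they are never lexed twice, and looking further than k tokens ahead raises an exception. Stepping backward is not shown.

```php
<?php

/**
 * Hypothetical wrapper around a token iterator (for illustration only).
 */
final class LookaheadBuffer
{
    private $tokens;       // the wrapped lexer, as an Iterator
    private $k;            // maximum number of tokens of lookahead
    private $buffer = [];  // tokens already lexed but not yet consumed

    public function __construct(Iterator $tokens, int $k)
    {
        $this->tokens = $tokens;
        $this->k      = $k;
    }

    /**
     * Returns the token $distance positions ahead (0 is the current one),
     * or null when the end of the input is reached.
     */
    public function peek(int $distance = 0)
    {
        if ($distance < 0 || $distance > $this->k) {
            throw new OutOfRangeException(
                "Lookahead of $distance is outside the range allowed by k = {$this->k}."
            );
        }

        // Pull from the lexer only the tokens that are not buffered yet.
        while (count($this->buffer) <= $distance && $this->tokens->valid()) {
            $this->buffer[] = $this->tokens->current();
            $this->tokens->next();
        }

        return $this->buffer[$distance] ?? null;
    }

    /**
     * Consumes the current token and returns it (null at the end).
     */
    public function consume()
    {
        $token = $this->peek(0);
        array_shift($this->buffer);

        return $token;
    }
}
```

With `$k` set to 0, only `peek(0)` is accepted, which is the behaviour one would expect from a parser that never looks past the current token.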
Exporting the parser into PHP code
Hoa\Compiler\Llk is a compiler-compiler. This means that a grammar
is compiled into a parser, and then the parser is used to get a
compiler. This first compilation, from grammar to in-memory parser,
does not need to happen every time. Actually, it only needs to be done
once, before going to production.
So far, the solution was to serialise the in-memory parser and save it into a file. Serialisation is dangerous from a security point of view. The source must be trusted. It was not a comfortable situation for our users.
Now, it is possible to save the in-memory parser as a string representing PHP code. This PHP code instantiates the same parser (at its initial state). The parser takes the form of a PHP class. Thus, one might write this PHP code into a file, commit this file and use it like any regular PHP class, with autoloaders and so on.
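The general technique, generating PHP source instead of serialising an object, can be sketched in plain PHP. Everything below is an assumption made for illustration: `exportAsClass()`, the `ExpressionParser` class name and the tiny `$rules` table are hypothetical and do not reflect the library's internal representation.

```php
<?php

/**
 * Hypothetical exporter: turns an in-memory table of rules into the source
 * of a PHP class whose constructor rebuilds the same table.
 * $className is assumed to be a valid PHP class name.
 */
function exportAsClass(array $rules, string $className): string
{
    return
        'class ' . $className . "\n" .
        "{\n" .
        "    public \$rules;\n\n" .
        "    public function __construct()\n" .
        "    {\n" .
        // var_export() produces valid PHP source for arrays and scalars.
        '        $this->rules = ' . var_export($rules, true) . ";\n" .
        "    }\n" .
        "}\n";
}

// A toy "parser state", standing in for the real in-memory parser.
$rules = ['expression' => ['number', 'plus', 'number']];
$code  = exportAsClass($rules, 'ExpressionParser');

// Write the class to a file once, for example before deploying...
$file = tempnam(sys_get_temp_dir(), 'parser');
file_put_contents($file, "<?php\n\n" . $code);

// ...then load it like any other PHP file. Nothing is unserialised.
require $file;
$parser = new ExpressionParser();
```

The generated file can be reviewed and versioned like any other source file, which is the main difference with a serialised blob.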
Disabling Unicode support
Another pragma has been introduced to disable Unicode support in the lexer:
%pragma lexer.unicode false
This is particularly useful when the grammar defines a binary language or
defines its own Unicode support. This is the case of
The RFC 7159 is being fully implemented and JSON uses the same UTF-8
correctly lexically analyse a JSON string with PCRE Unicode support
enabled. With this new pragma, this is possible.
Also, the JSON grammar defines an LL(0) parser thanks to the parser.lookahead pragma.
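The following sketch shows, with PHP's own PCRE functions, why Unicode support can get in the way of binary data. It does not use the library. How the lexer builds its regular expressions internally is not described here, so the link between the pragma and the `u` modifier is an assumption.

```php
<?php

// A lone 0xFF byte is perfectly valid binary data, but it is not valid UTF-8.
$binary = "\xFF";

// Without the `u` modifier, PCRE works on raw bytes and the byte matches.
$withoutUnicode = preg_match('/^.$/s', $binary);

// With the `u` modifier, PCRE rejects the subject as malformed UTF-8:
// preg_match() returns false and the error code is PREG_BAD_UTF8_ERROR.
$withUnicode = preg_match('/^.$/su', $binary);
$error       = preg_last_error();
```

A grammar that describes bytes rather than characters therefore needs a way to switch this behaviour off.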
Quality with Grammar-based Testing algorithms
Integration test suites were already landing in the
library, but there were no unit test suites. This uncomfortable situation
is now fixed. New integration test suites have been written too. Now we have:
- 24 test suites,
- 136 test cases,
- 320,242 assertions.
Yes, this is not a mistake: 320,242 assertions. We obtain this number by
using Grammar-based Testing algorithms, defined in the following research
paper: Grammar-Based Testing using Realistic Domains in PHP. This paper
has been written by the authors of the
Hoa\Compiler library. You might
enjoy reading this article.
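To give an intuition of how a grammar can produce so many assertions, here is a minimal sketch of one possible flavour, bounded exhaustive generation, in plain PHP. The toy `GRAMMAR`, the `derive()` and `deriveSequence()` functions and the regular expression standing in for the parser under test are all assumptions made for this example. Whether the test suites rely on this exact strategy is not stated here.

```php
<?php

// Hypothetical toy grammar: each rule maps to a list of alternatives, and
// each alternative is a sequence of rule names or literal tokens.
const GRAMMAR = [
    'expr' => [['term'], ['term', '+', 'expr']],
    'term' => [['0'], ['1']],
];

/**
 * Yields every sentence derivable from $symbol with at most $depth nested
 * rule expansions.
 */
function derive(string $symbol, int $depth): Generator
{
    if (!array_key_exists($symbol, GRAMMAR)) {
        yield $symbol; // A literal token derives itself.

        return;
    }

    if ($depth <= 0) {
        return; // Bound reached: this branch produces nothing.
    }

    foreach (GRAMMAR[$symbol] as $alternative) {
        yield from deriveSequence($alternative, $depth - 1);
    }
}

/**
 * Yields every concatenation of sentences derived from a sequence of symbols.
 */
function deriveSequence(array $symbols, int $depth): Generator
{
    if ([] === $symbols) {
        yield '';

        return;
    }

    $head = array_shift($symbols);

    foreach (derive($head, $depth) as $prefix) {
        foreach (deriveSequence($symbols, $depth) as $suffix) {
            yield $prefix . $suffix;
        }
    }
}

// Each generated sentence becomes one assertion against the code under test.
// A regular expression stands in for the real parser here.
$assertions = 0;

foreach (derive('expr', 3) as $sentence) {
    assert(1 === preg_match('/^[01](?:\+[01])*$/', $sentence));
    $assertions++;
}
```

Raising the depth bound makes the number of generated sentences, and therefore of assertions, grow very quickly, even for a two-rule grammar.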
Finally, 2 bugs have been found and fixed.
We still have ideas for new optimisations.
more and more used. A recent example is
RulerZ. We have also heard
best quality as much as possible.