Markup Blitz is a parser generator implemented in Java. It takes a context-free grammar, generates a parser for it, and uses the parser to convert input into a parse tree, provided that the input conforms to the grammar.
The algorithms used are
It also uses concepts from REx Parser Generator.
Markup Blitz is an implementation of Invisible XML (ixml). Please see the ixml specification for details on grammar notation and parse-tree serialization.
Use JDK 11 or higher. For building Markup Blitz, clone this GitHub repository and go to the resulting directory:
git clone https://github.com/GuntherRademacher/markup-blitz.git
cd markup-blitz
Then run this command, on Unix/Linux,
./gradlew clean jar
or this one, on Windows
gradlew clean jar
This creates build/libs/markup-blitz.jar
which serves the Markup Blitz API. It is also usable as an executable jar for standalone execution.
For running the tests, use this command on Unix/Linux,
./gradlew test
or this one on Windows:
gradlew test
Markup Blitz comes with a few tests, but it also passes all of the 5168 tests in the Invisible XML community project ixml. For running those as well, make sure that the ixml project is available next to the markup-blitz project and use the above command. Executing these tests takes a few minutes, however some of the performance tests are skipped by default, because they require significantly more time or memory. For running those as well, set property ALL_TESTS
to true
:
./gradlew test -PALL_TESTS=true
(on Windows omit the leading ./
). Note that this causes JVM arguments to be set for a heap size of 16GB and a stack size of 4MB. Execution may take more than half an hour in this case.
The project can be imported into Eclipse as a Gradle project.
Markup Blitz is available on Maven Central with groupId de.bottlecaps
and artifactId markup-blitz
.
Markup Blitz can be run from command line to process some input according to an Invisible XML grammar:
Usage: java -jar markup-blitz.jar [<OPTION>...] [<GRAMMAR>] <INPUT>
Compile an Invisible XML grammar, and parse input with the resulting parser.
<GRAMMAR> the grammar (literal, file name or URL), in ixml notation.
When omitted, the ixml grammar will be used.
<INPUT> the input (literal, file name or URL).
<OPTION>:
--indent generate resulting xml with indentation.
--trace print parser trace.
--fail-on-error throw an exception instead of returning an error document.
--timing print timing information.
--verbose print intermediate results.
A literal grammar or input must be preceded by an exclamation point (!).
All inputs must be presented in UTF-8 encoding, and output is written in
UTF-8 as well. Resulting XML goes to standard output, all diagnostics go
to standard error.
BaseX uses Markup Blitz to implement fn:invisible-xml
. It can be tested online on BXFiddle.
A parser is generated by passing an Invisible XML grammar as a string to the generate
method of class de.bottlecaps.markup.Blitz
. This returns an object of type de.bottlecaps.markup.blitz.Parser
, which has a parse
method accepting the input string as a parameter and returns the resulting XML as a string.
Generate a parser from an Invisible XML grammar in ixml notation.
public static Parser generate(String grammar, Option... blitzOptions) throws BlitzException
Parameters:
String grammar
: the Invisible XML grammar in ixml notationOption... blitzOptions
: options for use at generation time and parsing time
Returns: Parser
: the generated parser
Throws: BlitzException
: if any error is detected while generating the parser
Parse the given input.
public String parse(String input, Option... options)
Parameters:
String input
: the input stringOption options
: options for use at parsing time. If absent, any options passed at generation time will be in effect
Returns: String
: the resulting XML
Either of the generate
and parse
methods accepts Option
arguments for creating extra diagnostic output. Generation time options are passed to the Parser
object implicitly, and they are used at parsing time, when parse
is called without any options.
/** Parser and generator options. */
public enum Option {
/** Parser option: Generate XML with indentation. */ INDENT,
/** Parser option: Print parser trace. */ TRACE,
/** Parser option: Fail on parsing error. */ FAIL_ON_ERROR,
/** Generator option: Print timing information. */ TIMING,
/** Generator option: Print information on intermediate results. */ VERBOSE;
}
As with REx Parser Generator, the goal of Markup Blitz is to provide good performance. In general, however, REx parsers can be expected to perform much better. This is primarily because REx allows separating the specification into tokenization and parsing steps. This is in contrast to Invisible XML, which uses a uniform grammar to resolve from the start symbol down to codepoint level. Separate tokenization enables the use of algorithms optimized for this purpose, the establishment of token termination rules, and the easy accommodation of whitespace rules. Without it, all of this has to be accomplished by the parser alone, which often leads to costly handling of local ambiguities.
Some performance comparison of REx-generated parsers and Invisible XML parsers can be found in the rex-parser-benchmark project. This also covers a measurement of Markup Blitz performance.
Copyright (c) 2023-2024 Gunther Rademacher. Markup Blitz is provided under the Apache 2 License.
The work in this project was supported by the BaseX organization.