The WHATWG HTML Draft Recommendation includes § 8.2, Parsing HTML documents, which “only applies to user agents, data mining tools, and conformance checkers.” It still matters to all of us, though, because it takes implementations of a spec to move the spec forward.
Error handling is a large part of parsing:
This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined: user agents must either act as described below when encountering such problems, or must abort processing at the first error that they encounter for which they do not wish to apply the rules described below.
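To make those two options concrete, here is a toy sketch in Python. It is not the HTML 5 parsing algorithm and is not taken from Henri's code or the spec. The `parse` function, the `strict` flag, the `ParseError` class, and the error messages are all made up for illustration. It only matches simple tags, but it shows the choice the spec describes: either recover from each parse error in a defined way, or abort at the first one.

```python
import re


class ParseError(Exception):
    """Raised in strict mode at the first parse error (hypothetical name)."""


# Deliberately simplistic: matches <name> and </name> only, no attributes.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>")


def parse(markup, strict=False):
    """Toy tag matcher. Returns (start_tags_seen, parse_errors)."""
    stack, seen, errors = [], [], []

    def error(message):
        # Option 2 in the quote: abort at the first error.
        if strict:
            raise ParseError(message)
        # Option 1: note the error and carry on with a defined recovery.
        errors.append(message)

    for match in TAG.finditer(markup):
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        if not closing:
            stack.append(name)
            seen.append(name)
        elif name in stack:
            # Recovery: implicitly close anything left open inside this element.
            while stack[-1] != name:
                error(f"unclosed <{stack.pop()}> before </{name}>")
            stack.pop()
        else:
            # Recovery: ignore an end tag that matches nothing.
            error(f"stray end tag </{name}>")

    while stack:
        error(f"unclosed <{stack.pop()}> at end of input")
    return seen, errors
```

Calling `parse("<p><b>hi</p></i>")` returns the start tags it saw plus a list of two errors, while `parse("<p><b>hi</p></i>", strict=True)` raises `ParseError` at the first problem. The point is that both behaviours are predictable, which is what "well-defined" buys you.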
Anyway, I’m writing about this because Henri Sivonen has a “preliminary build” of the HTML 5 parser for Gecko, which is celebrated as “a step out of the vaporware land” for this effort. It’s admittedly early going and not consumer-friendly yet, but for those of us who care about the future of browsers this is definitely an important step.
The famous and respected Sam Ruby discusses it more intelligently than I can hope to:
Henri’s approach is interesting. He starts from a single source, in Java. The Java code can be compiled to Java byte codes, JavaScript source, or C++ presumably making use of Mozilla libraries for things such as memory management. If he can do that, it seems to me to be a rather small leap from there to producing C++ using, say, either Ruby or Python libraries for memory management, as well as a thin binding to the language. C# would also be a reasonable target.
For the more adventurous, follow along with some of the comments on Sam’s post. Some nicely techie things are discussed there.