In this wikitech-l thread from August 2, 2016, I (Subbu Sastry) outlined how one might get to a formal spec for wikitext. But whether we want a spec for wikitext, and whether we ought to work towards one, is a different question. Rob Lanphier, in a followup, proposed an ArchCom discussion on this question, i.e. whether there ought to be a formal spec for wikitext. A spec for wikitext is also one of the proposals for how to deal with the 2-systems problem we have with wikitext parsing.
With the goal of guiding that ArchCom discussion, I want to address this question in this note, i.e. whether there ought to be a formal spec for wikitext, and whether it is a worthwhile goal to work towards. The most important argument I want to make here is that the question of whether we want a spec cannot be separated from the question of what our goals are in wanting one (or, put another way, what might a formal spec give us?). While that is probably an obvious observation for most of you, I want to make it explicit so we don't lose track of it in our discussions.
Complexities of "parsing" wikitext: The Parsoid Experience[edit]As I noted in passing in the preamble to that wikitech-l thread, when talking about wikitext and the difficulty of parsing wikitext, there is often a focus on how there doesn't exist a BNF grammar for wikitext and how it is context-sensitive, etc. However, I think that is not as relevant compared to what I think are other sources of complexity in wikitext parsing. I alluded to them in that previously referenced email, but I am organizing them here a bit more coherently than in the email. These come from our experience developing (which is ongoing) Parsoid to support its editing clients and achieve parity with the output of the core parser.
You could replace wikitext with Markdown and none of the problems outlined below would be solved. That is why I think the focus ought to be on these other issues rather than on syntax.
Some of the confusion and imprecise discussion around a spec is also a problem of nomenclature. It is easier to see this by looking at a traditional language compiler. A traditional compiler has several very distinct architectural phases: a parsing phase (traditionally comprising a lexer and a parser), a semantic analysis phase (type checking, etc.), a code optimization phase, and a code generation phase. The parser is just one part of the pipeline, and later passes build on its output. Parsing is a very limited part of the pipeline that takes a source-level program to executable code.
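To make the analogy concrete, here is a deliberately trivial sketch of such a pipeline in Python. None of the names are real APIs; they only illustrate that the parser is one narrow phase among several.

```python
# A deliberately trivial, hypothetical "compiler" for expressions like
# "1 + 2 + 3". Every name here is a placeholder, not a real API; the
# point is only that parsing is one stage among several.

def lex(source):
    # Lexing: characters -> tokens
    return source.split()

def parse(tokens):
    # Parsing: tokens -> a (very flat) syntax tree
    return [int(t) for t in tokens if t != "+"]

def analyze(tree):
    # Semantic analysis: here, just a sanity check
    assert all(isinstance(n, int) for n in tree)
    return tree

def optimize(tree):
    # Optimization: constant-fold the whole expression
    return sum(tree)

def generate_code(value):
    # Code generation: emit an "instruction" for a pretend machine
    return f"PUSH {value}"

def compile_program(source):
    return generate_code(optimize(analyze(parse(lex(source)))))

print(compile_program("1 + 2 + 3"))  # PUSH 6
```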
Based on the previous section, it should be fairly clear that a well-specified BNF grammar for wikitext, or switching to Markdown with a clean grammar, does not really help with building an alternative wikitext "runtime" (the blurb on the Parsoid page refers to Parsoid as a bidirectional wikitext "runtime", though it is not really a runtime in the conventional sense).
In today's MediaWiki incarnation as it is used in the Wikimedia universe, we need to be able to transform wikitext to HTML and transform HTML to wikitext (sometimes with additional constraints when the HTML is derived from an edit).
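A minimal interface sketch of those two transformations is below. This is not Parsoid's or MediaWiki's actual API; the function names and the SelserContext type are hypothetical, and the optional context parameter stands in for the extra constraint mentioned above (when the HTML comes from an edit, the original wikitext is consulted so unedited parts round-trip cleanly).

```python
# Hypothetical interface sketch, not a real MediaWiki/Parsoid API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SelserContext:
    """Original document, consulted when serializing edited HTML."""
    original_wikitext: str
    original_html: str

def wikitext_to_html(wikitext: str) -> str:
    """Forward transformation: wikitext -> HTML (placeholder)."""
    raise NotImplementedError

def html_to_wikitext(html: str, context: Optional[SelserContext] = None) -> str:
    """Reverse transformation: HTML -> wikitext (placeholder).

    If context is provided (the HTML is derived from an edit), unmodified
    regions should serialize back to their original wikitext.
    """
    raise NotImplementedError
```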
Given these observations, "parser" is a fairly loose and inaccurate term for what we might try to develop a spec for. In any case, this discussion of nomenclature is, once again, meant to expand our focus beyond syntactic details and variations to the full process of transforming wikitext to HTML and vice versa.
Why come up with a spec
Some goals for writing a spec could be:
But, while we could try to spec the wikitext -> HTML and HTML -> wikitext behavior as it exists today in the core parser and Parsoid, that does not get us much beyond some documentation. As described in the earlier section, the processing model continues to be complicated enough to discourage any forays into alternative implementations.
But, if one approaches a spec with the goal of identifying unnecessary and accidental complexity that has crept in over the years and of evolving a newer processing model, it seems far more useful to me. For example, one approach would be to come up with a spec that aims to enable document composition from smaller fragments and thus gives the document a structure in terms of its nested fragments (a sketch of this idea appears further below). This enables a bunch of things in turn:
The resulting simplification may also enable pluggable third-party implementations of the bidirectional transformations for different resource constraints. I think a shared-hosting wiki and a Wikimedia wiki are entirely different beasts, and there ought to be different options for what kind of "wikitext runtime" suits each of those scenarios without imposing the entire development burden on WMF. Maybe naively, I think a spec could provide a somewhat elegant solution to the vexing problem of third-party wikis, though, for sure, it only solves part of that problem.
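To make the fragment-composition idea above a bit more concrete, here is a minimal, hypothetical sketch. It is not an actual Parsoid or MediaWiki design; it only illustrates a page modeled as a tree of nested fragments, each of which produces well-formed output independently and is composed into its parent.

```python
# Hypothetical sketch of document composition from nested fragments.
# Not a real MediaWiki/Parsoid API.
from dataclasses import dataclass, field
from html import escape

@dataclass
class Fragment:
    wikitext: str
    children: list = field(default_factory=list)

    def to_html(self) -> str:
        # Each fragment renders independently (stubbed here as escaping),
        # and children compose into the parent as well-formed units.
        inner = "".join(child.to_html() for child in self.children)
        return f"<div data-fragment>{escape(self.wikitext)}{inner}</div>"

page = Fragment("intro text", children=[
    Fragment("{{infobox ...}}"),             # e.g. a template expansion
    Fragment("section text", children=[
        Fragment("{{citation ...}}"),
    ]),
])
print(page.to_html())
```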
Separately, it may enable tools that require wikitext parsing that don't have to be called mwparserfromhell.
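As an illustration of the kind of tooling this refers to, here is a short example using the existing third-party mwparserfromhell Python library (the wikitext snippet itself is made up):

```python
import mwparserfromhell

text = "{{Infobox person|name=Ada Lovelace}} She was a [[mathematician]]."
wikicode = mwparserfromhell.parse(text)

# List templates and their parameter names.
for template in wikicode.filter_templates():
    print(template.name, [str(p.name) for p in template.params])

# List wikilink targets.
for link in wikicode.filter_wikilinks():
    print(link.title)
```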
What kind of specs can we develop?
There are different kinds of specs out there.
In the course of developing Parsoid, test coverage has expanded to cover wikitext usage as seen on various wikis, as well as to specify the behavior of the conversion from edited HTML back to wikitext. So, what we now have for wikitext is, effectively, a spec based on test coverage.
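For a flavor of what that test-based spec looks like, here is a small, invented example in the style of MediaWiki's parserTests.txt format:

```
!! test
Simple paragraph with italics
!! wikitext
This is ''italic'' text.
!! html
<p>This is <i>italic</i> text.
</p>
!! end
```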
The goals from the previous section could dictate the form that a spec could take.
Of course, all these different forms above are not mutually exclusive. They target different audiences.
Other relevant documents:
These are likely topics at the Wikimedia Parsing Team's offsite in October 2016.
While I am the initial author of this document, much of it comes from collective work on the Parsoid project and from many discussions within and outside the parsing team.
Arlo, Rob, and Kunal encouraged the idea of thinking about a spec in the context of the 2-systems problem.
Rob especially emphasized that the 2-systems problem should be seen not just as a problem, but as an opportunity to make forward progress on a long-standing one, i.e. it can clarify the inherent and accidental complexities of the current wikitext processing model.