Also see <http://wiki.whatwg.org/wiki/Parser_tests>.
On 20/07/2013 12:04, Mohammad Houssami wrote:
> 1- I am currently working on building an HTML5 parser according to the WHATWG spec, and while testing the tokenizer using the html5lib tests I have noticed that some of them have bugs. In general, these are the major things I have noticed.
>
> I am referring to this set of tests: http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/test1.test
> There is some similar stuff in test2, test3 and test4, but let's just stick with test1 for now.
>
> I have noticed that in places where you have doctype tokens like this one:
>
> {"description":"Correct Doctype lowercase", "input":"<!DOCTYPE html>", "output":[["DOCTYPE", "html", null, null, true]]}
>
> the force-quirks flag is set to true, whereas the specification says it is either on or off. Like the EOF example here: http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#doctype-state

"on" is mapped to false, "off" to true. Sure, we could use strings, but I'm not sure that changing this is worthwhile given the number of implementations already using the tests (as well as it being a fair amount of work to change!).
> 2- In places where character tokens follow one another, the tokenizer gives one character token with all the characters as data, when it should give one character token for every single character. Here is an example:
>
> {"description":"Ampersand ampersand EOF", "input":"&&", "output":[["Character", "&&"]]}
>
> My expected output for this is two character tokens, each with ampersand data, rather than just one token.

The tests compress adjacent character tokens down to one token, as most implementations do, because it makes using the tests simpler. This can obviously be reversed for something doing a 1:1 implementation of the spec.
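For a tokenizer that emits one character token per character, compressing adjacent character tokens before comparing against the expected output could look roughly like the sketch below. It assumes your tokens are already in the tests' list form (["Character", data] and so on); the name coalesce_characters is made up for illustration and is not part of html5lib.

```python
def coalesce_characters(tokens):
    """Merge runs of adjacent ["Character", data] tokens into a single token.

    Anything that is not a ["Character", ...] list is passed through untouched.
    The input list is not modified.
    """
    merged = []
    for token in tokens:
        is_char = isinstance(token, list) and token[:1] == ["Character"]
        prev_is_char = (
            bool(merged)
            and isinstance(merged[-1], list)
            and merged[-1][:1] == ["Character"]
        )
        if is_char and prev_is_char:
            # Build a new list rather than mutating the caller's token.
            merged[-1] = ["Character", merged[-1][1] + token[1]]
        else:
            merged.append(token)
    return merged


# The "&&" case from the test file: two one-character tokens become one.
print(coalesce_characters([["Character", "&"], ["Character", "&"]]))
# prints [['Character', '&&']]
```

Doing this on the output side keeps the tokenizer itself 1:1 with the spec; only the test harness needs to know about the compression.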
> 3- Assuming true stands for on and false for off, many force-quirks flags are inverted: true (on) is given where it has to be false (off). The earlier case I gave is an example. The states that should be covered with this input are the following:
>
> Data state: <!DOCTYPE html>
> Tag open state: <!DOCTYPE html>
> Markup declaration open state: <!DOCTYPE html>
> Doctype state: <!DOCTYPE html>
> Before doctype name state: <!DOCTYPE html>
> Doctype name state: <!DOCTYPE html>
> Doctype name state: <!DOCTYPE html>
> Doctype name state: <!DOCTYPE html>
> Doctype name state: <!DOCTYPE html>
>
> The state says the following:
>
> U+003E GREATER-THAN SIGN (>)
> Switch to the data state <http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#data-state>. Emit the current DOCTYPE token.
>
> And then in the data state the EOF is read, so there is nothing about the force-quirks flag, and the specification says the following:
>
> "When a DOCTYPE token is created, its name, public identifier, and system identifier must be marked as missing (which is a distinct state from the empty string), and the *force-quirks flag* must be set to *off* (its other state is *on*)."
>
> So by default it has to be off (false).
See above, true/false map the non-obvious way.
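To make the mapping concrete, here is a minimal sketch of turning a DOCTYPE token into the form the tests expect. It assumes your tokenizer stores force-quirks as a boolean where True means "on"; the function name doctype_to_test_output is hypothetical, not something html5lib provides.

```python
import json


def doctype_to_test_output(name, public_id, system_id, force_quirks):
    """Convert a DOCTYPE token to the test format.

    force_quirks: True means the spec's force-quirks flag is "on".
    The last field in the test output is the inverse of that flag:
    "on" is mapped to false, "off" to true.
    """
    return ["DOCTYPE", name, public_id, system_id, not force_quirks]


# The test quoted earlier in this thread.
test = json.loads(
    '{"description":"Correct Doctype lowercase", "input":"<!DOCTYPE html>", '
    '"output":[["DOCTYPE", "html", null, null, true]]}'
)

# For "<!DOCTYPE html>" force-quirks stays "off", so the test expects true.
actual = [doctype_to_test_output("html", None, None, force_quirks=False)]
print(actual == test["output"])  # prints True
```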
> Now there is one thing I am not certain about: whether this output is the output after parsing happens. I am testing the tokenizer without any of the tree construction stages, and this might be the problem.

No, the tokenizer tests only check the tokens that are passed to the parser. You need no parser to run them (though make sure you start in the right initial state, which I believe is the "initialState" property, defaulting to the data state, fairly obviously).
HTH,
Geoffrey.