On 22 Mar 2009, at 00:19, Edward Z. Yang wrote:
>> Also, the unit tests don't run on a fresh checkout, which makes me
>> not really want to try and change what is listed below. If you can
>> get them working, I'll have a shot at some of the below (a lot of
>> which apply all over the place).
>
> The testing framework was not documented, my apologies. If you
> create a test-settings.php file in the PHP directory and set
> $simpletest_location to the location of SimpleTest (you have to use
> the latest SVN checkout of that project), it will work.

That seems to work. I added the requirement for SVN in the README.

>> <http://hg.gsnedders.com/unicode/> has a UTF-8 decoder that can be
>> used as a base for our own implementation.
>
> Ok. How do you suggest including it in html5lib?

Basically, we have two choices. Either we decode the UTF-8 string and
instantly re-encode it as UTF-8, which has the side-effect of having to
decode it again to get a character offset for column position. Or we
decode it to an array of UTF-32 code units: we would then simply count
incrementally as we moved over the array to keep track of the column,
and we would only serialize when we had to (e.g., when interacting with
DOM).

>> Is there any reason to actually track column normally? I can
>> understand wanting it on parse-errors, but in that case I'd rather
>> just calculate it on-error, and not take the cost of calculating it
>> normally.
>
> I'm not convinced it's possible to calculate it on-error, since many
> errors happen after tokenization has already occurred. I know at least
> the Python implementation always calculates this.

Would having a method that calculated it not be equally usable as it is
now? Remember the Python impl. has the advantage of not having to deal
with UTF-8 in interpreted code, which helps massively on this (indeed,
it is probably the right choice for the Python implementation).
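
To make the "method that calculated it" idea concrete, here is a minimal sketch of computing the column on demand from a byte offset into the raw UTF-8 input. It is written in Python over a bytes object purely for illustration; the function name column_at and its signature are hypothetical and not part of html5lib. It assumes the input is valid UTF-8, that the offset falls on a character boundary, and that columns are counted in code points from zero.

```python
def column_at(data: bytes, offset: int) -> int:
    """Return the 0-based column, in code points, of byte `offset`.

    Hypothetical helper: assumes `data` is valid UTF-8 and `offset`
    points at the first byte of a character.
    """
    # Find the start of the current line. rfind returns -1 when there
    # is no earlier newline, so adding 1 gives 0 for the first line.
    line_start = data.rfind(b"\n", 0, offset) + 1

    # Every code point contributes exactly one byte that is not a
    # continuation byte (continuation bytes look like 10xxxxxx), so
    # counting those bytes counts code points without a full decode.
    return sum(1 for b in data[line_start:offset] if (b & 0xC0) != 0x80)
```

The cost is only paid when a column is actually requested, for example when reporting a parse error, as long as the tokenizer keeps the byte offset of each token around.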
>> I guess this would work if you could just get away with setting the
>> locale in Tokenizer::parse and then just changing it back (how? I
>> see no way to get the initial value) at the end.
>
> So, I *really* wish the ctype functions would just be for the C
> locale, all the time. I think I've used them improperly in HTML
> Purifier too.
>
> I don't see any way to get it back either.

It's horrible, but then again so is a lot of PHP's (non-)design, for
that matter. :) The worst thing I've had to do is re-implement
strtolower()/strtoupper() to work for only ASCII characters, which made
what I had to do dog-slow. Yay. :\

>>> private function bogusCommentState() {
>>> /* (This can only happen if the content model flag is set to
>>> the PCDATA state.) */
>>
>> Can we add an assert to check such statements?
>
> Unlike compiled languages, where asserts can be removed at
> compile-time, adding an assert would just be annoying to the
> end-user, and not really help us flush out bugs.

That's not true. See the assert.active ini option and assert_options().

>> We should create elements in the HTML namespace.
>
> I thought HTML5 didn't believe in namespaces? (I suppose I haven't
> been following the WHATWG discussion closely enough).

See <http://www.whatwg.org/specs/web-apps/current-work/#insert-an-html-element>
for the HTML case. It creates all the elements in their correct
namespace in the DOM, though there is no way to explicitly set a
namespace in HTML.

--
Geoffrey Sneddon
<http://gsnedders.com/>

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "html5lib-discuss" group.
To post to this group, send email to html5lib-discuss@googlegroups.com
To unsubscribe from this group, send email to html5lib-discuss+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/html5lib-discuss?hl=en-GB
-~----------~----~----~----~------~----~------~--~---