Showing content from https://www.mail-archive.com/html5lib-discuss@googlegroups.com/msg00321.html below:

Re: Review of PHP Implementation

On 22 Mar 2009, at 00:19, Edward Z. Yang wrote:
>> Also, the unit tests don't run on a fresh checkout, which makes me  
>> not really want to try and change what is listed below. If you can  
>> get them working, I'll have a shot at some of the below (a lot of  
>> which apply all over the place).
>
> The testing framework was not documented, my apologies. If you create a
> test-settings.php file in the PHP directory and set $simpletest_location
> to the location of SimpleTest (you have to use the latest SVN checkout
> of that project), it will work.

That seems to work. I added the requirement for SVN in the README.
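
For anyone else setting this up, a minimal sketch of the settings file described above. Only the file name and the $simpletest_location variable come from this thread; the path is a placeholder, and whether a trailing slash is needed is an assumption.

```php
<?php
// test-settings.php, placed in the PHP directory of the html5lib checkout.
// Placeholder path: point it at your own SVN checkout of SimpleTest.
// The trailing slash is an assumption, not something confirmed in the thread.
$simpletest_location = '/path/to/simpletest/';
```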

>> <http://hg.gsnedders.com/unicode/> has a UTF-8 decoder that can be  
>> used as a base for our own implementation.
>
> Ok. How do you suggest including it in html5lib?

Basically, we have two choices. Either we decode the UTF-8 string and
immediately re-encode it as UTF-8, which has the side effect of having
to decode it again to get a character offset for the column position,
or we decode it to an array of UTF-32 code units. The latter would mean
we simply count incrementally as we move over the array to keep track
of the column, and we would only serialize when we had to (e.g., when
interacting with the DOM).
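
A rough sketch of the second option, to make the trade-off concrete. The names utf8_to_codepoints and CodepointStream are hypothetical, and the decoder assumes well-formed UTF-8 (no validation, which a real implementation would need). It is not the decoder from the repository linked above.

```php
<?php
// Hypothetical helper: decode well-formed UTF-8 into an array of code points.
// Assumption: input is valid UTF-8; malformed sequences are not detected.
function utf8_to_codepoints($bytes) {
    $out = array();
    $len = strlen($bytes);
    for ($i = 0; $i < $len; ) {
        $b = ord($bytes[$i]);
        if ($b < 0x80)     { $cp = $b;        $extra = 0; } // 1-byte (ASCII)
        elseif ($b < 0xE0) { $cp = $b & 0x1F; $extra = 1; } // 2-byte lead
        elseif ($b < 0xF0) { $cp = $b & 0x0F; $extra = 2; } // 3-byte lead
        else               { $cp = $b & 0x07; $extra = 3; } // 4-byte lead
        for ($j = 1; $j <= $extra && $i + $j < $len; $j++) {
            // Fold in the low six bits of each continuation byte.
            $cp = ($cp << 6) | (ord($bytes[$i + $j]) & 0x3F);
        }
        $out[] = $cp;
        $i += $extra + 1;
    }
    return $out;
}

// Hypothetical stream: line and column are updated as code points are consumed,
// so no re-decoding is needed to report a position.
class CodepointStream {
    private $cps;
    private $pos = 0;
    public $line = 1;
    public $column = 0;

    public function __construct($utf8) {
        $this->cps = utf8_to_codepoints($utf8);
    }

    // Returns the next code point, or null at end of input.
    public function next() {
        if ($this->pos >= count($this->cps)) {
            return null;
        }
        $cp = $this->cps[$this->pos++];
        if ($cp === 0x0A) {
            $this->line++;
            $this->column = 0;
        } else {
            $this->column++;
        }
        return $cp;
    }
}
```

The cost is memory (one PHP integer per character) and a serialization step whenever text has to be handed back as a string.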

>> Is there any reason to actually track column normally? I can  
>> understand wanting it on parse-errors, but in that case I'd rather  
>> just calculate it on-error, and not take the cost of calculating it  
>> normally.
>
> I'm not convinced it's possible to calculate it on-error, since many
> errors happen after tokenization has already occurred. I know at least
> the Python implementation always calculates this.

Would having a method that calculated it not be equally usable as it  
is now? Remember the Python impl. has the advantage of not having to  
deal with UTF-8 in interpreted code, which helps massively on this  
(indeed, it is probably the right choice for the Python implementation).
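
To illustrate what such a method might look like, here is a sketch that computes line and column only when asked, given a byte offset into the UTF-8 input. The function name position_at is hypothetical, and it assumes the tokenizer still has the byte offset of the error available, which is the point in dispute above.

```php
<?php
// Hypothetical on-demand position lookup.
// Returns a 1-based line and a 0-based column (characters before the offset
// on that line). Assumes well-formed UTF-8 and "\n" line endings.
function position_at($utf8, $byteOffset) {
    $prefix = substr($utf8, 0, $byteOffset);
    $line = substr_count($prefix, "\n") + 1;

    // Keep only the part of the prefix that follows the last newline.
    $nl = strrpos($prefix, "\n");
    $segment = ($nl === false) ? $prefix : substr($prefix, $nl + 1);

    // Count every byte that is not a UTF-8 continuation byte (10xxxxxx);
    // that equals the number of characters in the segment.
    $column = preg_match_all('/[^\x80-\xBF]/', $segment, $unused);

    return array('line' => $line, 'column' => $column);
}
```

This only scans the input when an error is reported, so the normal path pays nothing for column tracking.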

>> I guess this would work if you could just get away with setting the  
>> locale in Tokenizer::parse and then just changing it back (how? — I  
>> see no way to get the initial value) at the end.
>
> So, I *really* wish the ctype functions would just be for the C locale,
> all the time. I think I've used them improperly in HTML Purifier too.
>
> I don't see any way to get it back either.

It's horrible, but there again a lot of PHP's (non-) design is for  
that matter. :) The worst thing I've had to do is re-implement  
strtolower()/strtoupper() to work for only ASCII characters, which  
made what I had to do dog-slow. Yay. :\
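
For reference, a sketch of both workarounds. The PHP manual documents that calling setlocale() with "0" as the locale returns the current setting without changing it, which may be enough to save and restore LC_CTYPE around a parse. The helper names with_c_ctype and ascii_strtolower are hypothetical, and the closure and finally syntax assume a newer PHP than the one this thread was written against.

```php
<?php
// Lowercase only A-Z. Every other byte, including the bytes of UTF-8
// multibyte sequences, passes through untouched whatever the locale is.
function ascii_strtolower($s) {
    return strtr($s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

// Hypothetical wrapper: run $fn with LC_CTYPE forced to "C", then restore
// whatever was set before.
function with_c_ctype($fn) {
    $saved = setlocale(LC_CTYPE, '0'); // "0" queries the current value only
    setlocale(LC_CTYPE, 'C');
    try {
        return $fn();
    } finally {
        if ($saved !== false) {
            setlocale(LC_CTYPE, $saved);
        }
    }
}
```

Using strtr() with a fixed byte map avoids a per-character loop in PHP code, which is usually where a hand-written strtolower() loses its speed.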

>>>   private function bogusCommentState() {
>>>       /* (This can only happen if the content model flag is set to
>>>          the PCDATA state.) */
>>
>> Can we add an assert to check such statements?
>
> Unlike compiled languages, where asserts can be removed at
> compile time, adding an assert would just be annoying to the end-user,
> and not really help us flush out bugs.

That's not true. See the assert.active ini option and assert_options().
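
A sketch of what the assert in question could look like. The class name, the constants, and the return value are hypothetical stand-ins, not the tokenizer's actual code. Whether the check is evaluated at all depends on configuration: assert.active as mentioned above, or the zend.assertions ini setting on PHP 7 and later.

```php
<?php
// Hypothetical stand-in for the tokenizer; names are illustrative only.
class TokenizerSketch {
    const PCDATA = 0;
    const RCDATA = 1;

    public $contentModel = self::PCDATA;

    public function bogusCommentState() {
        // Turns the spec's parenthetical note into a checked precondition.
        // With assertions disabled in the ini settings this costs end users
        // little or nothing.
        assert($this->contentModel === self::PCDATA);

        // Placeholder for the real state handling.
        return 'bogus-comment';
    }
}
```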

>> We should create elements in the HTML namespace.
>
> I thought HTML5 didn't believe in namespaces? (I suppose I haven't been
> following the WHATWG discussion closely enough).

See <http://www.whatwg.org/specs/web-apps/current-work/#insert-an-html-element>
for the HTML case. It creates all the elements in their correct
namespace in the DOM, though there is no way to explicitly set a
namespace in HTML.
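
With PHP's DOM extension, that could be sketched roughly as follows. The helper name create_html_element and the HTML_NS constant are hypothetical; the namespace URI is the one the spec assigns to HTML elements.

```php
<?php
// Namespace URI the HTML spec assigns to HTML elements.
const HTML_NS = 'http://www.w3.org/1999/xhtml';

// Hypothetical helper: create an element in the HTML namespace instead of
// using createElement(), which leaves the element without a namespace.
function create_html_element(DOMDocument $doc, $localName) {
    return $doc->createElementNS(HTML_NS, $localName);
}

$doc = new DOMDocument();
$p = create_html_element($doc, 'p');
$doc->appendChild($p);
```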

--
Geoffrey Sneddon
<http://gsnedders.com/>



