I have uploaded a patch to address the performance problem that excors and graham described in http://code.google.com/p/html5lib/issues/detail?id=35. The data structure (list) used to keep unread characters has O(n) cost for insertion to/deletion from the head of the list, so frequent unget() (which happens when parsing a long sequence of number entities) can result in poor performance. Changing the data structure to deque, a queue object with O(1) insertion/deletion cost at the head, improves the performance in general and makes char()/unget() scales better with long input size. deque is a python library object in python 2.4 and after.
The running time of excors's code is still not linear to the input size, but the additional overhead is mostly from appendChild() in simpletree and is not related to HTMLInputStream. I also started another issue (http://code.google.com/p/html5lib/issues/ detail?id=53) with a patch for an alternative implementation of position() and lineLengths. Follow the general idea in the comment, I keep a buffer of read characters and compute the col, line, lineLengths when the corresponding functions are called. Even though the overhead is large (a buffer of input size), but it should be correct. Removing position tracking from each call to char()/unget()/ charsUntil() also improve the performance. --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "html5lib-discuss" group. To post to this group, send email to html5lib-discuss@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/html5lib-discuss?hl=en-GB -~----------~----~----~----~------~----~------~--~---
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4