On 12/10/2011 4:32 PM, Glyph Lefkowitz wrote:
> On Dec 10, 2011, at 2:38 AM, Stefan Behnel wrote:
>
>> Note, however, that html5lib is likely way too big to add it to the
>> stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML
>> in Python 3, which would be the target release series for better HTML
>> support. So, whatever library or API you would want to use for HTML
>> processing is currently only the second question as long as Py3 lacks
>> a real-world HTML parser in the stdlib, as well as a robust character
>> detection mechanism. I don't think that can be fixed all that easily.
>
> Here's the problem in a nutshell, I think:
>
> 1. Everybody wants an HTML parser in the stdlib, because it's
>    inconvenient to pull in a dependency for such a "simple" task.
> 2. Everybody wants the stdlib to remain small, stable, and simple and
>    not get "overcomplicated".
> 3. Parsing arbitrary HTML5 is a monstrously complex problem, for which
>    there exist rapidly-evolving standards and libraries to deal with
>    it. Parsing 'the web' (which is rapidly growing to include stuff
>    like SVG, MathML etc) is even harder.
>
>
> My personal opinion is that HTML5Lib gets this problem almost completely
> right, and so it should be absorbed by the stdlib.

A little data: the HTML5lib project lives at
https://code.google.com/p/html5lib/
It has 4 owners and 22 other committers.

The most recent release, html5lib 0.90 for Python, is nearly 2 years
old. Since there is a separate Python3 repository, and there is no
mention of Python3 compatibility elsewhere that I saw, including the
PyPI listing, I assume that release is for Python2 only.

A comment on a recent (July 11) Python3 issue
https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port
suggests that the Python3 version still has problems:
"Merged in now, though still lots of errors and failures in the testsuite."

--
Terry Jan Reedy