On Nov 3, 2009, at 12:06 AM, Guido van Rossum wrote:

> On Mon, Nov 2, 2009 at 9:51 PM, ssteinerX at gmail.com <ssteinerx at gmail.com> wrote:
>> BeautifulSoup, which I use every day, is one such product. Since the crappy
>> old SGML parser's gone, BeautifulSoup uses the one that's left in Python 3
>> and it makes BeautifulSoup completely useless for my daily work.
>
> This sounds like an area where some help might be useful. Perhaps the
> quickest solution would simply be to copy the old crappy "sgml" based
> html parser into a new version of BeautifulSoup.

That is what we're discussing doing on the old-soup branch at
http://github.com/adevore/old-beautiful-soup .

I'm not exactly sure why the old SGML parser was dropped, but it seems
that porting it to Python 3 would have been enough of an effort that the
Python library dropped it, and the current developer of the mainline of
Beautiful Soup decided to just use what was available natively in
Python 3.

> Though I imagine what it really needs is a "quirks mode" parser that
> is compatible with the HTML dialect accepted by, say, IE6. Maybe a
> summer of code project?

I think it just relies on the old SGML parser not blowing up on
completely bogus HTML (like most of the web) and doing the best it can
with the 'chunks' that come back; nothing to do with quirks mode per se.

As for a Summer of Code project, I have no idea what would be involved.
I know there are lots of users of Beautiful Soup; as far as I know it is
the best scraper of HTML, valid or not, that's out there. It's been
around a long time, and I see it in projects in the "html scraping"
realm all the time.

At any rate, it's just one example of where the developer has taken the
easy route with a 3.0 port and has produced a product that's "Python 3"
but, instead of getting better with Python's new features, has actually
become useless for the majority of use cases where it formerly shone.

S