Status: New
Owner: ----

New issue 98 by nikolay.panov: Encoding issue: 'ascii' codec instead of appropriate one.
http://code.google.com/p/html5lib/issues/detail?id=98

This issue is related to the following sentence in the docs: "If no encoding can be found and the chardet library is available, an attempt will be made to sniff the encoding from the byte pattern"

* What steps will reproduce the problem?

>>> html=fetch_url('http://www.ixbt.com/news/soft/index.shtml?11/72/39')
>>> p = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("beautifulsoup"))
>>> soup = p.parse(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/niksite/lib/site-python/html5lib/html5parser.py", line 177, in parse
    self._parse(stream, innerHTML=False, encoding=encoding)
  File "/home/niksite/lib/site-python/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/niksite/lib/site-python/html5lib/html5parser.py", line 149, in mainLoop
    self.phase.processStartTag(token["name"], token["data"])
  File "/home/niksite/lib/site-python/html5lib/html5parser.py", line 314, in processStartTag
    self.startTagHandler[name](name, attributes)
  File "/home/niksite/lib/site-python/html5lib/html5parser.py", line 605, in startTagMeta
    data = inputstream.EncodingBytes(attributes["content"])
UnicodeEncodeError: 'ascii' codec can't encode characters in position 12-18: ordinal not in range(128)
>>> chardet.detect(html)
{'confidence': 0.94890270449856784, 'encoding': 'windows-1251'}

* What is the expected output? What do you see instead?

As we can see, chardet successfully detects the 'windows-1251' encoding of the HTML document provided. Why does html5lib try to use the 'ascii' codec?
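
For readers following along, here is a minimal sketch of the two things the report touches on: decoding raw bytes with a detected encoding (here 'windows-1251', as chardet reported), and the kind of failure the 'ascii' codec produces on non-ASCII text. It is not html5lib's code and not a confirmed fix for this issue. `decode_html` is a hypothetical helper, and the sample bytes are an invented stand-in for the fetched page. The sketch uses Python 3 and the standard library only.

```python
def decode_html(raw, detected_encoding, fallback="utf-8"):
    """Hypothetical helper: decode bytes with a detected encoding.

    Falls back to `fallback` (replacing undecodable bytes) if the detected
    encoding name is unknown or the bytes do not decode with it.
    """
    try:
        return raw.decode(detected_encoding)
    except (LookupError, UnicodeDecodeError):
        return raw.decode(fallback, errors="replace")


# Invented sample standing in for the page bytes (Cyrillic text in windows-1251).
raw = "<title>Новости</title>".encode("windows-1251")

# Assumption: the encoding name comes from a detector such as chardet.
text = decode_html(raw, "windows-1251")

# Encoding non-ASCII text with the 'ascii' codec raises UnicodeEncodeError,
# the same exception class that appears in the traceback above.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Whether passing already-decoded text to `p.parse()` avoids the error in the html5lib version from the report is not something the report confirms, so treat this only as an illustration of the decoding step.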