"Fredrik Lundh" <fredrik at pythonware.com> writes: [...] > or, if you're going to parse HTML pages from many different sources, a > real parser: > > from HTMLParser import HTMLParser > > class MyHTMLParser(HTMLParser): > > def handle_starttag(self, tag, attrs): > if tag == "a": > for key, value in attrs: > if key == "href": > print value > > p = MyHTMLParser() > p.feed(text) > p.close() > > see: > > http://docs.python.org/lib/module-HTMLParser.html > http://docs.python.org/lib/htmlparser-example.html > http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html It's worth noting that module HTMLParser is less tolerant of the bad HTML you find in the real world than is module sgmllib, which has a similar interface. There are also third party libraries like BeautifulSoup and mxTidy that you may find useful for parsing "HTML as deployed" (ie. bad HTML, often). Also, htmllib is an extension to sgmllib, and will do your link parsing with even less effort: import htmllib, formatter, urllib2 pp = htmllib.HTMLParser(formatter.NullFormatter()) pp.feed(urllib2.urlopen("http://python.org/").read()) print pp.anchorlist Module HTMLParser does have better support for XHTML, though. John