I'm not really sure what you want to do. Do you want to get the URLs from the <a> tags of an HTML file? I think the easiest method would be a regular expression.

>>> import urllib, sre
>>> html = urllib.urlopen("http://www.google.com").read()
>>> sre.findall('href="([^>]+)"', html)
['/imghp?hl=de&tab=wi&ie=UTF-8', 'http://groups.google.de/grphp?hl=de&tab=wg&ie=UTF-8', '/dirhp?hl=de&tab=wd&ie=UTF-8', 'http://news.google.de/nwshp?hl=de&tab=wn&ie=UTF-8', 'http://froogle.google.de/frghp?hl=de&tab=wf&ie=UTF-8', '/intl/de/options/']
>>> sre.findall('href=[^>]+>([^<]+)</a>', html)
['Bilder', 'Groups', 'Verzeichnis', 'News', 'Froogle', 'Mehr »', 'Erweiterte Suche', 'Einstellungen', 'Sprachtools', 'Werbung', 'Unternehmensangebote', 'Alles \xfcber Google', 'Google.com in English']

Google has some strange HTML: some href attributes have no quotation marks, so the first pattern (which requires them) misses those links. For example:

<a href=http://www.google.com/ncr>Google.com in English</a>
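If the regular expressions turn out to be too fragile (unquoted attributes, extra attributes after the href, and so on), an HTML parser might be a more robust route. The following is only a sketch, assuming Python 3 and the standard-library html.parser module. The names LinkCollector and extract_links are made up for illustration, and the sample string just stands in for a downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect (href, text) pairs for <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quoted and unquoted
        # attribute values both arrive here without their quotes.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Stand-in for a real page: one quoted and one unquoted href.
sample = (
    '<a href="/intl/de/options/">Mehr</a> '
    '<a href=http://www.google.com/ncr>Google.com in English</a>'
)
for href, text in extract_links(sample):
    print(href, "->", text)
```

Because the parser does the tokenizing, the quoted and unquoted links are handled by the same code path, and the URL and the link text stay paired instead of coming from two separate findall calls.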