Showing content from http://mail.python.org/pipermail/python-list/2005-September/308123.html below:

Parsing an HTML a tag

beza1e1 andreas.zwinkau at googlemail.com
Sat Sep 24 14:03:53 EDT 2005
I do not really know what you want to do. Getting the URLs from the a
tags of an HTML file? I think the easiest method would be a regular
expression.

>>> import urllib, sre
>>> html = urllib.urlopen("http://www.google.com").read()
>>> sre.findall('href="([^>]+)"', html)
['/imghp?hl=de&tab=wi&ie=UTF-8',
'http://groups.google.de/grphp?hl=de&tab=wg&ie=UTF-8',
'/dirhp?hl=de&tab=wd&ie=UTF-8',
'http://news.google.de/nwshp?hl=de&tab=wn&ie=UTF-8',
'http://froogle.google.de/frghp?hl=de&tab=wf&ie=UTF-8',
'/intl/de/options/']
>>> sre.findall('href=[^>]+>([^<]+)</a>', html)
['Bilder', 'Groups', 'Verzeichnis', 'News', 'Froogle',
'Mehr »', 'Erweiterte Suche', 'Einstellungen',
'Sprachtools', 'Werbung', 'Unternehmensangebote',
'Alles \xfcber Google', 'Google.com in English']
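
One caveat about the first pattern above: `[^>]+` is greedy, so on a tag
such as `<a href="/x" class="q">` it would capture `/x" class="q` rather
than just `/x`, and it skips hrefs that are not double-quoted. A minimal
sketch of a more tolerant pattern follows. It assumes Python 3 and the
`re` module rather than the `sre` and `urllib.urlopen` names used in the
session, and `HREF_RE`, `find_hrefs` and the `sample` string are
hypothetical names made up for illustration.

```python
import re

# Accept double-quoted, single-quoted, or unquoted href values.
# Exactly one of the three capture groups is set for each match.
HREF_RE = re.compile(
    r'''href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>"']+))''',
    re.IGNORECASE,
)


def find_hrefs(html):
    """Return every href value found in the given HTML string."""
    return [
        next(group for group in match.groups() if group is not None)
        for match in HREF_RE.finditer(html)
    ]


# Hypothetical sample input, not the real Google page.
sample = (
    '<a href="/imghp?hl=de" class="q">Bilder</a> '
    '<a href=http://www.google.com/ncr>Google.com in English</a>'
)
print(find_hrefs(sample))
# ['/imghp?hl=de', 'http://www.google.com/ncr']
```

This is still only pattern matching: it will also pick up `href=` text
inside comments or scripts, so treat it as a quick hack rather than a
parser.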

Google has some strange HTML, with an href without quotation marks: <a
href=http://www.google.com/ncr>Google.com in English</a>
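
If unquoted attributes like that one are a concern, an HTML parser may be
a safer choice than a regular expression. The sketch below assumes
Python 3 and the standard library's `html.parser` module, which is not
what the session above uses. `LinkCollector`, `extract_links` and the
`sample` string are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return a list of (href, text) pairs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample input: one quoted and one unquoted href.
sample = (
    '<a href="/intl/de/options/">Mehr</a> '
    '<a href=http://www.google.com/ncr>Google.com in English</a>'
)
print(extract_links(sample))
```

The parser gives the URL and the link text in one pass, so the two
separate findall calls are not needed.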

