
Tue Apr 17 22:48:50 EDT 2001 · http://mail.python.org/pipermail/python-list/2001-April/092039.html

On Tue, 17 Apr 2001, Mark Pilgrim wrote:

> Well, you wouldn't be the first person to tell me that. <0.5 wink>
> 
Thanks for the expanded reply. However, I am still just not getting
SGMLParser.

> For those not familiar with how SGMLParser works, it will call this method
> with an HTML tag ("tag", a string) and the attributes of the tag ("attrs", a

I've tried again with a formulation from Guido's intro to web
programming.  Here's the error:

=====================================
Traceback (most recent call last):
  File "./html3", line 46, in ?
    htmlbuffer.feed(buffer)
  File "/usr/local/lib/python1.6/sgmllib.py", line 82, in feed
    self.rawdata = self.rawdata + data
TypeError: illegal argument type for built-in operation

===================================

I grabbed the RPM for Python 1.6.  I'm so new to the language that I
didn't see why 2.x would help.  I'm still trying to overcome years of
Rexx.  Anyway, comments appreciated.

====================================
#!/usr/local/bin/python
# first test to open web pages using urlopen2
import sys
from sgmllib import SGMLParser

class HtmlBody(SGMLParser):

	def __init__(self):
		self.links = []
		self.body = ()
		SGMLParser.__init__(self)

	def do_body(self, attrs):
		for (name, value) in attrs:
			if name == "body":
				value = value
				if value:
					self.body = value
			if name == "href":
				value = cleanlink(value)
				if value:
					self.links.append(value)

	def getlinks(self):
		return self.links

	def cleanlink(link):
		i = string.find(link, '#')
		if i >= 0:
			link = link[:i]
		words = string.split(link)
		string.join(words, "")

if __name__ == '__main__':
#	print sys.argv[1:]
	try:
		f = open("dean.html")
	except IOError:
		print "couldn't open ", sys.argv[1:]
		sys.exit(1)
	buffer = ""
	htmlbuffer = HtmlBody()
	buffer = f.readlines()
	f.close()
	htmlbuffer.feed(buffer)
	htmlbuffer.close()
	body = htmlbuffer.do_body
	links = htmlbuffer.getlinks
	print body
#	print %s %links

> 
> - Suppose the original tag is '<a href="index.html" title="Go to home
> page">'
> - The method will be called with tag='a' and attrs=[('href', 'index.html'),
> ('title', 'Go to home page')]
> - The list comprehension will produce a list of 2 elements: ['
> href="index.html"', ' title="Go to home page"']
> - strattrs will be ' href="index.html" title="Go to home page"'
> - The string appended to self.parts will be '<a href="index.html" title="Go
> to home page">', which is what we want.
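
The walk-through quoted above can be reproduced in a few lines. This is a
minimal sketch assuming the attribute formatting it describes; the variable
names pieces and start_tag are illustrative and not taken from the quoted
code, which appends the result to self.parts.

```python
tag = "a"
attrs = [("href", "index.html"), ("title", "Go to home page")]

# One ' name="value"' fragment per attribute, each with a leading space.
pieces = [' %s="%s"' % (key, value) for key, value in attrs]

# Join the fragments with an empty separator to form the attribute string.
strattrs = "".join(pieces)

# Rebuild the original start tag from the tag name and its attributes.
start_tag = "<%s%s>" % (tag, strattrs)
```

With string.join(pieces, "") in place of "".join(pieces), the same steps
should work on the older Python versions discussed in this thread.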
> 
> Other than using string.join(..., "") instead of "".join(...) -- a topic
> which has been beaten to death recently on this newsgroup and which I
> address explicitly in my book
> (http://diveintopython.org/odbchelper_join.html) -- how would you rewrite
> this?
> 
> -M
> You're smart; why haven't you learned Python yet?
> http://diveintopython.org/
> Now in Chinese!  http://diveintopython.org/cn/
> 
> 
> 
> 

David Bear
College of Public Programs/ASU

syntax from diveintopython