On Tue, 17 Apr 2001, Mark Pilgrim wrote:

> Well, you wouldn't be the first person to tell me that. <0.5 wink>
>

thanks for the expanded reply. However, I still am just not getting SGMLParser

> For those not familiar with how SGMLParser works, it will call this method
> with an HTML tag ("tag", a string) and the attributes of the tag ("attrs", a

I've tried again with a formulation from Guido's intro to web programming.
Here's the error..

=====================================
Traceback (most recent call last):
  File "./html3", line 46, in ?
    htmlbuffer.feed(buffer)
  File "/usr/local/lib/python1.6/sgmllib.py", line 82, in feed
    self.rawdata = self.rawdata + data
TypeError: illegal argument type for built-in operation
===================================

I grabbed the rpm for python 1.6. I'm so new to the language that I didn't
see why 2.x would help. I'm still trying to overcome years of Rexx.

anyway, comments appreciated.

====================================
#!/usr/local/bin/python
# first test to open web pages using urlopen2

import sys
from sgmllib import SGMLParser

class HtmlBody(SGMLParser):
    def __init__(self):
        self.links = []
        self.body = ()
        SGMLParser.__init__(self)

    def do_body(self, attrs):
        for (name, value) in attrs:
            if name == "body":
                value = value
                if value: self.body = value
            if name == "href":
                value = cleanlink(value)
                if value: self.links.append(value)

    def getlinks(self):
        return self.links

def cleanlink(link):
    i = string.find(link, '#')
    if i >= 0: link = link[:i]
    words = string.split(link)
    string.join(words, "")

if __name__ == '__main__':
#    print sys.argv[1:]
    try:
        f = open("dean.html")
    except IOError:
        print "couldn't open ", sys.argv[1:]
        sys.exit(1)
    buffer = ""
    htmlbuffer = HtmlBody()
    buffer = f.readlines()
    f.close()
    htmlbuffer.feed(buffer)
    htmlbuffer.close()
    body = htmlbuffer.do_body
    links = htmlbuffer.getlinks
    print body
#    print %s %links

>
> - Suppose the original tag is '<a href="index.html" title="Go to home
> page">'
> - The method will be called with tag='a' and attrs=[('href', 'index.html'),
> ('title', 'Go to home page')]
> - The list comprehension will produce a list of 2 elements: ['
> href="index.html"', ' title="Go to home page"']
> - strattrs will be ' href="index.html" title="Go to home page"'
> - The string appended to self.parts will be '<a href="index.html" title="Go
> to home page">', which is what we want.
>
> Other than using string.join(..., "") instead of "".join(...) -- a topic
> which has been beaten to death recently on this newsgroup and which I
> address explicitly in my book
> (http://diveintopython.org/odbchelper_join.html) -- how would you rewrite
> this?
>
> -M
> You're smart; why haven't you learned Python yet?
> http://diveintopython.org/
> Now in Chinese! http://diveintopython.org/cn/
>
>
>

David Bear
College of Public Programs/ASU
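
A note on the traceback above: the failing line in sgmllib.py is
self.rawdata + data, which concatenates strings, while f.readlines() returns
a list of lines. That mismatch is the most likely cause of the TypeError;
f.read() returns a single string instead. The sketch below is not the posted
script. It assumes a modern Python 3, where sgmllib no longer exists, so
html.parser.HTMLParser stands in for SGMLParser, and the names LinkCollector
and SAMPLE are made up for illustration. The string-versus-list point carries
over unchanged.

```python
# Sketch only: html.parser.HTMLParser stands in for sgmllib.SGMLParser,
# which was removed in Python 3. LinkCollector and SAMPLE are hypothetical.
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, dropping any '#fragment' part."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, as with SGMLParser.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                link = value.split("#", 1)[0].strip()
                if link:  # skip links that were only a fragment
                    self.links.append(link)


SAMPLE = ('<body bgcolor="white"><a href="index.html#top">home</a>\n'
          '<a href="#x">skip</a></body>')

lines = SAMPLE.splitlines(True)  # what readlines() gives: a list of strings
text = "".join(lines)            # what read() gives: one string

# Feeding the list reproduces the same kind of failure as in the traceback.
list_feed_failed = False
try:
    LinkCollector().feed(lines)
except TypeError as exc:
    list_feed_failed = True
    print("feeding a list fails:", exc)

# Feeding the string works.
parser = LinkCollector()
parser.feed(text)
parser.close()
print(parser.links)
```

Three smaller things in the posted script would surface once feed() gets a
string: cleanlink() uses the string module without importing it and never
returns its result, htmlbuffer.getlinks is missing its call parentheses, and
do_body is only called for <body> tags, so href attributes on <a> tags never
reach it.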