"Jennifer Jeng" <jeng at cliffie.nosc.mil> wrote in message news:9ai3vs$mok$1 at newpoisson.nosc.mil... > Hi, > > Is there a way to search for a specific html tags and remove the begin, end > tags include its content (begin tag can go multiple lines to the end tag)? > > example: > > replace: > > this text is before the begin title tags and <title>dddfdfdkfdjfdjfldjfl > dkfdlfjdlf jdkjfdk > djkfd > dfdkfdkf</title> this text is > at the end of the > end title tag > > to: > > this text is before the begin title tags and this text is > at the end of the > end title tag Not clear exactly what tags qualify here, but, assuming you want to remove everything that IS between ANY tags, and only leave what is not, this might work: import sgmllib class afilter(sgmllib.SGMLParser): def __init__(self): sgmllib.SGMLParser.__init__(self) self.inTag = 0 self.data = [] def unknown_starttag(self, tag, attributes): self.inTag += 1 def unknown_endtag(self, tag): self.inTag -= 1 def handle_data(self, data): if self.inTag: return self.data.append(data) if __name__=='__main__': sometext = """ this text is before the begin title tags and <title>dddfdfdkfdjfdjfldjfl dkfdlfjdlf jdkjfdk djkfd dfdkfdkf</title> this text is at the end of the end title tag""" filt = afilter() filt.feed(sometext) filt.close() print ''.join(filt.data) The only difference wrt your desired output in your example is that, of course, TWO spaces will be between 'and' and 'this', since that is the number of spaces outside of tags in the string being processed:-). Alex