"Fredrik Lundh" <fredrik at pythonware.com> writes: > well, the examples in your PEP can be written as: > > data = [line[:-1].split(":") for line in open(filename)] Yes, in practice I would write this, too. The example was for pedagogical purposes, but perhaps the fact that it's not particularly useful in practice makes it a bad example. > and > > import ConfigParser > > c = ConfigParser.ConfigParser() > c.read(filename) > > data = [] > for section in c.sections(): > data.append((section, c.items(section))) > > both of which are shorter than your structparse examples. Hmm. In this case it doesn't seem fair to compare a call to ConfigParser, rather than the code in the ConfigParser module itself (or at least the subset of it that would provide the equivalent functionality). I used this as an example because I thought most people would be familiar with this file format, thus saving them having to figure out some new file format in order to follow the PEP. In practice, there are lots of other file formats of similar complexity that are not handled by any such special purpose module, and structmatch would make it easy to parse them. For example, this "SQT" format is output by a certain mass spec analysis program: http://fields.scripps.edu/sequest/SQTFormat.html There are a number of other bioinformatics programs whose output unfortunately must be scraped at present. The structmatch feature would also be useful for these cases. (This is what motivated the PEP.) > and most of the one-liners in your pre-PEP can be handled with a > combination of "match" and "finditer". I think "findall" and "finditer" are almost useless for this kind of thing, as they are essentially "searching" rather than "matching". That is, they'll happily, silently skip over garbage to get to something they like. Since finditer returns matches, you can always inspect the match to determine whether anything was skipped, but this seems kind of lame compared to just doing the right thing in the first place (i.e., matching). > here's a 16-line helper that > parses strings matching the "a(b)*c" pattern into a prefix/list/tail tuple. > > import re > > def parse(string, pat1, pat2): > """Parse a string having the form pat1(pat2)*""" > m = re.match(pat1, string) > i = m.end() > a = m.group(1) > b = [] > for m in re.compile(pat2 + "|.").finditer(string, i): > try: > token = m.group(m.lastindex) > except IndexError: > break > b.append(token) > i = m.end() > return a, b, string[i:] > > >>> parse("hello 1 2 3 4 # 5", "(\w+)", "\s*(\d+)") > ('hello', ['1', '2', '3', '4'], ' # 5') No offense, but this code makes me cringe. The "|." trick seems like a horrific hack, and I'd need to stare at this code for quite a while to convince myself that it doesn't have some subtle flaw. And even then it just handles the "a(b)*c" case. It seems like the code for more complex patterns parsed this way would just explode in size, and would have to be written custom for each pattern. Mike