On 30Jul2018 13:40, Bob Purvy <bpurvy at gmail.com> wrote:
>I've been trying to figure out how to access the archives programmatically.
>I'm sure this is easy once you know, but googling various things hasn't
>worked. What I want to do is graph the number of messages about PEP 572 by
>time. (or has someone already done that?)
>
>I installed GNU Mailman, and downloaded the gzip'ed archives for a number
>of months and unzipped them, and I suspect that there's some way to get
>them all into a single database, but it hasn't jumped out at me. If I
>count the "Message-ID" lines, the "Subject:" lines, and the "\nFrom " lines
>in one of those text files, I get slightly different numbers for each.
>
>Alternatively, they're maybe *already* in a database, and I just need API
>access to do the querying? Can someone help me out?

Like Victor, I download mailing list archives. Between pulling them in and
also subscribing, ideally I get a complete history in my "python" mail
folder. Likewise for other lists.

The mailman archives are UNIX mbox files, compressed, with a bit of header
munging (to make address harvesting harder). You can concatenate them and
uncompress and reverse the munging like this:

  cat *.gz | gunzip | fix-mail-dates --mbox | un-at-

where fix-mail-dates is here:

  https://bitbucket.org/cameron_simpson/css/src/tip/bin/fix-mail-dates

and un-at- is here:

  https://bitbucket.org/cameron_simpson/css/src/tip/bin/un-at-

and the output is a nice UNIX mbox file.

You can load that into most mail readers or parse it with Python's email
modules (in the stdlib). It should be easy enough to scan such a thing and
count header contents etc.

Ignore the "From " line content, prefer the "From:" header. (Separate
messages on "From " of course, just don't grab email addresses from it.)

Cheers,
Cameron Simpson <cs at cskk.id.au>
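
The scanning and counting step described above might look roughly like the
sketch below. It is only an illustration, not a tested tool: the function
name count_by_day, its needle parameter, and the choice to match on the
"Subject:" header are all assumptions made for the example (a thread can
discuss PEP 572 under an unrelated subject, so this will undercount). It
uses only the stdlib mailbox and email.utils modules, lets mailbox.mbox do
the splitting on "From " lines, and reads only real headers.

```python
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime


def count_by_day(mbox_path, needle="PEP 572"):
    """Count messages per calendar day whose Subject: contains `needle`.

    `count_by_day` and `needle` are hypothetical names for this sketch.
    Returns a Counter mapping ISO dates ("YYYY-MM-DD") to message counts.
    """
    counts = Counter()
    # mailbox.mbox separates messages on the "From " lines for us;
    # create=False avoids silently creating a missing file.
    for msg in mailbox.mbox(mbox_path, create=False):
        # Collapse whitespace so a folded (multi-line) header still matches.
        subject = " ".join(str(msg.get("Subject", "")).split())
        if needle.lower() not in subject.lower():
            continue
        try:
            when = parsedate_to_datetime(str(msg.get("Date", "")))
        except (TypeError, ValueError):
            # Missing or unparsable Date: header; skip rather than guess.
            continue
        # The day is taken in the sender's own timezone, as written in Date:.
        counts[when.date().isoformat()] += 1
    return counts
```

Assuming the pipeline output was saved to a file (python-dev.mbox here is a
made-up name), sorted(count_by_day("python-dev.mbox").items()) gives
(date, count) pairs ready to hand to whatever plotting tool you prefer.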