egrep " [A-Z][a-z ]+.*[A-Z].* (19|20)[0-9][0-9] " ~/a/ot/repo/reference-taxonomy/tax/ncbi/taxonomy.tsv | fgrep -v "." >tmp.tmp
(should also work with the names.txt file that ships with NCBI)
This yields 72 results, many of which are parsed incorrectly. Unfortunately any rules you make for heuristically dealing with these are going to be baroque, and increasingly so as you try to get the false positive and false negative rates down. So I really don't expect this issue to be addressed. But I thought you should know.
Examples:
Kuraishia capsulata CBS 1993
(a strain; 1993 is not a year)
Leishmania donovani Ld 2001
(2001 is a strain; Ld is short for 'L. donovani', not an authority)
But I'm also impressed by how many gnparse gets right, e.g.
Bat coronavirus China 2005
(China is not an author)
Lumpy skin disease virus Nigeria 1996
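In case anyone wants to rerun the check: recent GNU grep releases print obsolescence warnings for egrep and fgrep, so here is a rough sketch of the same filter using grep -E and grep -F. The function name find_year_like_names is made up for illustration; the two patterns are copied verbatim from the command at the top, including the literal spaces, so whether they match depends on how your copy of the file delimits the name column.

```shell
# Hypothetical helper; the patterns are taken unchanged from the command above.
find_year_like_names() {
  # Keep lines that have a capitalized word, a later capital letter,
  # and a space-delimited token shaped like a year (1900-2099).
  grep -E ' [A-Z][a-z ]+.*[A-Z].* (19|20)[0-9][0-9] ' "$1" |
    # Drop any line containing a literal period (e.g. "et al.", "L.").
    grep -F -v '.'
}
```

Usage would be something like `find_year_like_names taxonomy.tsv >tmp.tmp`, with the path adjusted to wherever the taxonomy file lives.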