Showing content from https://github.com/flairNLP/CleanCoNLL below:

flairNLP/CleanCoNLL: The CleanCoNLL dataset from our EMNLP 2023 paper where we corrected annotation errors and inconsistencies in CoNLL-03.

CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset

We semi-automatically corrected annotation errors in the classic CoNLL-03 dataset for Named Entity Recognition (NER). Get our corpus CleanCoNLL -- CoNLL-03 with nearly noise-free NER annotations -- with the help of this repository!

For details of the creation and evaluation of the dataset, have a look at our EMNLP 2023 paper: Rücker and Akbik (2023)!

Details are in the paper, but in short: We leveraged the Wikipedia links from the AIDA CoNLL Yago dataset to assign an NER label to each mention in a hybrid (automatic as well as manual) relabeling approach. Furthermore, we performed several rounds of cross-checking to correct remaining errors and resolve inconsistencies. Overall, we updated 7% of the labels from the original CoNLL-03.

We keep the original tagging scheme with 4 types (PER, LOC, ORG, MISC). In addition, we add NEL (Named Entity Linking) annotations, i.e. Wikipedia links, to our annotations.

Note: As the source text base, we used the corrected corpus version by Reiss et al. (2020), as they had not only already corrected some of the label errors, but also fixed some problems with token, sentence and mention splitting.

We distribute our CleanCoNLL annotations in column format. In the annotation files the tokens are masked ([TOKEN]) for licence reasons, but you'll find a simple shell script that allows you to recreate CleanCoNLL with the help of the original CoNLL-03.
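To illustrate what unmasking means, here is a minimal Python sketch. It is not the repository's shell script, and it rests on a strong assumption: that you already have a token list aligned one-to-one with the masked rows. Raw CoNLL-03 tokens may not line up this way, because the text base includes tokenization and sentence-splitting corrections. The function name `unmask` and the sample rows are hypothetical.

```python
MASK = "[TOKEN]"  # mask string used in the distributed annotation files


def unmask(masked_lines, tokens):
    """Fill each masked row with the next token (assumes 1:1 alignment)."""
    token_iter = iter(tokens)
    restored = []
    for line in masked_lines:
        columns = line.rstrip("\n").split("\t")
        if columns[0] == MASK:
            try:
                columns[0] = next(token_iter)
            except StopIteration:
                raise ValueError("ran out of tokens before all rows were unmasked") from None
        # Blank separator lines and unmasked rows pass through unchanged.
        restored.append("\t".join(columns))
    if next(token_iter, None) is not None:
        raise ValueError("more tokens than masked rows")
    return restored


# Hypothetical masked rows in the 5-column layout, plus a sentence separator.
masked = [
    "[TOKEN]\tNNP\tB-Japan_national_football_team\tB-ORG\tB-ORG",
    "[TOKEN]\tVB\tO\tO\tO",
    "",
]
for row in unmask(masked, ["JAPAN", "GET"]):
    print(row)
```

The length checks are there so that a misaligned token source fails loudly instead of silently shifting labels onto the wrong tokens.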

Step-by-step guide:

The three resulting files are in column format with the following 5 columns, the last 3 of which use the BIO tagging scheme:

Token     POS     Wikipedia     NER (CleanCoNLL*)     NER (CleanCoNLL)

CleanCoNLL* is the CleanCoNLL version before Phase 3, i.e. before reverting the adjectival affiliations back to MISC, see paper for details.
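Since both label versions sit side by side in columns 4 and 5, the rows affected by Phase 3 can be found by comparing the two columns. The following Python sketch shows one way to do that. The function name `label_changes` is hypothetical, and the second sample row is invented purely for illustration; it is not taken from the dataset.

```python
def label_changes(rows):
    """Return (token, starred_label, final_label) where the two NER columns differ."""
    changes = []
    for row in rows:
        columns = row.rstrip("\n").split("\t")
        if len(columns) != 5:
            continue  # skip blank separator lines and anything malformed
        token, _pos, _wiki, ner_star, ner = columns
        if ner_star != ner:
            changes.append((token, ner_star, ner))
    return changes


rows = [
    "JAPAN\tNNP\tB-Japan_national_football_team\tB-ORG\tB-ORG",
    # Invented row: an adjectival affiliation whose final label is MISC.
    # The starred label shown here is made up for illustration.
    "German\tJJ\tB-Germany\tB-LOC\tB-MISC",
    "",
]
print(label_changes(rows))
```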

Sentences are separated by an empty line, articles by the -DOCSTART- token.

So, an excerpt of the dataset looks like this:

-DOCSTART-	-X-	O	O	O

SOCCER	NN	O	O	O
-	:	O	O	O
JAPAN	NNP	B-Japan_national_football_team	B-ORG	B-ORG
GET	VB	O	O	O
LUCKY	NNP	O	O	O
WIN	NNP	O	O	O
,	,	O	O	O
CHINA	NNP	B-China_national_football_team	B-ORG	B-ORG
IN	IN	O	O	O
SURPRISE	DT	O	O	O
DEFEAT	NN	O	O	O
.	.	O	O	O
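A file in this layout can be read with a few lines of Python. The sketch below assumes tab-separated columns, as in the excerpt. It groups rows into sentences at blank lines, skips the -DOCSTART- marker, and collects entity spans from the final NER column. The helper names `read_sentences` and `bio_spans` are hypothetical and not part of the repository.

```python
def read_sentences(lines):
    """Yield sentences as lists of (token, pos, wiki, ner_star, ner) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        if columns[0] == "-DOCSTART-":  # article boundary marker, not a token
            continue
        sentence.append(tuple(columns))
    if sentence:
        yield sentence


def bio_spans(tags):
    """Return (start, end_exclusive, label) for each span in a BIO tag sequence."""
    spans = []
    start = label = None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" closes an open span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start = label = None
        # A stray I- tag without a matching open span is treated leniently as a start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


excerpt = [
    "-DOCSTART-\t-X-\tO\tO\tO",
    "",
    "SOCCER\tNN\tO\tO\tO",
    "-\t:\tO\tO\tO",
    "JAPAN\tNNP\tB-Japan_national_football_team\tB-ORG\tB-ORG",
    "GET\tVB\tO\tO\tO",
    "LUCKY\tNNP\tO\tO\tO",
    "WIN\tNNP\tO\tO\tO",
    ",\t,\tO\tO\tO",
    "CHINA\tNNP\tB-China_national_football_team\tB-ORG\tB-ORG",
]
for sentence in read_sentences(excerpt):
    tokens = [row[0] for row in sentence]
    for start, end, label in bio_spans(row[4] for row in sentence):
        print(" ".join(tokens[start:end]), label)
# prints:
# JAPAN ORG
# CHINA ORG
```

Passing `row[2]` or `row[3]` instead of `row[4]` gives the Wikipedia links or the CleanCoNLL* labels, since all three columns use the BIO scheme.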

If you use CleanCoNLL or find our approach useful, please cite our work.

@inproceedings{rucker-akbik-2023-cleanconll,
    title = "{C}lean{C}o{NLL}: A Nearly Noise-Free Named Entity Recognition Dataset",
    author = {R{\"u}cker, Susanna  and Akbik, Alan},
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.533",
    doi = "10.18653/v1/2023.emnlp-main.533",
    pages = "8628--8645",
}
@misc{rücker2023cleanconll,
      title={{C}lean{C}o{NLL}: A Nearly Noise-Free Named Entity Recognition Dataset}, 
      author={Susanna R{\"u}cker and Alan Akbik},
      year={2023},
      eprint={2310.16225},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
