flairNLP/CleanCoNLL: The CleanCoNLL dataset from our EMNLP 2023 paper where we corrected annotation errors and inconsistencies in CoNLL-03.

CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset

We semi-automatically corrected annotation errors in the classic CoNLL-03 dataset for Named Entity Recognition (NER). Get our corpus CleanCoNLL -- CoNLL-03 with nearly noise-free NER annotations -- with the help of this repository!

For details of the creation and evaluation of the dataset, have a look at our EMNLP 2023 paper: Rücker and Akbik (2023)!

Details are in the paper, but in short: We leveraged the Wikipedia links from the AIDA CoNLL Yago dataset to assign NER labels to each mention in a hybrid (automatic as well as manual) relabeling approach. Furthermore, we performed several rounds of cross-checking to correct remaining errors and resolve inconsistencies. Overall, we updated 7% of the labels from the original CoNLL-03.

We keep the original tagging scheme with 4 types (PER, LOC, ORG, MISC). We also add NEL (Named Entity Linking) annotations, i.e. Wikipedia links, to our annotations.

Note: As the source text base, we used the corrected corpus version by Reiss et al. (2020) (paper, repo), as they not only already fixed some of the label errors, but also corrected some problems with token, sentence and mention splitting.

We distribute our CleanCoNLL annotations in column format. In the annotation files the tokens are masked ([TOKEN]) for licence reasons, but you'll find a simple shell script that allows you to recreate CleanCoNLL with the help of the original CoNLL-03.

Step-by-step guide:

1. Clone this repository.
2. Inside /data/cleanconll_annotations you can find our masked annotation files (cleanconll_annotations.train, cleanconll_annotations.dev, cleanconll_annotations.test).
3. Inside /data/patch_files you can find patch files that represent the updates in the text base between the original CoNLL-03 and CleanCoNLL. These are needed for merging our annotations into the original corpus!
4. You do not have to apply them by hand, however: you simply need to run our script
```
chmod u+x create_cleanconll_from_conll03.sh
bash create_cleanconll_from_conll03.sh
```
which will:
- download the original CoNLL-03 corpus
- apply the patch files to the original CoNLL-03 text to align the text base before merging our annotations
- create the three CleanCoNLL files with text and annotations and place them inside /data/cleanconll.
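
To give a feeling for what the merge step amounts to, here is a minimal Python sketch. It is not the code of create_cleanconll_from_conll03.sh. It assumes that the masked annotation lines are tab-separated with `[TOKEN]` in the first column, and that the token list has already been aligned with the patch files. The names `MASK` and `unmask` are made up for this illustration.

```python
MASK = "[TOKEN]"  # placeholder used in the masked annotation files


def unmask(masked_lines, tokens):
    """Fill masked first columns with tokens, in order.

    Assumption: each non-empty line is tab-separated and starts with MASK.
    Lines whose first column is not MASK are passed through unchanged.
    """
    token_iter = iter(tokens)
    merged = []
    for raw in masked_lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            merged.append("")  # keep sentence boundaries as empty lines
            continue
        first, sep, rest = line.partition("\t")
        if first == MASK:
            first = next(token_iter, None)
            if first is None:
                raise ValueError("fewer tokens than masked positions")
        merged.append(first + sep + rest)
    if next(token_iter, None) is not None:
        raise ValueError("more tokens than masked positions")
    return merged
```

The length checks matter here: if the text base and the annotations are not aligned token by token, every label after the first mismatch would end up on the wrong token, which is why the patch files are applied first.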

The three files are in column format with the following 5 columns, the last 3 of which use the BIO tagging scheme:

Token     POS     Wikipedia     NER (CleanCoNLL*)     NER (CleanCoNLL)

CleanCoNLL* is the CleanCoNLL version before Phase 3, i.e. before reverting the adjectival affiliations back to MISC, see paper for details.
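
Because both versions sit side by side in the last two columns, the tokens affected by Phase 3 can be found by comparing them. A minimal Python sketch, assuming each row is already split into its 5 columns; the function name `phase3_changes` is hypothetical, and the example row is made up for illustration (it is not taken from the dataset, and its pre-Phase-3 label is a guess).

```python
def phase3_changes(rows):
    """Return (token, tag_before, tag_after) for rows whose two NER columns differ.

    Each row is (token, pos, wikipedia, ner_cleanconll_star, ner_cleanconll).
    """
    changes = []
    for token, _pos, _wiki, ner_star, ner_final in rows:
        if ner_star != ner_final:
            changes.append((token, ner_star, ner_final))
    return changes


# Made-up rows, only to show the call; real rows come from the generated files.
example_rows = [
    ("German", "JJ", "B-Germany", "B-LOC", "B-MISC"),
    ("win", "NN", "O", "O", "O"),
]
print(phase3_changes(example_rows))
```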

Sentences are separated by an empty line, articles by the -DOCSTART- token.

So, an excerpt of the dataset looks like this:

-DOCSTART-	-X-	O	O	O

SOCCER	NN	O	O	O
-	:	O	O	O
JAPAN	NNP	B-Japan_national_football_team	B-ORG	B-ORG
GET	VB	O	O	O
LUCKY	NNP	O	O	O
WIN	NNP	O	O	O
,	,	O	O	O
CHINA	NNP	B-China_national_football_team	B-ORG	B-ORG
IN	IN	O	O	O
SURPRISE	DT	O	O	O
DEFEAT	NN	O	O	O
.	.	O	O	O
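
If you want to load the generated files yourself, a reader for this format can be sketched in a few lines of Python. This is an illustration, not code shipped with the repository: it assumes the columns are tab-separated as in the excerpt above, and the names `read_documents` and `bio_spans` are our own.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, ...]  # (token, pos, wikipedia, ner_star, ner)


def read_documents(lines: Iterable[str]) -> List[List[List[Row]]]:
    """Group rows into documents (split at -DOCSTART-) and sentences (split at empty lines)."""
    documents: List[List[List[Row]]] = []
    sentence: List[Row] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            if sentence:  # an empty line closes the current sentence
                documents[-1].append(sentence)
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"expected 5 tab-separated columns, got {len(fields)}: {line!r}")
        if fields[0] == "-DOCSTART-":
            if sentence:
                documents[-1].append(sentence)
                sentence = []
            documents.append([])  # start a new article
            continue
        if not documents:  # tolerate input without a leading -DOCSTART-
            documents.append([])
        sentence.append(tuple(fields))
    if sentence:
        documents[-1].append(sentence)
    return documents


def bio_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Turn BIO tags into (start, end_exclusive, label) spans.

    A stray I- tag without a matching open span is treated as a span start.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")  # split at the first hyphen only
        if prefix == "I" and start is not None and name == label:
            continue  # span continues
        if start is not None:
            spans.append((start, i, label))
            start, label = None, None
        if prefix in ("B", "I"):
            start, label = i, name
    if start is not None:
        spans.append((start, len(tags), label))
    return spans
```

For the headline sentence in the excerpt, `bio_spans` applied to the last column yields two single-token ORG spans, for JAPAN and CHINA. The same function works on the Wikipedia column, where the span label is the linked page title.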

If you use CleanCoNLL or find our approach useful, please cite our work.

In the EMNLP 2023 Proceedings:

@inproceedings{rucker-akbik-2023-cleanconll,
    title = "{C}lean{C}o{NLL}: A Nearly Noise-Free Named Entity Recognition Dataset",
    author = {R{\"u}cker, Susanna  and Akbik, Alan},
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.533",
    doi = "10.18653/v1/2023.emnlp-main.533",
    pages = "8628--8645",
}

On arXiv:

@misc{rücker2023cleanconll,
      title={{C}lean{C}o{NLL}: A Nearly Noise-Free Named Entity Recognition Dataset}, 
      author={Susanna R{\"u}cker and Alan Akbik},
      year={2023},
      eprint={2310.16225},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
