Release 0.11 is taking us ever closer to that 1.0 release! This release includes large internal refactorings and code quality / efficiency improvements in preparation for Flair 1.0. We also add new features such as text clustering, a regular expression tagger, more dataset manipulation options, and some preview features like a prototype decoder.
## New Features

### Regular Expression Tagger (#2533)

You can now do sequence labeling in Flair with regular expressions! Simply define a `RegexpTagger` and add some regular expressions, like in the example below:
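The patterns in the example below are ordinary Python regular expressions; as a standalone sketch (independent of Flair), you can sanity-check the quote and number patterns with the `re` module:

```python
import re

# the same quote- and number-matching patterns used in the Flair example below
QUOTE_PATTERN = r'(["\'])(?:(?=(\\?))\2.)*?\1'
NUMBER_PATTERN = r'\b\d+\b'

text = 'Figure 11 is both "too colorful" and "not informative enough".'

quotes = [m.group(0) for m in re.finditer(QUOTE_PATTERN, text)]
numbers = re.findall(NUMBER_PATTERN, text)

print(quotes)   # ['"too colorful"', '"not informative enough"']
print(numbers)  # ['11']
```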
```python
# sentence with a number and two quotes
sentence = Sentence('Figure 11 is both "too colorful" and "not informative enough".')

# instantiate regex tagger with a quote matching pattern
tagger = RegexpTagger(mapping=(r'(["\'])(?:(?=(\\?))\2.)*?\1', 'QUOTE'))

# also add a number mapping
tagger.register_labels(mapping=(r'\b\d+\b', 'NUMBER'))

# tag sentence
tagger.predict(sentence)

# check out matches
for entity in sentence.get_labels():
    print(entity)
```

### Clustering with Flair (#2573 #2619)
Flair now supports clustering via sklearn. Embed your sentences with a pre-trained embedding like below, then cluster them with any algorithm. Check the example below, where we use sentence transformers and k-means clustering. A 'trained' clustering model can be saved and loaded for prediction, just like any other Flair classifier:
```python
from sklearn.cluster import KMeans

from flair.data import Sentence
from flair.datasets import TREC_6
from flair.embeddings import SentenceTransformerDocumentEmbeddings
from flair.models import ClusteringModel

embeddings = SentenceTransformerDocumentEmbeddings()

# store all embeddings in memory, which is required to perform clustering
corpus = TREC_6(memory_mode='full').downsample(0.05)

clustering_model = ClusteringModel(model=KMeans(n_clusters=6), embeddings=embeddings)

# fit the model on a corpus
clustering_model.fit(corpus)

# save the model
clustering_model.save(model_file="clustering_model.pt")

# load saved clustering model
model = ClusteringModel.load(model_file="clustering_model.pt")

# make example sentence
sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')

# predict for sentence
model.predict(sentence)

# print sentence with prediction
print(sentence)
```

### Dataset Manipulations
You can now change label names, ignore labels and add custom preprocessing when loading a dataset.
For instance, the standard WNUT_17 dataset comes with 7 NER labels:
```python
corpus = WNUT_17(in_memory=False)
print(corpus.make_label_dictionary('ner'))
```
which prints:
```
Dictionary with 7 tags: <unk>, person, location, group, corporation, product, creative-work
```
With the following code, you rename some labels ('person' is renamed to 'PER'), merge two labels into one ('group' and 'corporation' are merged into 'ORG'), and ignore two other labels ('creative-work' and 'product' are ignored):
```python
corpus = WNUT_17(in_memory=False, label_name_map={
    'person': 'PER',
    'location': 'LOC',
    'group': 'ORG',
    'corporation': 'ORG',
    'product': 'O',
    'creative-work': 'O',  # by renaming to 'O' this tag gets ignored
})
```
which prints:
```
Dictionary with 4 tags: <unk>, PER, LOC, ORG
```
You can manipulate the data even more with custom preprocessing functions. See the example in #2708.
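Conceptually, the mapping is just a dictionary applied to each raw label; a minimal pure-Python sketch of the rename/merge/ignore logic (a hypothetical helper, not Flair's internal code):

```python
def map_labels(labels, label_name_map):
    # rename each label; labels mapped to 'O' count as unlabeled and are dropped
    mapped = [label_name_map.get(label, label) for label in labels]
    return [label for label in mapped if label != 'O']

label_name_map = {
    'person': 'PER', 'location': 'LOC',
    'group': 'ORG', 'corporation': 'ORG',
    'product': 'O', 'creative-work': 'O',
}
raw = ['person', 'location', 'group', 'corporation', 'product', 'creative-work']
print(sorted(set(map_labels(raw, label_name_map))))  # ['LOC', 'ORG', 'PER']
```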
### Other New Features and Data Sets

- `WordTagger` class for simple word-level predictions (#2607)
- `WordEmbeddings` can now be fine-tuned in Flair (#2491) by setting `fine_tune=True`. This also adds the fine-tuning mode of https://arxiv.org/abs/2110.02861, which seems to "reduce gradient variance that comes from the highly non-uniform distribution of input tokens"
- `NER_MULTI_CONER` dataset (#2507)

## Preview Features

Some preview features are in beta stage; use at your own risk.
### Prototypical networks in Flair (#2627)

Prototype networks learn prototypes for each target class. For each data point to be classified, the network predicts a vector in class-prototype space, which is then compared to all class prototypes. The prediction is the closest class prototype. See the paper Prototypical Networks for Few-shot Learning for more info.
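For intuition, the closest-prototype rule can be sketched in a few lines of plain Python (toy 2-d vectors, assuming euclidean distance):

```python
import math

def closest_prototype(embedding, prototypes):
    # predict the class whose prototype is nearest in euclidean distance
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda cls: dist(embedding, prototypes[cls]))

# hypothetical class prototypes, for illustration only
prototypes = {'PER': (1.0, 0.0), 'LOC': (0.0, 1.0)}
print(closest_prototype((0.9, 0.2), prototypes))  # PER
```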
@plonerma implemented a custom decoder that can be added to any Flair model that inherits from `DefaultClassifier` (i.e. nearly all Flair models). For instance, use this script:
```python
from flair.data import Corpus
from flair.datasets import UP_ENGLISH
from flair.embeddings import TransformerWordEmbeddings
from flair.models import WordTagger
from flair.nn import PrototypicalDecoder
from flair.trainers import ModelTrainer

# what tag do we want to predict?
tag_type = 'frame'

# get a corpus
corpus: Corpus = UP_ENGLISH().downsample(0.1)

# make the tag dictionary from the corpus
tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)

# initialize simple embeddings
embeddings = TransformerWordEmbeddings(model="distilbert-base-uncased",
                                       fine_tune=True,
                                       layers='-1')

# initialize prototype decoder
decoder = PrototypicalDecoder(num_prototypes=len(tag_dictionary),
                              embeddings_size=embeddings.embedding_length,
                              distance_function='euclidean',
                              normal_distributed_initial_prototypes=True,
                              )

# initialize the WordTagger, but pass the prototype decoder
tagger = WordTagger(embeddings, tag_dictionary, tag_type, decoder=decoder)

# initialize trainer
trainer = ModelTrainer(tagger, corpus)

# run training
trainer.fine_tune('resources/taggers/prototypical_decoder')
```

### Other Beta Features
With Flair expanding to many new NLP tasks (relation extraction, entity linking, etc.) and model types, we made a number of refactorings to reduce redundancy and make it easier to extend Flair.
### Major refactoring of Label Logic in Flair (#2607 #2609 #2645)

The labeling logic was growing too complex to accommodate new tasks. With this release, we refactored this logic such that complex label classes like `SpanLabel`, `RelationLabel`, etc. are removed in favor of a single `Label` class for all types of labels. The `Sentence` object will now be automatically aware of all labels added to it.
To illustrate the difference, consider a before-and-after of how to add an entity label to a sentence.
Before:
```python
# example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")

# create span for "Humboldt Universität zu Berlin"
span = Span(sentence[0:4])

# make a Span-label
span_label = SpanLabel(span=span, value='University')

# add Span-label to sentence
sentence.add_complex_label(typename='ner', label=span_label)
```
Now:
```python
# example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")

# directly add a label to the span "Humboldt Universität zu Berlin"
sentence[0:4].add_label("ner", "Organization")
```
So you can now just get a span from the sentence and add a label to it directly. It will get registered on the sentence as well.
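The registration behavior can be illustrated with a toy model (hypothetical classes for illustration, not Flair's actual implementation):

```python
class Sentence:
    def __init__(self, tokens):
        self.tokens = tokens
        self.labels = []          # labels added to any span end up here

    def __getitem__(self, key):
        return Span(self, key)

class Span:
    def __init__(self, sentence, key):
        self.sentence = sentence
        self.text = ' '.join(sentence.tokens[key])

    def add_label(self, typename, value):
        # adding a label to the span also registers it on the parent sentence
        self.sentence.labels.append((typename, value, self.text))

s = Sentence("Humboldt Universität zu Berlin is located in Berlin .".split())
s[0:4].add_label("ner", "Organization")
print(s.labels)  # [('ner', 'Organization', 'Humboldt Universität zu Berlin')]
```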
### Refactoring of printouts (#2704)

We changed and unified printouts across all Flair data points and labels, and updated the documentation to reflect this. Printouts should hopefully now be more concise. Let us know what you think.
### Unified classes to reduce redundancy

Next to too many Label classes (see above), we also had too many corpora that essentially do the same thing, two partially overlapping transformer embedding classes, and too much redundancy in our tokenization classes. This release makes many refactorings to make the code more maintainable:

- We previously had `ColumnCorpus`, `UniversalDependenciesCorpus`, `CoNNLuCorpus`, and `EntityLinkingCorpus`, which resulted in too much redundancy. Now, there is only the `ColumnCorpus` for all such datasets.
- `TransformerWordEmbedding` and `TransformerDocumentEmbedding`: thanks to @helpmefindaname, they now both inherit from the same base object and share all features.
- `Tokenizer` classes no longer return lists of `Token`, but rather lists of strings that the `Sentence` object converts to tokens, centralizing the offset and `whitespace_after` detection in one place.

The `DefaultClassifier` is the base class for nearly all models in Flair. With this release, we make a number of simplifications to reduce redundancy across classes and make it more modular:

- `forward_pass` simplified to return 3 instead of 4 arguments
- `forward_pass` returns embeddings instead of logits, allowing us to easily switch out the decoder (see the beta feature on prototype networks above)
- removed the `spawn` logic we no longer need due to the Label refactoring

### Major refactoring of `SequenceTagger` for better modularity and code readability
Spans are no longer stored as word-level 'bioes' tags, but directly as span-level annotations. The `SequenceTagger` will still internally use BIO/BIOES tags, but the corpora and sentences no longer explicitly store this information.

So you now choose the labeling format when instantiating the `SequenceTagger`, i.e.:
```python
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    tag_format="BIOES",  # choose if you want to use BIOES or BIO internally
)
```
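The span-to-tag conversion the tagger performs internally can be sketched as follows (a hypothetical helper with toy input, not Flair's actual code):

```python
def span_to_bioes(num_tokens, span_start, span_end, label):
    # span_end is exclusive; tokens outside the span get 'O'
    tags = ['O'] * num_tokens
    if span_end - span_start == 1:
        tags[span_start] = f'S-{label}'
    else:
        tags[span_start] = f'B-{label}'
        for i in range(span_start + 1, span_end - 1):
            tags[i] = f'I-{label}'
        tags[span_end - 1] = f'E-{label}'
    return tags

# "Humboldt Universität zu Berlin" = tokens 0..3 of a 9-token sentence
print(span_to_bioes(9, 0, 4, 'ORG'))
# ['B-ORG', 'I-ORG', 'I-ORG', 'E-ORG', 'O', 'O', 'O', 'O', 'O']
```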
Internally, this refactoring makes a number of changes and simplifications:
- A unified list of convenience properties on the `DataPoint` class, including properties to get the `start_position` and `end_position` of data points, their `text`, their `tag` and `score` (if they have only one tag), and an `unlabeled_identifier`
- `set_embedding()` and `to()` moved from the data point classes (`Sentence`, `Token`, etc.) to their parent `DataPoint`
- `get_tag` and `add_tag` have been removed from `Token` in favor of the `get_label` and `add_label` methods of the parent `DataPoint` class
- `ColumnCorpus` will automatically identify which columns are span labels and treat them accordingly

### Code quality checks

They are back and more strict than ever! Thanks to @helpmefindaname, we now include mypy and formatting tests as part of our build process, which led to many changes in the code and a much greater chance of catching errors early.
## Speed and Memory Improvements

- `EntityLinker` class refactored for speed (#2607)
- Speed-up of the `evaluate()` method, especially for large datasets (#2607)
- `ColumnCorpus` no longer does disk reads when `in_memory=False`; it simply stores the raw data in memory, leading to significant speed-ups on large datasets (#2607)
- Improvements to the `Dictionary` (#2532)
- Improvements to the `WSD_UFSAC` corpus (#2521)

## Other Improvements

- The `DocumentPoolEmbeddings` class can now be instantiated with only a single embedding (#2645)
- You can now set a `min_count` when computing the label dictionary; labels below that count will be UNK'ed (e.g. `tag_dictionary = corpus.make_label_dictionary("ner", min_count=10)`) (#2607)
- The `Dictionary` will now compute count statistics for labels in a corpus (#2607)
can now handle relation annotation, dependency tree information and UD feats and misc (#2607)RetroSearch is an open source project built by @garambo | Open a GitHub Issue