This page describes the data objects and annotations used in Stanza, and how they interact with each other.
Document
A Document object holds the annotation of an entire document, and is automatically generated when a string is annotated by the Pipeline. It contains a collection of Sentences and entities (which are represented as Spans), and can be seamlessly translated into a native Python object.
Document contains the following properties:
Property | Type | Description
text | str | The raw text of the document.
sentences | List[Sentence] | The list of sentences in this document.
entities (ents) | List[Span] | The list of entities in this document.
num_tokens | int | The total number of tokens in this document.
num_words | int | The total number of words in this document.
Document also contains the following methods:
Method | Return Type | Description
iter_words | Iterator[Word] | An iterator that returns all of the words in this Document in order.
iter_tokens | Iterator[Token] | An iterator that returns all of the tokens in this Document in order.
to_dict | List[List[Dict]] | Dumps the whole document into a list of lists of dictionaries, where each dictionary represents a token and the inner lists group the tokens by sentence.
to_serialized | bytes | Dumps (with pickle) the whole document, including its text, to a byte array containing a list of lists of dictionaries for each token in each sentence in the document.
from_serialized | Document | A class method for creating and initializing a new document from the serialized bytes generated by Document.to_serialized().
Sentence
A Sentence object represents a sentence (as segmented by the TokenizeProcessor or provided by the user), and contains a list of the Tokens in the sentence, a list of all its Words, as well as a list of entities in the sentence (represented as Spans).
Sentence contains the following properties:
Property | Type | Description
doc | Document | A “back pointer” to the parent doc of this sentence.
text | str | The raw text for this sentence.
dependencies | List[(Word,str,Word)] | The list of dependencies for this sentence, where each item contains the head Word of the dependency relation, the type of dependency relation, and the dependent Word in that relation.
tokens | List[Token] | The list of tokens in this sentence.
words | List[Word] | The list of words in this sentence.
entities (ents) | List[Span] | The list of entities in this sentence.
sentiment | str | The sentiment value for this sentence, as a string. Note that only a few languages have a sentiment model.
constituency | ParseTree | The constituency parse for this sentence, as a ParseTree. Note that only a few languages have a constituency model.
sent_id | str | An ID given to sentences. Can be assigned numerically from the document index when parsing, or read from the # sent_id comments when reading a CoNLL file.
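The (head, relation, dependent) shape of each dependencies item can be illustrated without running a pipeline. The sketch below uses plain dictionaries as hand-written stand-ins for Word objects (they are not Stanza objects, and the analysis of the sample sentence is illustrative only) and resolves each 1-based head index into a triple, treating index 0 as the artificial root.

```python
from typing import Dict, List, Tuple

# Hand-written stand-ins for the words of "She reads books".
# These are plain dictionaries, not Stanza Word objects.
words = [
    {"id": 1, "text": "She", "head": 2, "deprel": "nsubj"},
    {"id": 2, "text": "reads", "head": 0, "deprel": "root"},
    {"id": 3, "text": "books", "head": 2, "deprel": "obj"},
]


def dependency_triples(words: List[Dict]) -> List[Tuple[str, str, str]]:
    """Build (head text, relation, dependent text) triples from 1-based head ids."""
    by_id = {w["id"]: w for w in words}
    triples = []
    for w in words:
        # Head id 0 is the artificial root, which has no word of its own.
        head_text = "ROOT" if w["head"] == 0 else by_id[w["head"]]["text"]
        triples.append((head_text, w["deprel"], w["text"]))
    return triples


triples = dependency_triples(words)
for head, rel, dep in triples:
    print(f"{head} --{rel}--> {dep}")
```

In Stanza itself the triples hold Word objects rather than strings, so the text of each side would be read from the Word's text property.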
Sentence also contains the following methods:
Method | Return Type | Description
to_dict | List[Dict] | Dumps the sentence into a list of dictionaries, where each dictionary represents a token in the sentence.
print_dependencies | None | Print the syntactic dependencies for this sentence.
print_tokens | None | Print the tokens for this sentence.
print_words | None | Print the words for this sentence.
tokens_string | str | Similar to print_tokens, but instead of printing the tokens, dumps them into a string.
words_string | str | Similar to print_words, but instead of printing the words, dumps them into a string.
Token
A Token object holds a token, and a list of its underlying syntactic Words. In the event that the token is a multi-word token (e.g., French au = à le), the token will have a range id as described in the CoNLL-U format specifications (e.g., 3-4), with its words property containing the underlying Words corresponding to those ids. In other cases, the Token object will function as a simple wrapper around one Word object, where its words property is a singleton.
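To make the token/word relationship concrete, here is a minimal sketch using two hypothetical dataclasses, MiniToken and MiniWord. They are illustrative stand-ins written for this example, not Stanza's Token and Word classes, and the segmentation of the French sentence is written by hand rather than produced by a Pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MiniWord:
    id: int    # 1-based position of the syntactic word in the sentence
    text: str


@dataclass
class MiniToken:
    id: Tuple[int, ...]  # (3, 4) for a multi-word token, (1,) otherwise
    text: str
    words: List[MiniWord] = field(default_factory=list)

    def is_mwt(self) -> bool:
        # A range id with two elements marks a multi-word token.
        return len(self.id) > 1


# "Il va au marché": the surface token "au" expands into "à" + "le".
tokens = [
    MiniToken((1,), "Il", [MiniWord(1, "Il")]),
    MiniToken((2,), "va", [MiniWord(2, "va")]),
    MiniToken((3, 4), "au", [MiniWord(3, "à"), MiniWord(4, "le")]),
    MiniToken((5,), "marché", [MiniWord(5, "marché")]),
]

num_tokens = len(tokens)
num_words = sum(len(t.words) for t in tokens)
mwt_texts = [t.text for t in tokens if t.is_mwt()]
print(num_tokens, num_words, mwt_texts)
```

The sentence has four tokens but five words, which is why a document's token count and word count can differ.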
Token contains the following properties:
Property | Type | Description
id | Tuple[int] | The index of this token in the sentence, 1-based. This index contains two elements (e.g., (1, 2)) when the corresponding token is a multi-word token, otherwise it contains just a single element (e.g., (1, )).
text | str | The text of this token. Example: ‘The’.
misc | str | Miscellaneous annotations with regard to this token. Used in the pipeline to store whether a token is a multi-word token, for instance.
words | List[Word] | The list of syntactic words underlying this token.
start_char | int | The start character index for this token in the raw text of the document. Particularly useful if you want to detokenize at some point, or apply annotations back to the raw text.
end_char | int | The end character index for this token in the raw text of the document. Particularly useful if you want to detokenize at some point, or apply annotations back to the raw text.
ner | str | The NER tag of this token, in BIOES format. Example: ‘B-ORG’.
spaces_after | str | The space(s) following a token. Note that this lives on Token rather than on Word, as no spaces are expected between the Words in an MWT.
spaces_before | str | The space(s) before a token, especially relevant at the start of a document.
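As an illustration of applying annotations back to the raw text, the sketch below wraps each token in brackets using only character offsets. The (text, start_char, end_char) triples are hand-written stand-ins for Token objects, and the sketch assumes that start_char is inclusive and end_char is exclusive, so that a Python slice recovers the token text.

```python
from typing import List, Tuple

text = "Stanza parses text."

# Hand-written (text, start_char, end_char) triples standing in for Tokens.
# Assumption: end_char is exclusive, matching Python slice semantics.
tokens = [
    ("Stanza", 0, 6),
    ("parses", 7, 13),
    ("text", 14, 18),
    (".", 18, 19),
]


def mark_tokens(text: str, tokens: List[Tuple[str, int, int]]) -> str:
    """Rebuild the raw text with every token wrapped in square brackets."""
    pieces, cursor = [], 0
    for tok_text, start, end in tokens:
        if text[start:end] != tok_text:
            raise ValueError(f"offsets {start}:{end} do not match {tok_text!r}")
        pieces.append(text[cursor:start])     # whitespace between tokens
        pieces.append(f"[{text[start:end]}]")
        cursor = end
    pieces.append(text[cursor:])              # any trailing text
    return "".join(pieces)


marked = mark_tokens(text, tokens)
print(marked)  # [Stanza] [parses] [text][.]
```

Because the gaps between tokens are copied from the raw text, the original spacing survives even where tokens are adjacent, as with the final period.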
Token also contains the following methods:
Method | Return Type | Description
to_dict | List[Dict] | Dumps the token into a list of dictionaries, each dictionary representing one of the words underlying this token.
pretty_print | str | Print this token with the words it expands into in one line.
Word
A Word object holds a syntactic word and all of its word-level annotations. In the event of multi-word tokens (MWT), words are generated as a result of applying the MWTProcessor, and are used in all downstream syntactic analyses such as tagging, lemmatization, and parsing. If a Word results from an MWT expansion, its text will usually not be found in the input raw text. Aside from multi-word tokens, Words should be similar to the familiar “tokens” one would see elsewhere.
Word contains these properties:
Property | Type | Description
id | int | The index of this word in the sentence, 1-based (index 0 is reserved for an artificial symbol that represents the root of the syntactic tree).
text | str | The text of this word. Example: ‘The’.
lemma | str | The lemma of this word.
upos (pos) | str | The universal part-of-speech of this word. Example: ‘NOUN’.
xpos | str | The treebank-specific part-of-speech of this word. Example: ‘NNP’.
feats | str | The morphological features of this word. Example: ‘Gender=Fem|Person=3’.
head | int | The id of the syntactic head of this word in the sentence, 1-based for actual words in the sentence (0 is reserved for an artificial symbol that represents the root of the syntactic tree).
deprel | str | The dependency relation between this word and its syntactic head. Example: ‘nmod’.
deps | str | The combination of head and deprel that captures all syntactic dependency information. Seen in CoNLL-U files released from Universal Dependencies, not predicted by our Pipeline.
misc | str | Miscellaneous annotations with regard to this word. The pipeline uses this field to store character offset information internally, for instance.
parent | Token | A “back pointer” to the parent token that this word is a part of. In the case of a multi-word token, a token can be the parent of multiple words.
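Since feats is stored as a single string, it is often convenient to split it into key/value pairs. The helper below, parse_feats, is a hypothetical function written for this example (it is not part of Stanza) and assumes the Key=Value|Key=Value layout shown in the example above, with "_" or an empty value meaning that no features are present.

```python
from typing import Dict, Optional


def parse_feats(feats: Optional[str]) -> Dict[str, str]:
    """Split a 'Key=Value|Key=Value' feature string into a dictionary."""
    # "_" marks an empty field in CoNLL-U; None or "" is treated the same way.
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


parsed = parse_feats("Gender=Fem|Person=3")
print(parsed)  # {'Gender': 'Fem', 'Person': '3'}
```

Values stay as strings (for instance '3' rather than 3), since feature values are labels rather than numbers.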
Word also contains the following methods:
Method | Return Type | Description
to_dict | Dict | Dumps the word into a dictionary with all its information.
pretty_print | str | Prints the word in one line with all its information.
Span
A Span object stores attributes of a contiguous span of text. A range of objects (e.g., named entities) can be represented as a Span.
Span contains the following properties:
Property | Type | Description
doc | Document | A “back pointer” to the parent document of this span.
text | str | The text of this span.
tokens | List[Token] | The list of tokens that correspond to this span.
words | List[Word] | The list of words that correspond to this span.
type | str | The entity type of this span. Example: ‘PERSON’.
start_char | int | The start character offset of this span in the document.
end_char | int | The end character offset of this span in the document.
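The link between token-level BIOES tags (the ner property of Token) and entity spans with a type can be sketched as a small decoder. This is an illustration of the tagging scheme only, not Stanza's own span-building code; the function name bioes_to_spans and the sample tags are made up for the example, and the decoder assumes a well-formed tag sequence.

```python
from typing import List, Tuple


def bioes_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (first_index, last_index, type) for each entity in a BIOES sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":                        # single-token entity
            spans.append((i, i, etype))
            start = None
        elif prefix == "B":                      # entity begins
            start = i
        elif prefix == "E" and start is not None:  # entity ends
            spans.append((start, i, etype))
            start = None
        elif prefix == "O":                      # outside any entity
            start = None
        # "I" tags continue the open entity and need no action.
    return spans


# Hand-written sample data, not Pipeline output.
token_texts = ["Chris", "Manning", "teaches", "at", "Stanford", "University", "."]
tags = ["B-PERSON", "E-PERSON", "O", "O", "B-ORG", "E-ORG", "O"]

spans = bioes_to_spans(tags)
entities = [(" ".join(token_texts[s:e + 1]), etype) for s, e, etype in spans]
print(entities)
```

Each decoded entry pairs a contiguous run of tokens with an entity type, which mirrors the tokens and type properties listed above.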
Span also contains the following methods:
Method | Return Type | Description
to_dict | Dict | Dumps the span into a dictionary containing all its information.
pretty_print | str | Prints the span in one line with all its information.
ParseTree
A ParseTree object is a nested tree structure, intended to represent the result of the constituency parser. Each layer of nesting has the following properties:
Property | Type | Description
label | str | The label of an inner node represents the bracket type. Preterminals have a POS tag as the label. Leaves have the text of the word as the label.
children | List[ParseTree] | The children of this bracket. Preterminals have one child and represent a tag and a word. The leaves represent just the word and have no children.
Adding new properties to Stanza data objects
New in v1.1
All Stanza data objects can be extended easily should you need to attach new annotations of interest to them, either through a new Processor you are developing, or from some custom code you’re writing.
To add a new annotation or property to a Stanza object, say a Document, simply call:
Document.add_property('char_count', default=0, getter=lambda self: len(self.text), setter=None)
You should then be able to access the char_count property from all instances of the Document class. The interface here should be familiar if you have used class properties in Python or another object-oriented language: the first and only mandatory argument is the name of the property you wish to create, followed by default for the default value of this property, getter for reading the value of the property, and setter for setting the value of the property.
By default, all created properties are read-only, unless you explicitly assign a setter. The underlying variable for the new property is named _{property_name}, so in our example above, Stanza will automatically create a class variable named _char_count to store the value of this property should it be necessary. This is the variable your getter and setter functions should use, if needed.
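The pattern described here (a class-level default stored under _{property_name}, an optional getter, and read-only access unless a setter is given) can be sketched with Python's built-in property. The Annotated class below is a hypothetical stand-in written for this example; it is not Stanza's implementation, and the label property is made up to show a writable case.

```python
class Annotated:
    """Hypothetical stand-in for a Stanza data object; not Stanza's own code."""

    def __init__(self, text):
        self.text = text

    @classmethod
    def add_property(cls, name, default=None, getter=None, setter=None):
        backing = f"_{name}"
        setattr(cls, backing, default)  # class variable holding the default
        if getter is None:
            getter = lambda self: getattr(self, backing)
        # Passing setter=None to property() makes the attribute read-only.
        setattr(cls, name, property(getter, setter))


# Read-only property computed from existing data.
Annotated.add_property("char_count", default=0,
                       getter=lambda self: len(self.text))

# Writable property whose setter stores into the _label backing variable.
Annotated.add_property("label", default="none",
                       setter=lambda self, value: setattr(self, "_label", value))

doc = Annotated("Hello world")
print(doc.char_count)  # 11
print(doc.label)       # none
doc.label = "greeting"
print(doc.label)       # greeting
```

In this sketch, assigning to doc.char_count raises AttributeError because no setter was supplied, which corresponds to the read-only default described above.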