Document files contain a series of Wikipedia articles, each represented by an XML <tt>doc</tt> element: ... ... ... ...
The element <tt>doc</tt> has the following attributes:
The content of a <tt>doc</tt> element consists of plain text, one sentence per line.
Here is an example of a <tt>doc</tt> element:
Harmonium. L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale. Sono stati costruiti anche alcuni harmonium con due manuali. ...
Notice that because of Wikipedia conventions, the first sentence is the title of the article.
Such documents are produced by Wikipedia Extractor.
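The format described above can be read with a short script. The following is a minimal sketch, not the extractor's own reader. It assumes that a document file simply concatenates <tt>doc</tt> elements without a single root element, so it uses regular expressions rather than an XML parser. The function name <tt>parse_docs</tt> and the attribute names and values in the sample (<tt>id</tt>, <tt>url</tt>) are hypothetical placeholders; attributes are collected generically, whatever their names.

```python
import re

# Matches one doc element: group 1 is the raw attribute string, group 2 the body.
DOC_RE = re.compile(r"<doc\b([^>]*)>\n?(.*?)</doc>", re.DOTALL)
# Matches name="value" pairs inside the opening tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_docs(text):
    """Yield (attributes, title, sentences) for each doc element in text."""
    for match in DOC_RE.finditer(text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        # One sentence per line; skip blank lines.
        lines = [ln.strip() for ln in match.group(2).splitlines() if ln.strip()]
        # By convention the first sentence is the article title.
        title = lines[0].rstrip(".") if lines else ""
        yield attrs, title, lines[1:]


# Hypothetical sample: the attribute names and values are placeholders.
sample = '''<doc id="2" url="http://example.org/harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
</doc>
'''

docs = list(parse_docs(sample))
for attrs, title, sentences in docs:
    print(title, attrs, len(sentences))
```

Yielding the title separately from the remaining sentences keeps the first-sentence convention in one place, so callers do not have to repeat it.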