Showing content from https://github.com/attardi/wikiextractor/wiki/File-Format below:

File Format · attardi/wikiextractor Wiki · GitHub

Document files contain a series of Wikipedia articles, each represented by an XML <tt>doc</tt> element: ... ... ... ...

The element <tt>doc</tt> has the following attributes:

The content of a <tt>doc</tt> element consists of pure text, one sentence per line.
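As a rough illustration of reading such a file, the sketch below pulls each <tt>doc</tt> element out of a string and splits its content into sentences, one per line. It uses regular expressions rather than a strict XML parser because a document file holds a series of elements with no single root. The helper name <tt>iter_docs</tt> and the attribute names <tt>id</tt> and <tt>title</tt> in the sample are assumptions made for the example, not taken from the text above.

```python
import re
from typing import Dict, Iterator, List, Tuple

# One <doc ...> ... </doc> element. A regex is used because the file is a
# sequence of elements without a common root, so it is not one XML document.
DOC_RE = re.compile(r"<doc\b([^>]*)>(.*?)</doc>", re.DOTALL)
# Attributes of the form name="value" inside the opening tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def iter_docs(text: str) -> Iterator[Tuple[Dict[str, str], List[str]]]:
    """Yield (attributes, sentences) for every doc element found in text."""
    for match in DOC_RE.finditer(text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        # The content is plain text with one sentence per line.
        sentences = [ln.strip() for ln in match.group(2).splitlines() if ln.strip()]
        yield attrs, sentences


# Hypothetical sample; the attribute names are illustrative assumptions.
sample = '''<doc id="1" title="Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
</doc>
<doc id="2" title="Example">
Example.
A second article.
</doc>'''

docs = list(iter_docs(sample))
for attrs, sentences in docs:
    print(attrs, len(sentences))
```

For large files, the same idea can be applied while reading line by line instead of loading the whole file into memory.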

Here is an example of a <tt>doc</tt> element:

Harmonium. L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale. Sono stati costruiti anche alcuni harmonium con due manuali. ...

Notice that because of Wikipedia conventions, the first sentence is the title of the article.
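Since the first sentence carries the title, a consumer may want to separate it from the body. The following is a minimal sketch under that assumption; <tt>split_title</tt> is a hypothetical helper, and it takes the list of sentences (one per line) from a single <tt>doc</tt> element.

```python
from typing import List, Tuple


def split_title(sentences: List[str]) -> Tuple[str, List[str]]:
    """Return (title, body), treating the first sentence as the title."""
    if not sentences:
        return "", []
    first = sentences[0].strip()
    # Drop the single trailing period that ends the title sentence.
    title = first[:-1] if first.endswith(".") else first
    return title, sentences[1:]


title, body = split_title([
    "Harmonium.",
    "L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.",
    "Sono stati costruiti anche alcuni harmonium con due manuali.",
])
print(title)
print(len(body))
```

Only one trailing period is removed, so a title that itself ends in a period loses just the sentence terminator.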

Such documents are produced by Wikipedia Extractor.

