A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://github.com/textpipe/textpipe below:

textpipe/textpipe: Textpipe: clean and extract metadata from text

THIS REPOSITORY IS NO LONGER MAINTAINED

textpipe: clean and extract metadata from text

textpipe is a Python package for converting raw text in to clean, readable text and extracting metadata from that text. Its functionalities include transforming raw text into readable text by removing HTML tags and extracting metadata such as the number of words and named entities from the text.

Vision: the zen of textpipe

It is recommended that you install textpipe using a virtual environment.

virtualenv venv -p python3.6
mkvirtualenv textpipe -p python3.6
pip install -r requirements.txt
A note on spaCy download model requirement

While the requirements.txt file that comes with the package calls for spaCy's en_core_web_sm model, this can be changed depending on the model and language you require for your intended use. See spaCy.io's page on their different models for more information.

>>> from textpipe import doc, pipeline
>>> sample_text = 'Sample text! <!DOCTYPE>'
>>> document = doc.Doc(sample_text)
>>> print(document.clean)
'Sample text!'
>>> print(document.language)
'en'
>>> print(document.nwords)
2

>>> pipe = pipeline.Pipeline(['CleanText', 'NWords'])
>>> print(pipe(sample_text))
{'CleanText': 'Sample text!', 'NWords': 3}

In order to extend the existing Textpipe operations with your own proprietary operations;

test_pipe = pipeline.Pipeline(['CleanText', 'NWords'])
def custom_op(doc, context=None, settings=None, **kwargs):
    return 1

custom_argument = {'argument' :1 }
test_pipe.register_operation('CUSTOM_STEP', custom_op)
test_pipe.steps.append(('CUSTOM_STEP', custom_argument ))

See CONTRIBUTING for guidelines for contributors.

0.12.1

0.12.0

0.11.9

0.11.8

0.11.7

0.11.6

0.11.5

0.11.4

0.11.1

0.11.0

0.9.0

0.8.6

0.8.5

0.8.4

0.8.3

0.8.2

0.8.1

0.8.0

0.7.2

0.7.0


RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4