THIS REPOSITORY IS NO LONGER MAINTAINED
textpipe: clean and extract metadata from text

textpipe is a Python package for converting raw text into clean, readable text and extracting metadata from that text. Its functionality includes transforming raw text into readable text by removing HTML tags, and extracting metadata such as the number of words and the named entities in the text.
Cleaning removes HTML and other unreadable constructs.

It is recommended that you install textpipe using a virtual environment.
First, create your virtual environment using virtualenv or virtualenvwrapper.
Using virtualenv, if your default interpreter is Python 3.6:

    virtualenv venv -p python3.6

Using virtualenvwrapper:

    mkvirtualenv textpipe -p python3.6

Then install the requirements:

    pip install -r requirements.txt

A note on spaCy download model requirement
While the requirements.txt file that comes with the package calls for spaCy's en_core_web_sm model, this can be changed depending on the model and language you require for your intended use. See spaCy.io's page on their different models for more information.
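Because spaCy models installed through pip are importable as ordinary Python packages, one way to confirm that the model you chose is available before running textpipe is to look it up by package name. The following is a minimal sketch under that assumption; the helper name `model_installed` is hypothetical and is not part of textpipe or spaCy.

```python
import importlib.util


def model_installed(package_name):
    """Return True if a top-level package with this name can be imported.

    Hypothetical helper: pass a top-level name such as 'en_core_web_sm'.
    It only checks that the package is discoverable; it does not load it.
    """
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # 'en_core_web_sm' is the model named in requirements.txt.
    for name in ("en_core_web_sm",):
        status = "installed" if model_installed(name) else "missing"
        print(name, status)
```

If you swap in a different model, pass that model's package name instead.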
    >>> from textpipe import doc, pipeline
    >>> sample_text = 'Sample text! <!DOCTYPE>'
    >>> document = doc.Doc(sample_text)
    >>> print(document.clean)
    'Sample text!'
    >>> print(document.language)
    'en'
    >>> print(document.nwords)
    2
    >>> pipe = pipeline.Pipeline(['CleanText', 'NWords'])
    >>> print(pipe(sample_text))
    {'CleanText': 'Sample text!', 'NWords': 3}
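To make the idea behind the cleaned text and the word count more concrete, here is a rough sketch that uses only the Python standard library. It is not how textpipe implements these features; the names `strip_markup` and `count_words` are hypothetical, and the whitespace-based word count is a deliberate simplification.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, comments, and declarations."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text found between markup constructs.
        self.parts.append(data)


def strip_markup(raw_text):
    """Return raw_text without markup, with whitespace runs collapsed."""
    parser = _TextExtractor()
    parser.feed(raw_text)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


def count_words(clean_text):
    """Naive word count: the number of whitespace-separated tokens."""
    return len(clean_text.split())


if __name__ == "__main__":
    cleaned = strip_markup("<p>Hello <b>world</b>!</p>")
    print(cleaned, count_words(cleaned))
```

A real cleaner would also need to handle cases this sketch ignores, such as the contents of script and style elements.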
To extend the existing textpipe operations with your own proprietary operations, register a custom operation on a pipeline and append it to the pipeline's steps:
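    from textpipe import pipeline

    test_pipe = pipeline.Pipeline(['CleanText', 'NWords'])

    # A custom operation receives the document plus optional context and settings.
    def custom_op(doc, context=None, settings=None, **kwargs):
        return 1

    custom_argument = {'argument': 1}

    # Register the operation under a name, then add it as a pipeline step.
    test_pipe.register_operation('CUSTOM_STEP', custom_op)
    test_pipe.steps.append(('CUSTOM_STEP', custom_argument))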
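The register-then-append pattern above can be pictured as a small name-to-function registry. The sketch below is a hypothetical stand-in written for illustration only: `MiniPipeline` and `n_chars` are invented names, every step here is a `(name, settings)` tuple, and none of this is textpipe's actual implementation.

```python
class MiniPipeline:
    """Hypothetical stand-in for a pipeline with registrable operations."""

    def __init__(self, steps=None):
        # Each step is a (operation_name, settings) tuple.
        self.steps = list(steps or [])
        self._operations = {}

    def register_operation(self, name, func):
        """Make func available to steps under the given name."""
        self._operations[name] = func

    def __call__(self, doc, context=None):
        """Run every step in order and collect results keyed by step name."""
        results = {}
        for name, settings in self.steps:
            operation = self._operations[name]
            results[name] = operation(doc, context=context, settings=settings)
        return results


def custom_op(doc, context=None, settings=None, **kwargs):
    # Same signature as the custom operation in the example above.
    return 1


def n_chars(doc, context=None, settings=None, **kwargs):
    # Hypothetical second operation: length of the input string.
    return len(doc)


pipe = MiniPipeline()
pipe.register_operation('CUSTOM_STEP', custom_op)
pipe.steps.append(('CUSTOM_STEP', {'argument': 1}))
pipe.register_operation('N_CHARS', n_chars)
pipe.steps.append(('N_CHARS', None))

result = pipe('Sample text!')
print(result)
```

Keeping registration separate from the step list means an operation can be defined once and then enabled, reordered, or given different settings per pipeline.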
See CONTRIBUTING for guidelines for contributors.
0.12.1
0.12.0
0.11.9: ents properties
0.11.8: cats attribute
0.11.7
0.11.6
0.11.5
0.11.4
0.11.1
0.11.0
0.9.0
0.8.6
0.8.5
0.8.4
0.8.3
0.8.2
0.8.1
0.8.0
0.7.2
0.7.0: context kwarg, register_operation in pipeline