Given an HTML document, extract and clean up the main body text and title.
This is a Python port of a Ruby port of arc90's Readability project.
Installation is easy with pip; just run:
$ pip install readability-lxml
Alternatively, you can install it with conda:
$ conda install -c conda-forge readability-lxml
>>> import requests
>>> from readability import Document
>>> response = requests.get('http://example.com')
>>> doc = Document(response.content)
>>> doc.title()
'Example Domain'
>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n <h1>Example Domain</h1>\n <p>This domain is established to be used for illustrative examples in documents. You may use this\n domain in examples without prior coordination or asking for permission.</p> \n <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div> \n</body>\n</div></body></html>"""
This code is released under the Apache License 2.0.