mannau/boilerpipeR: Interface to the boilerpipe Java library by Christian Kohlschütter (http://code.google.com/p/boilerpipe/)

boilerpipeR is an R package that provides an interface to boilerpipe, a Java library written by Christian Kohlschütter [1]. It extracts the main text content from HTML documents, stripping boilerplate such as ads, sidebars and headers from the HTML source. boilerpipe's extraction heuristics perform robustly across a wide range of website templates.

To install the latest release from CRAN, run:

install.packages("boilerpipeR")

Using the devtools package, you can install the latest development version of boilerpipeR from GitHub with:

library(devtools)
install_github("mannau/boilerpipeR")

Windows users need to use the following command to install from GitHub:

library(devtools)
install_github("mannau/boilerpipeR", args = "--no-multiarch")

To download a page and extract its main text, for example from the RStudio blog, use the following commands:

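library(boilerpipeR)

# asText = FALSE: treat the argument as a URL to download, not as HTML content
url <- "http://blog.rstudio.org/2014/05/09/reshape2-1-4/"
maintext <- ArticleExtractor(url, asText = FALSE)

# Print the extracted main text
cat(maintext)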
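
The example above lets boilerpipeR download the page itself. If the HTML is already in memory, it can be passed directly. The following is a minimal sketch, assuming that asText = TRUE makes ArticleExtractor treat its first argument as HTML content rather than as a URL; the HTML snippet is made up for illustration, and the text extracted from such a small document may differ from what you expect.

```r
library(boilerpipeR)

# Hypothetical HTML document, built only for illustration
html <- paste0(
  "<html><head><title>Example</title></head><body>",
  "<div id='nav'>Home | About | Contact</div>",
  "<p>",
  paste(rep("This is the main article text.", 20), collapse = " "),
  "</p>",
  "</body></html>"
)

# asText = TRUE (assumed here): treat `html` as HTML content, not as a URL
maintext <- ArticleExtractor(html, asText = TRUE)
cat(maintext)
```

This form is useful when pages are fetched separately, for example with custom request headers or from a local cache, and only the extraction step is delegated to boilerpipeR.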

[1] Christian Kohlschütter, Exploiting Links and Text Structure on the Web — A Quantitative Approach to Improving Search Quality, PhD Thesis

boilerpipe and boilerpipeR are both released under the Apache License, Version 2.0.

