dataReporter is an R package for documenting and creating reports on data cleanliness.
This GitHub page contains the development version of dataReporter. For the latest stable version, install the package directly from CRAN using:
install.packages("dataReporter")
To install the development version of dataReporter, run the following command from within R (this requires that the devtools package is already installed):
devtools::install_github("ekstroem/dataReporter")
A simple way to get started is to load the package and call the makeDataReport() function on a data frame. If you generate several reports for the same data, you may need to add the replace=TRUE argument to overwrite the existing report.
library("dataReporter")
data(trees)
makeDataReport(trees)
This will create a report with summaries and error checks for each variable in the trees data frame. The format of the report depends on your OS and whether you have a LaTeX installation on your computer, which is needed for creating PDF reports.
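If no LaTeX installation is available, or if an earlier report for the same data already exists, the call can be adjusted. The following is a minimal sketch, assuming that makeDataReport() accepts an output argument with the value "html" alongside the replace argument mentioned above; check ?makeDataReport for the options supported by your installed version.

```r
library("dataReporter")
data(trees)

# Assumption: output = "html" requests an HTML report, which does not
# need LaTeX. replace = TRUE overwrites a report generated earlier
# for the same data.
makeDataReport(trees, output = "html", replace = TRUE)
```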
The dataReporter package can also be used interactively by running checks for individual variables or for all variables in a dataset:
data(toyData)
check(toyData$events) # Individual check of events
check(toyData) # Check all variables at once
By default, the standard battery of tests is run depending on the variable type. If we just want a specific test for, say, a numeric variable, then we can specify that. All available checks can be viewed by calling allCheckFunctions(). See the documentation for an overview of the available checks and for how to create and include your own tests.
check(toyData$events, checks = setChecks(numeric = "identifyMissing"))
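Several checks can also be requested at once by passing a character vector of check function names. The sketch below assumes that identifyOutliers appears among the names listed by allCheckFunctions() in your installed version; if it does not, substitute any other name from that list.

```r
library("dataReporter")
data(toyData)

# List the available check functions and their descriptions
allCheckFunctions()

# Run two checks on the numeric variable events
# (assumption: "identifyOutliers" is among the listed check functions)
res <- check(toyData$events,
             checks = setChecks(numeric = c("identifyMissing", "identifyOutliers")))
res
```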
We can also access the graphics or summary tables that are produced for a variable by calling the visualize or summarize functions. One can visualize a single variable or a full dataset:
#Visualize a variable
visualize(toyData$events)
#Visualize a dataset
visualize(toyData)
The same is true for summaries. Note also that the choice of checks, visualizations, and summaries is customizable:
#Summarize a variable with default settings:
summarize(toyData$events)
#Summarize a variable with user-specified settings:
summarize(toyData$events, summaries = setSummaries(all = c("centralValue", "minMax")))
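Visualizations can be customized in the same way as summaries. This is a hedged sketch, assuming the package provides allVisualFunctions(), setVisuals(), and a visual function named basicVisual; confirm the names with allVisualFunctions() before relying on them.

```r
library("dataReporter")
data(toyData)

# List the available visual functions (assumed counterpart of allCheckFunctions())
allVisualFunctions()

# Visualize a variable with a user-specified visual function
# (assumption: "basicVisual" is among the listed visual functions)
visualize(toyData$events, visuals = setVisuals(all = "basicVisual"))
```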
You can read the main paper accompanying the package at the Journal of Statistical Software. It provides a detailed introduction to the dataReporter package (originally launched under the name dataMaid).
We also have two blog posts that provide an introduction to the package. They can be found here (the primary one) and here.
Moreover, we have created a vignette that describes how to extend dataReporter to include user-defined data screening checks, summaries, and visualizations. This vignette is called extending_dataReporter:
vignette("extending_dataReporter")