Notebook-scoped R libraries enable you to create and modify custom R environments that are specific to a notebook session. When you install an R notebook-scoped library, only the current notebook and any jobs associated with that notebook have access to that library. Other notebooks attached to the same cluster are not affected.
Notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the beginning of each session, or whenever the notebook is detached from a cluster.
Notebook-scoped libraries are automatically available on workers for SparkR UDFs.
To install libraries for all notebooks attached to a cluster, use cluster-installed libraries. See Compute-scoped libraries.
Install notebook-scoped libraries in R
You can use any familiar method of installing packages in R, such as install.packages(), the devtools APIs, or Bioconductor.
R packages are accessible to worker nodes as well as the driver node.
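For example, here is a minimal sketch of installing a single CRAN package with install.packages(); the package name matches the examples that follow, and the repository shown is the same CRAN mirror used elsewhere on this page:
R
# Install a CRAN package into the notebook-scoped environment.
install.packages("caesar", repos = "http://cran.us.r-project.org")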
Manage notebook-scoped libraries in R
Install a package
R
require(devtools)
install_version(
  package = "caesar",
  repos = "http://cran.us.r-project.org"
)
Databricks recommends using a CRAN snapshot as the repository to guarantee reproducible results.
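As a hedged sketch of that recommendation, you can point the session at a dated snapshot before installing; the Posit Package Manager URL and date below are illustrative assumptions, not part of the original example:
R
# Resolve packages against a dated CRAN snapshot so repeated runs install the same versions.
# The snapshot URL and date are illustrative; substitute your preferred snapshot.
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/2024-01-02"))
install.packages("caesar")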
To install a package directly from GitHub, use devtools::install_github():
R
devtools::install_github("klutometis/roxygen")
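Bioconductor packages can be installed in the same notebook-scoped way; the sketch below uses BiocManager, and the package name limma is an illustrative assumption rather than part of the original examples:
R
# Install BiocManager, then use it to install a Bioconductor package (name is illustrative).
install.packages("BiocManager")
BiocManager::install("limma")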
Remove an R package from a notebook environment
To remove a notebook-scoped library from a notebook, use the remove.packages() command.
R
remove.packages("caesar")
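To confirm the package is no longer installed in the notebook environment, you can check the installed package list; this small check is an illustrative addition, not part of the original example:
R
# Returns FALSE once the package has been removed from the notebook-scoped library path.
"caesar" %in% rownames(installed.packages())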
Notebook-scoped R libraries with Spark UDFs
Notebook-scoped libraries are available on SparkR workers; just import a library to use it. For example, you can run the following to generate a caesar-encrypted message with a SparkR UDF:
R
require(devtools)
install_version(
  package = "caesar",
  repos = "http://cran.us.r-project.org"
)
library(SparkR)
sparkR.session()
hello <- function(x) {
  library(caesar)
  caesar("hello world")
}
spark.lapply(c(1, 2), hello)
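The same pattern applies to other SparkR UDF APIs: load the package with library() inside the function body so it is attached on each worker. The sketch below, which applies the package inside dapply() over a small SparkDataFrame, is an illustrative assumption and not part of the original example:
R
library(SparkR)
sparkR.session()
df <- createDataFrame(data.frame(id = 1:3))
encrypted <- dapply(
  df,
  function(pdf) {
    library(caesar)                 # attach the notebook-scoped package on the worker
    pdf$cipher <- caesar("hello world")
    pdf
  },
  schema = structType(
    structField("id", "integer"),
    structField("cipher", "string")
  )
)
collect(encrypted)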
Notebook-scoped R libraries and sparklyr
By default, in sparklyr::spark_apply(), the packages argument is set to TRUE. This copies libraries in the current libPaths to the workers, allowing you to import and use them on workers. For example, you can run the following to generate a caesar-encrypted message with sparklyr::spark_apply():
R
require(devtools)
install_version(
  package = "caesar",
  repos = "http://cran.us.r-project.org"
)
library(sparklyr)
sc <- spark_connect(method = 'databricks')
apply_caes <- function(x) {
  library(caesar)
  caesar("hello world")
}
sdf_len(sc, 5) %>%
  spark_apply(apply_caes)
If you do not want libraries to be available on workers, set packages to FALSE.
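A minimal sketch of that option follows; the base-R-only UDF is an illustrative assumption, since with packages = FALSE the function can rely only on what is already installed on the workers:
R
library(sparklyr)
sc <- spark_connect(method = "databricks")
# With packages = FALSE, notebook-scoped libraries are not copied to the workers,
# so this UDF uses only base R.
sdf_len(sc, 5) %>%
  spark_apply(function(df) {
    df$doubled <- df$id * 2
    df
  }, packages = FALSE)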
Use notebook-scoped libraries with RStudio
RStudio creates a separate library path for each user, so users are isolated from each other. However, that library path is not available on workers. If you want to use a package inside SparkR workers in a job launched from RStudio, you need to install it using cluster libraries. Alternatively, if you use sparklyr UDFs, packages installed inside RStudio are available to workers when using spark_apply(..., packages = TRUE).
To share a library across all notebooks on a cluster without making it available to SparkR workers, explicitly set the installation directory to /databricks/spark/R/lib. For example, with install.packages(), run install.packages("pckg", lib="/databricks/spark/R/lib"). Packages installed in /databricks/spark/R/lib are shared across all notebooks on the cluster, but they are not accessible to SparkR workers. To share libraries across notebooks and also workers, use cluster libraries.
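A minimal sketch of that workflow (the package name is an illustrative assumption): install into the shared path from one notebook, then load the package from another notebook attached to the same cluster:
R
# In one notebook: install into the shared driver library path.
install.packages("caesar", lib = "/databricks/spark/R/lib")
# In any other notebook attached to the same cluster: load it from that path.
library(caesar, lib.loc = "/databricks/spark/R/lib")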
There is no caching implemented for notebook-scoped libraries on a cluster. If you install a package in a notebook, and another user installs the same package in another notebook on the same cluster, the package is downloaded, compiled, and installed again.