For a high-level description of each package, refer to the table below. See also the officially supported platforms and the supported Hadoop clusters.
The following table specifies where each package should be installed in your Hadoop cluster.
| Package  | Where to install                        |
|----------|-----------------------------------------|
| `plyrmr` | On every node in the cluster            |
| `ravro`  | Only on the node that runs the R client |
| `rhbase` | Only on the node that runs the R client |
| `rhdfs`  | Only on the node that runs the R client |
| `rmr2`   | On every node in the cluster            |
The RHadoop packages can be installed either manually or via a shell script; both methods are described in this section. However, the commands listed in the shell script are intended as guidance only and should be adapted to the standards of your IT department.
The following instructions are for installing and configuring `rmr2`.
On every node in the cluster, do the following:
Download the R package dependencies for `rmr2`. Check the values of the `Depends:` and `Imports:` lines in the package DESCRIPTION file for the most up-to-date list of dependencies. The suggested package quickcheck is needed only for testing; a link to it can be found on its repository.
Install `rmr2` and its dependent R packages.
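As a sketch, the two steps above might look like the following shell session. The dependency list and the package filename here are examples only (confirm them against the DESCRIPTION file of the `rmr2` version you downloaded), and the CRAN mirror is an assumption:

```shell
# Illustrative dependency list -- confirm against the Depends: and Imports:
# lines of rmr2's DESCRIPTION file before running
Rscript -e 'install.packages(c("Rcpp", "RJSONIO", "digest", "functional",
                               "reshape2", "stringr", "caTools"),
                             repos = "https://cran.r-project.org")'

# Install the downloaded rmr2 source package (filename is an example)
R CMD INSTALL rmr2_3.2.0.tar.gz
```

Run these commands on every node in the cluster.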
Update the environment variables needed by `rmr2`. The values of these environment variables will depend upon your Hadoop distribution.
**Important!** These environment variables only need to be set on the nodes that invoke `rmr2` MapReduce jobs, such as an edge node. If you don't know which nodes will be used, set these variables on each node. It is also recommended to add these environment variables to the file `/etc/profile` so that they are available to all users.
`HADOOP_CMD`: The complete path to the `hadoop` executable. For example:

export HADOOP_CMD=/usr/bin/hadoop

`HADOOP_STREAMING`: The complete path to the Hadoop Streaming jar file. For example:

export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
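Following the recommendation above, both variables can be made available to all users by adding lines like these to `/etc/profile`. The paths shown repeat the examples above and must be adjusted to your distribution; `<version>` is a placeholder, not a literal value:

```shell
# Example additions to /etc/profile (edit as root); adjust the paths to your
# Hadoop distribution and replace <version> with your Streaming jar version
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
```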
The following instructions are for installing and configuring `plyrmr`.
On every node in the cluster, do the following:
Download the dependent R packages for `plyrmr`. Check the values of the `Depends:` and `Imports:` lines in the package DESCRIPTION file for the most up-to-date list of dependencies.
Install `plyrmr` and its dependent R packages.
Update the environment variables needed by `plyrmr`. The values of these environment variables will depend upon your Hadoop distribution.
**Important!** These environment variables only need to be set on the nodes that invoke the `rmr2` MapReduce jobs, such as an edge node. If you don't know which nodes will be used, set these variables on each node. It is also recommended to add these environment variables to the file `/etc/profile` so that they are available to all users.
`HADOOP_CMD`: The complete path to the `hadoop` executable. For example:

export HADOOP_CMD=/usr/bin/hadoop

`HADOOP_STREAMING`: The complete path to the Hadoop Streaming jar file. For example:

export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
The following instructions are for installing and configuring `rhdfs`.
On the node that will run the R client, do the following:
Download and install the rJava R package.
**Important!** If the installation of rJava fails, you may need to configure R to run properly with Java. First, check that the Java JDK is installed and that the environment variable `JAVA_HOME` points to the Java JDK. To configure R to run with Java, type the command `R CMD javareconf`, and then try installing rJava again.
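A minimal sketch of the recovery path just described; the CRAN mirror is an assumption, and `R CMD javareconf` typically needs root privileges when R is installed system-wide:

```shell
# Point R at the JDK referenced by JAVA_HOME, then retry the rJava install
R CMD javareconf
Rscript -e 'install.packages("rJava", repos = "https://cran.r-project.org")'
```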
Update the environment variable needed by `rhdfs`. The value of this environment variable will depend upon your Hadoop distribution.
`HADOOP_CMD`: The complete path to the `hadoop` executable. For example:

export HADOOP_CMD=/usr/bin/hadoop
**Important!** This environment variable only needs to be set on the nodes that are using the `rhdfs` package, such as an `Edge` node. Also, it is recommended to add this environment variable to the file `/etc/profile` so that it will be available to all users.
Install `rhdfs` only on the node that will run the R client.
The following instructions are for installing and configuring `rhbase`.
On the node that will run the R client, do the following:
Build and install Apache Thrift. We recommend installing it on the node containing the HBase Master. See http://thrift.apache.org/ for more details on building and installing Thrift.
Install the dependencies for Thrift. At the prompt, type:
yum -y install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel openssl-devel
Important! If installing as NON-ROOT, then you will need a system administrator to help install these dependencies.
Unpack the Thrift archive. At the prompt, type:
tar -xzf thrift-0.8.0.tar.gz
Change to the `thrift` directory. At the prompt, type:

./configure --without-ruby --without-python
Compile and install Thrift. At the prompt, type:

make
make install

**Important!** If installing as NON-ROOT, then the install command will most likely require root privileges, and will have to be executed by your system administrator.
Create a symbolic link to the Thrift library so that it can be loaded by the `rhbase` package. Example of symbolic link:

ln -s /usr/local/lib/libthrift-0.8.0.so /usr/lib64

**Important!** If installing as NON-ROOT, then you may need a system administrator to execute this command for you.
Set the `PKG_CONFIG_PATH` environment variable. At the prompt, type:

export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig
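Taken together, the Thrift steps above amount to a shell session along these lines. The archive name, library version, and install locations repeat the examples in the text and should be adapted to your system; the unpacked directory name may differ, and non-root users will need an administrator for the install and symlink steps:

```shell
# Unpack, configure, compile, and install Thrift 0.8.0
tar -xzf thrift-0.8.0.tar.gz
cd thrift-0.8.0            # directory name may differ on your system
./configure --without-ruby --without-python
make
make install               # usually requires root privileges

# Expose the Thrift shared library where rhbase can find it
ln -s /usr/local/lib/libthrift-0.8.0.so /usr/lib64

# Let pkg-config locate Thrift when rhbase is compiled
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig
```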
Install `rhbase` only on the node that will run the R client.
The following instructions are for installing and configuring `ravro`.
On the node that will run the R client, do the following:
Download the dependent R packages for `ravro`. Check the values of the `Depends:` and `Imports:` lines in the package DESCRIPTION file for the most up-to-date list of dependencies.
Install `ravro` and its dependent R packages only on the node that will run the R client.
There are two sets of tests you should do to verify that your configuration is working.
## First Tests
The first set of tests will check that the installed packages can be loaded and initialized.
Load the `rmr2` package and execute some simple commands. At the R prompt, type the following commands:
> library(rmr2)
> from.dfs(to.dfs(1:100))
> from.dfs(mapreduce(to.dfs(1:100)))
If any errors occur:
1. Verify that Revolution R Open / Microsoft R Open is installed on each node in the cluster.
1. Check that `rmr2`, and its dependent packages are installed on each node in the cluster.
1. Make sure that a link to Rscript executable is in the PATH on each node in the Hadoop cluster.
1. Verify that the user that invoked 'R' has read and write permissions to HDFS.
1. Verify that the `HADOOP_CMD` environment variable is set, exported and its value is the complete path of the “hadoop” executable.
1. Verify that the `HADOOP_STREAMING` environment variable is set, exported and its value is the complete path to the Hadoop Streaming jar file.
1. If you encounter errors like the following (see below), check the `stderr` log file for the job, and resolve any errors reported. The easiest way to find the log files is to use the tracking URL (i.e. `http://<my_ip_address>:50030/jobdetails.jsp?jobid=job_201208162037_0011`)
```
12/08/24 21:21:16 INFO streaming.StreamJob: Running job: job_201208162037_0011
12/08/24 21:21:16 INFO streaming.StreamJob: To kill this job, run:
12/08/24 21:21:16 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=<my_ip_address>:8021 -kill job_201208162037_0011
12/08/24 21:21:16 INFO streaming.StreamJob: Tracking URL: http://<my_ip_address>:50030/jobdetails.jsp?jobid=job_201208162037_0011
12/08/24 21:21:17 INFO streaming.StreamJob: map 0% reduce 0%
12/08/24 21:21:23 INFO streaming.StreamJob: map 50% reduce 0%
12/08/24 21:21:31 INFO streaming.StreamJob: map 50% reduce 17%
12/08/24 21:21:45 INFO streaming.StreamJob: map 100% reduce 100%
12/08/24 21:21:45 INFO streaming.StreamJob: To kill this job, run:
12/08/24 21:21:45 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=<my_ip_address>:8021 -kill job_201208162037_0011
12/08/24 21:21:45 INFO streaming.StreamJob: Tracking URL: http://<my_ip_address>:50030/jobdetails.jsp?jobid=job_201208162037_0011
12/08/24 21:21:45 ERROR streaming.StreamJob: Job not successful. Error: NA
12/08/24 21:21:45 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, in.folder = if (is.list(input)) { :
hadoop streaming failed with error code 1
```
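Items 5 and 6 in the checklist above can be verified quickly with a small shell loop; this is just a convenience sketch and assumes a POSIX shell:

```shell
# Report whether HADOOP_CMD and HADOOP_STREAMING are set and point at
# paths that actually exist on this node
for var in HADOOP_CMD HADOOP_STREAMING; do
  val=$(eval "echo \$$var")
  if [ -z "$val" ]; then
    echo "$var is not set"
  elif [ ! -e "$val" ]; then
    echo "$var is set to '$val', which does not exist"
  else
    echo "$var looks OK: $val"
  fi
done
```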
Load and initialize the `rhdfs` package. At the R prompt, type the following commands:
> library(rhdfs)
> hdfs.init()
> hdfs.ls("/")
If any errors occur:
Verify that the `rJava` package is installed, configured, and loaded.
Verify that the `HADOOP_CMD` environment variable is set and exported, and that its value is the complete path of the `hadoop` executable.
Load and initialize the `rhbase` package.
Note: The “>” symbol in the following code is the ‘R’ prompt and should not be typed.
> library(rhbase)
> hb.init()
> hb.list.tables()
If any errors occur:
Verify that the Thrift Server is running (refer to your Hadoop documentation for more details).
Verify that the default port for the Thrift Server is 9090. Be sure there is not a port conflict with other running processes.
Check to be sure you are not running the Thrift Server in `hsha` or `nonblocking` mode. If necessary, use the `threadpool` command-line parameter to start the server (i.e., /usr/bin/hbase thrift --threadpool start).
## Second Tests
The second set of tests will verify that your configuration is working properly using the standard R mechanism for checking packages.
**Important!** Be aware that running the tests for the `rmr2` package may take a significant amount of time (hours) to complete. If you run the tests for `rmr2`, you will also need the quickcheck R package on every node in the cluster.
Go to the directory where the R package sources (`rmr2`, `rhdfs`, `rhbase`) exist.
Check each package. An example of the commands for each package:
R CMD check rmr2_3.2.0.tar.gz
R CMD check rhdfs_1.0.8.tar.gz
R CMD check rhbase_1.2.1.tar.gz
If any errors occur, refer to the error verification information described above under First Tests.
Note: Errors referring to the missing program `pdflatex` can be ignored, such as:

Error in texi2dvi("Rd2.tex", pdf = (out_ext == "pdf"), quiet = FALSE, : pdflatex is not available
Error in running tools::texi2dvi