java.lang.Object org.apache.hadoop.conf.Configuration org.apache.hadoop.mapred.JobConf
public class JobConf
A map/reduce job configuration.
JobConf
is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as-is described by JobConf
, however:
setNumReduceTasks(int)
), some parameters interact subtly rest of the framework and/or job-configuration and is relatively more complex for the user to control finely (e.g. setNumMapTasks(int)
).JobConf
typically specifies the Mapper
, combiner (if any), Partitioner
, Reducer
, InputFormat
and OutputFormat
implementations to be used etc.
Optionally JobConf
is used to specify other advanced facets of the job such as Comparator
s to be used, files to be put in the DistributedCache
, whether or not intermediate and/or job outputs are to be compressed (and how), debugability via user-provided scripts ( setMapDebugScript(String)
/setReduceDebugScript(String)
), for doing post-processing on task logs, task's stdout, stderr, syslog. and etc.
Here is an example on how to configure a job via JobConf
:
// Create a new JobConf JobConf job = new JobConf(new Configuration(), MyJob.class); // Specify various job-specific parameters job.setJobName("myjob"); FileInputFormat.setInputPaths(job, new Path("in")); FileOutputFormat.setOutputPath(job, new Path("out")); job.setMapperClass(MyJob.MyMapper.class); job.setCombinerClass(MyJob.MyReducer.class); job.setReducerClass(MyJob.MyReducer.class); job.setInputFormat(SequenceFileInputFormat.class); job.setOutputFormat(SequenceFileOutputFormat.class);
JobClient
, ClusterStatus
, Tool
, DistributedCache
JobConf()
JobConf(boolean loadDefaults)
JobConf(Class exampleClass)
JobConf(Configuration conf)
JobConf(Configuration conf, Class exampleClass)
JobConf(Path config)
JobConf(String config)
addDefaultResource, addResource, addResource, addResource, addResource, clear, dumpConfiguration, get, get, getBoolean, getClass, getClass, getClassByName, getClasses, getClassLoader, getConfResourceAsInputStream, getConfResourceAsReader, getEnum, getFile, getFloat, getInstances, getInt, getLocalPath, getLong, getRange, getRaw, getResource, getStringCollection, getStrings, getStrings, getValByRegex, iterator, main, readFields, reloadConfiguration, set, setBoolean, setBooleanIfUnset, setClass, setClassLoader, setEnum, setFloat, setIfUnset, setInt, setLong, setQuietMode, setStrings, size, toString, unset, write, writeXml, writeXml
MAPRED_TASK_MAXVMEM_PROPERTY
@Deprecated public static final String MAPRED_TASK_MAXVMEM_PROPERTY
MAPRED_JOB_MAP_MEMORY_MB_PROPERTY
and MAPRED_JOB_REDUCE_MEMORY_MB_PROPERTY
@Deprecated public static final String UPPER_LIMIT_ON_TASK_VMEM_PROPERTY
@Deprecated public static final String MAPRED_TASK_DEFAULT_MAXVMEM_PROPERTY
@Deprecated public static final String MAPRED_TASK_MAXPMEM_PROPERTY
public static final long DISABLED_MEMORY_LIMIT
public static final String MAPRED_LOCAL_DIR_PROPERTY
public static final String DEFAULT_QUEUE_NAME
public static final String MAPRED_JOB_MAP_MEMORY_MB_PROPERTY
public static final String MAPRED_JOB_REDUCE_MEMORY_MB_PROPERTY
@Deprecated public static final String MAPRED_TASK_JAVA_OPTS
MAPRED_MAP_TASK_JAVA_OPTS
or MAPRED_REDUCE_TASK_JAVA_OPTS
MAPRED_TASK_ULIMIT
can be used to control the maximum virtual memory of the child processes. The configuration variable MAPRED_TASK_ENV
can be used to pass other environment variables to the child processes.
public static final String MAPRED_MAP_TASK_JAVA_OPTS
MAPRED_MAP_TASK_ULIMIT
can be used to control the maximum virtual memory of the map processes. The configuration variable MAPRED_MAP_TASK_ENV
can be used to pass other environment variables to the map processes.
public static final String MAPRED_REDUCE_TASK_JAVA_OPTS
MAPRED_REDUCE_TASK_ULIMIT
can be used to control the maximum virtual memory of the reduce processes. The configuration variable MAPRED_REDUCE_TASK_ENV
can be used to pass process environment variables to the reduce processes.
public static final String DEFAULT_MAPRED_TASK_JAVA_OPTS
@Deprecated public static final String MAPRED_TASK_ULIMIT
MAPRED_MAP_TASK_ULIMIT
or MAPRED_REDUCE_TASK_ULIMIT
MAPRED_TASK_JAVA_OPTS
, else the VM might not start.
public static final String MAPRED_MAP_TASK_ULIMIT
MAPRED_MAP_TASK_JAVA_OPTS
, else the VM might not start.
public static final String MAPRED_REDUCE_TASK_ULIMIT
MAPRED_REDUCE_TASK_JAVA_OPTS
, else the VM might not start.
@Deprecated public static final String MAPRED_TASK_ENV
MAPRED_MAP_TASK_ENV
or MAPRED_REDUCE_TASK_ENV
k1=v1,k2=v2
. Further it can reference existing environment variables via $key
. Example:
public static final String MAPRED_MAP_TASK_ENV
k1=v1,k2=v2
. Further it can reference existing environment variables via $key
. Example:
public static final String MAPRED_REDUCE_TASK_ENV
k1=v1,k2=v2
. Further it can reference existing environment variables via $key
. Example:
public static final String WORKFLOW_ID
public static final String WORKFLOW_NAME
public static final String WORKFLOW_NODE_NAME
public static final String WORKFLOW_ADJACENCY_PREFIX_STRING
public static final String WORKFLOW_ADJACENCY_PREFIX_PATTERN
public static final String WORKFLOW_TAGS
public static final String MAPREDUCE_RECOVER_JOB
public static final boolean DEFAULT_MAPREDUCE_RECOVER_JOB
public JobConf()
public JobConf(Class exampleClass)
exampleClass
- a class whose containing jar is used as the job's jar.
public JobConf(Configuration conf)
conf
- a Configuration whose settings will be inherited.
public JobConf(Configuration conf, Class exampleClass)
conf
- a Configuration whose settings will be inherited.
exampleClass
- a class whose containing jar is used as the job's jar.
public JobConf(String config)
config
- a Configuration-format XML job description file.
public JobConf(Path config)
config
- a Configuration-format XML job description file.
public JobConf(boolean loadDefaults)
loadDefaults
is false, the new instance will not load resources from the default files.
loadDefaults
- specifies whether to load from the default files
public Credentials getCredentials()
public String getJar()
public void setJar(String jar)
jar
- the user jar for the map-reduce job.
public void setJarByClass(Class cls)
cls
- the example class.
public String[] getLocalDirs() throws IOException
IOException
public void deleteLocalFiles() throws IOException
IOException
public void deleteLocalFiles(String subdir) throws IOException
IOException
public Path getLocalPath(String pathString) throws IOException
IOException
public String getUser()
public void setUser(String user)
user
- the username for this job.
public void setKeepFailedTaskFiles(boolean keep)
keep
- true
if framework should keep the intermediate files for failed tasks, false
otherwise.
public boolean getKeepFailedTaskFiles()
public void setKeepTaskFilesPattern(String pattern)
pattern
- the java.util.regex.Pattern to match against the task names.
public String getKeepTaskFilesPattern()
public void setWorkingDirectory(Path dir)
dir
- the new current working directory.
public Path getWorkingDirectory()
public void setNumTasksToExecutePerJvm(int numTasks)
numTasks
- the number of tasks to execute; defaults to 1; -1 signifies no limit
public int getNumTasksToExecutePerJvm()
public InputFormat getInputFormat()
InputFormat
implementation for the map-reduce job, defaults to TextInputFormat
if not specified explicity.
InputFormat
implementation for the map-reduce job.
public void setInputFormat(Class<? extends InputFormat> theClass)
InputFormat
implementation for the map-reduce job.
theClass
- the InputFormat
implementation for the map-reduce job.
public OutputFormat getOutputFormat()
OutputFormat
implementation for the map-reduce job, defaults to TextOutputFormat
if not specified explicity.
OutputFormat
implementation for the map-reduce job.
public OutputCommitter getOutputCommitter()
OutputCommitter
implementation for the map-reduce job, defaults to FileOutputCommitter
if not specified explicitly.
OutputCommitter
implementation for the map-reduce job.
public void setOutputCommitter(Class<? extends OutputCommitter> theClass)
OutputCommitter
implementation for the map-reduce job.
theClass
- the OutputCommitter
implementation for the map-reduce job.
public void setOutputFormat(Class<? extends OutputFormat> theClass)
OutputFormat
implementation for the map-reduce job.
theClass
- the OutputFormat
implementation for the map-reduce job.
public void setCompressMapOutput(boolean compress)
compress
- should the map outputs be compressed?
public boolean getCompressMapOutput()
true
if the outputs of the maps are to be compressed, false
otherwise.
public void setMapOutputCompressorClass(Class<? extends CompressionCodec> codecClass)
CompressionCodec
for the map outputs.
codecClass
- the CompressionCodec
class that will compress the map outputs.
public Class<? extends CompressionCodec> getMapOutputCompressorClass(Class<? extends CompressionCodec> defaultValue)
CompressionCodec
for compressing the map outputs.
defaultValue
- the CompressionCodec
to return if not set
CompressionCodec
class that should be used to compress the map outputs.
IllegalArgumentException
- if the class was specified, but not found
public Class<?> getMapOutputKeyClass()
public void setMapOutputKeyClass(Class<?> theClass)
theClass
- the map output key class.
public Class<?> getMapOutputValueClass()
public void setMapOutputValueClass(Class<?> theClass)
theClass
- the map output value class.
public Class<?> getOutputKeyClass()
public void setOutputKeyClass(Class<?> theClass)
theClass
- the key class for the job output data.
public RawComparator getOutputKeyComparator()
RawComparator
comparator used to compare keys.
RawComparator
comparator used to compare keys.
public void setOutputKeyComparatorClass(Class<? extends RawComparator> theClass)
RawComparator
comparator used to compare keys.
theClass
- the RawComparator
comparator used to compare keys.
setOutputValueGroupingComparator(Class)
public void setKeyFieldComparatorOptions(String keySpec)
KeyFieldBasedComparator
options used to compare keys.
keySpec
- the key specification of the form -k pos1[,pos2], where, pos is of the form f[.c][opts], where f is the number of the key field to use, and c is the number of the first character from the beginning of the field. Fields and character posns are numbered starting with 1; a character position of zero in pos2 indicates the field's last character. If '.c' is omitted from pos1, it defaults to 1 (the beginning of the field); if omitted from pos2, it defaults to 0 (the end of the field). opts are ordering options. The supported options are: -n, (Sort numerically) -r, (Reverse the result of comparison)
public String getKeyFieldComparatorOption()
KeyFieldBasedComparator
options
public void setKeyFieldPartitionerOptions(String keySpec)
KeyFieldBasedPartitioner
options used for Partitioner
keySpec
- the key specification of the form -k pos1[,pos2], where, pos is of the form f[.c][opts], where f is the number of the key field to use, and c is the number of the first character from the beginning of the field. Fields and character posns are numbered starting with 1; a character position of zero in pos2 indicates the field's last character. If '.c' is omitted from pos1, it defaults to 1 (the beginning of the field); if omitted from pos2, it defaults to 0 (the end of the field).
public String getKeyFieldPartitionerOption()
KeyFieldBasedPartitioner
options
public RawComparator getOutputValueGroupingComparator()
WritableComparable
comparator for grouping keys of inputs to the reduce.
for details.
public void setOutputValueGroupingComparator(Class<? extends RawComparator> theClass)
RawComparator
comparator for grouping keys in the input to the reduce.
This comparator should be provided if the equivalence rules for keys for sorting the intermediates are different from those for grouping keys before each call to Reducer.reduce(Object, java.util.Iterator, OutputCollector, Reporter)
.
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed in a single call to the reduce function if K1 and K2 compare as equal.
Since setOutputKeyComparatorClass(Class)
can be used to control how keys are sorted, this can be used in conjunction to simulate secondary sort on values.
Note: This is not a guarantee of the reduce sort being stable in any sense. (In any case, with the order of available map-outputs to the reduce being non-deterministic, it wouldn't make that much sense.)
theClass
- the comparator class to be used for grouping keys. It should implement RawComparator
.
setOutputKeyComparatorClass(Class)
public boolean getUseNewMapper()
public void setUseNewMapper(boolean flag)
flag
- true, if the new api should be used
public boolean getUseNewReducer()
public void setUseNewReducer(boolean flag)
flag
- true, if the new api should be used
public Class<?> getOutputValueClass()
public void setOutputValueClass(Class<?> theClass)
theClass
- the value class for job outputs.
public Class<? extends Mapper> getMapperClass()
Mapper
class for the job.
Mapper
class for the job.
public void setMapperClass(Class<? extends Mapper> theClass)
Mapper
class for the job.
theClass
- the Mapper
class for the job.
public Class<? extends MapRunnable> getMapRunnerClass()
MapRunnable
class for the job.
MapRunnable
class for the job.
public void setMapRunnerClass(Class<? extends MapRunnable> theClass)
MapRunnable
class for the job. Typically used to exert greater control on Mapper
s.
theClass
- the MapRunnable
class for the job.
public Class<? extends Partitioner> getPartitionerClass()
Partitioner
used to partition Mapper
-outputs to be sent to the Reducer
s.
Partitioner
used to partition map-outputs.
public void setPartitionerClass(Class<? extends Partitioner> theClass)
Partitioner
class used to partition Mapper
-outputs to be sent to the Reducer
s.
theClass
- the Partitioner
used to partition map-outputs.
public Class<? extends Reducer> getReducerClass()
Reducer
class for the job.
Reducer
class for the job.
public void setReducerClass(Class<? extends Reducer> theClass)
Reducer
class for the job.
theClass
- the Reducer
class for the job.
public Class<? extends Reducer> getCombinerClass()
Reducer
for the job i.e. getReducerClass()
.
public void setCombinerClass(Class<? extends Reducer> theClass)
The combiner is an application-specified aggregation operation, which can help cut down the amount of data transferred between the Mapper
and the Reducer
, leading to better performance.
The framework may invoke the combiner 0, 1, or multiple times, in both the mapper and reducer tasks. In general, the combiner is called as the sort/merge result is written to disk. The combiner must:
Typically the combiner is same as the Reducer
for the job i.e. setReducerClass(Class)
.
theClass
- the user-defined combiner class used to combine map-outputs.
public boolean getSpeculativeExecution()
true
.
true
if speculative execution be used for this job, false
otherwise.
public void setSpeculativeExecution(boolean speculativeExecution)
speculativeExecution
- true
if speculative execution should be turned on, else false
.
public boolean getMapSpeculativeExecution()
true
.
true
if speculative execution be used for this job for map tasks, false
otherwise.
public void setMapSpeculativeExecution(boolean speculativeExecution)
speculativeExecution
- true
if speculative execution should be turned on for map tasks, else false
.
public boolean getReduceSpeculativeExecution()
true
.
true
if speculative execution be used for reduce tasks for this job, false
otherwise.
public void setReduceSpeculativeExecution(boolean speculativeExecution)
speculativeExecution
- true
if speculative execution should be turned on for reduce tasks, else false
.
public int getNumMapTasks()
1
.
public void setNumMapTasks(int n)
Note: This is only a hint to the framework. The actual number of spawned map tasks depends on the number of InputSplit
s generated by the job's InputFormat.getSplits(JobConf, int)
. A custom InputFormat
is typically used to accurately control the number of map tasks for the job.
The number of maps is usually driven by the total size of the inputs i.e. total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 or so for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.
The default behavior of file-based InputFormat
s is to split the input into logical InputSplit
s based on the total size, in bytes, of input files. However, the FileSystem
blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size.
Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with 82,000 maps, unless setNumMapTasks(int)
is used to set it even higher.
n
- the number of map tasks for this job.
InputFormat.getSplits(JobConf, int)
, FileInputFormat
, FileSystem.getDefaultBlockSize()
, FileStatus.getBlockSize()
public int getNumReduceTasks()
1
.
public void setNumReduceTasks(int n)
The right number of reduces seems to be 0.95
or 1.75
multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
With 0.95
all of the reduces can launch immediately and start transfering map outputs as the maps finish. With 1.75
the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks, failures etc.
Reducer NONEIt is legal to set the number of reduce-tasks to zero
.
In this case the output of the map-tasks directly go to distributed file-system, to the path set by FileOutputFormat.setOutputPath(JobConf, Path)
. Also, the framework doesn't sort the map-outputs before writing it out to HDFS.
n
- the number of reduce tasks for this job.
public int getMaxMapAttempts()
mapred.map.max.attempts
property. If this property is not already set, the default is 4 attempts.
public void setMaxMapAttempts(int n)
n
- the number of attempts per map task.
public int getMaxReduceAttempts()
mapred.reduce.max.attempts
property. If this property is not already set, the default is 4 attempts.
public void setMaxReduceAttempts(int n)
n
- the number of attempts per reduce task.
public String getJobName()
public void setJobName(String name)
name
- the job's new name.
public String getSessionId()
public void setSessionId(String sessionId)
sessionId
- the new session id.
public void setMaxTaskFailuresPerTracker(int noFailures)
noFailures
, the tasktracker is blacklisted for this job.
noFailures
- maximum no. of failures of a given job per tasktracker.
public int getMaxTaskFailuresPerTracker()
public int getMaxMapTaskFailuresPercent()
getMaxMapAttempts()
attempts before being declared as failed. Defaults to zero
, i.e. any failed map-task results in the job being declared as JobStatus.FAILED
.
public void setMaxMapTaskFailuresPercent(int percent)
getMaxMapAttempts()
attempts before being declared as failed.
percent
- the maximum percentage of map tasks that can fail without the job being aborted.
public int getMaxReduceTaskFailuresPercent()
getMaxReduceAttempts()
attempts before being declared as failed. Defaults to zero
, i.e. any failed reduce-task results in the job being declared as JobStatus.FAILED
.
public void setMaxReduceTaskFailuresPercent(int percent)
getMaxReduceAttempts()
attempts before being declared as failed.
percent
- the maximum percentage of reduce tasks that can fail without the job being aborted.
public void setJobPriority(JobPriority prio)
JobPriority
for this job.
prio
- the JobPriority
for this job.
public JobPriority getJobPriority()
JobPriority
for this job.
JobPriority
for this job.
public boolean getProfileEnabled()
public void setProfileEnabled(boolean newValue)
newValue
- true means it should be gathered
public String getProfileParams()
public void setProfileParams(String value)
value
- the configuration string
public Configuration.IntegerRanges getProfileTaskRange(boolean isMap)
isMap
- is the task a map?
public void setProfileTaskRange(boolean isMap, String newValue)
newValue
- a set of integer ranges of the map ids
public void setMapDebugScript(String mDbgScript)
The debug script can aid debugging of failed map tasks. The script is given task's stdout, stderr, syslog, jobconf files as arguments.
The debug command, run on the node where the map failed, is:
$script $stdout $stderr $syslog $jobconf.
The script file is distributed through DistributedCache
APIs. The script needs to be symlinked.
Here is an example on how to submit a script
job.setMapDebugScript("./myscript"); DistributedCache.createSymlink(job); DistributedCache.addCacheFile("/debug/scripts/myscript#myscript");
mDbgScript
- the script name
public String getMapDebugScript()
setMapDebugScript(String)
public void setReduceDebugScript(String rDbgScript)
The debug script can aid debugging of failed reduce tasks. The script is given task's stdout, stderr, syslog, jobconf files as arguments.
The debug command, run on the node where the map failed, is:
$script $stdout $stderr $syslog $jobconf.
The script file is distributed through DistributedCache
APIs. The script file needs to be symlinked
Here is an example on how to submit a script
job.setReduceDebugScript("./myscript"); DistributedCache.createSymlink(job); DistributedCache.addCacheFile("/debug/scripts/myscript#myscript");
rDbgScript
- the script name
public String getReduceDebugScript()
setReduceDebugScript(String)
public String getJobEndNotificationURI()
null
if it hasn't been set.
setJobEndNotificationURI(String)
public void setJobEndNotificationURI(String uri)
The uri can contain 2 special parameters: $jobId and $jobStatus. Those, if present, are replaced by the job's identifier and completion-status respectively.
This is typically used by application-writers to implement chaining of Map-Reduce jobs in an asynchronous manner.
uri
- the job end notification uri
JobStatus
, Job Completion and Chaining
public String getJobLocalDir()
When a job starts, a shared directory is created at location ${mapred.local.dir}/taskTracker/$user/jobcache/$jobid/work/
. This directory is exposed to the users through job.local.dir
. So, the tasks can use this space as scratch space and share files among them.
public long getMemoryForMapTask()
DISABLED_MEMORY_LIMIT
. For backward compatibility, if the job configuration sets the key MAPRED_TASK_MAXVMEM_PROPERTY
to a value different from DISABLED_MEMORY_LIMIT
, that value will be used after converting it from bytes to MB.
DISABLED_MEMORY_LIMIT
if unset.
public void setMemoryForMapTask(long mem)
public long getMemoryForReduceTask()
DISABLED_MEMORY_LIMIT
. For backward compatibility, if the job configuration sets the key MAPRED_TASK_MAXVMEM_PROPERTY
to a value different from DISABLED_MEMORY_LIMIT
, that value will be used after converting it from bytes to MB.
DISABLED_MEMORY_LIMIT
if unset.
public void setMemoryForReduceTask(long mem)
public String getQueueName()
public void setQueueName(String queueName)
queueName
- Name of the queue
public static long normalizeMemoryConfigValue(long val)
val
-
@Deprecated public long getMaxVirtualMemoryForTask()
getMemoryForMapTask()
and getMemoryForReduceTask()
MAPRED_TASK_MAXVMEM_PROPERTY
This method is deprecated. Now, different memory limits can be set for map and reduce tasks of a job, in MB. For backward compatibility, if the job configuration sets the key MAPRED_TASK_MAXVMEM_PROPERTY
to a value different from DISABLED_MEMORY_LIMIT
, that value is returned. Otherwise, this method will return the larger of the values returned by getMemoryForMapTask()
and getMemoryForReduceTask()
after converting them into bytes.
DISABLED_MEMORY_LIMIT
, if unset.
setMaxVirtualMemoryForTask(long)
@Deprecated public void setMaxVirtualMemoryForTask(long vmem)
setMemoryForMapTask(long mem)
and Use setMemoryForReduceTask(long mem)
MAPRED_TASK_MAXVMEM_PROPERTY
mapred.task.maxvmem is split into mapred.job.map.memory.mb and mapred.job.map.memory.mb,mapred each of the new key are set as mapred.task.maxvmem / 1024 as new values are in MB
vmem
- Maximum amount of virtual memory in bytes any task of this job can use.
getMaxVirtualMemoryForTask()
@Deprecated public long getMaxPhysicalMemoryForTask()
@Deprecated public void setMaxPhysicalMemoryForTask(long mem)
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4