An input format for processing individual Path objects one at a time. It is useful for distributing jobs that are inherently single-machine across the entire cluster: each task still runs on one machine, but it gets assigned to an under-utilized node at runtime, and the framework's built-in resilience to task failure comes for free.
This input format takes a set of paths and produces a separate input split for each one. If, for example, you need to unzip a collection of five files in HDFS, each individual file must be unzipped on a single machine, but the five files can at least be unzipped on five different machines. Using this input format lets you process each file however you wish in your mapper.
Job job = new Job(new Configuration());
...
job.setInputFormatClass(FileSetInputFormat.class);
FileSetInputFormat.addPath("/some/path/to/a/file");
FileSetInputFormat.addPath("/some/other/path/to/a/file");
// Also see FileSetInputFormat.addAllPaths(Collection paths);
public static class MyMapper extends Mapper<Path, NullWritable, ..., ...> {
    @Override
    protected void map(Path key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        // Do something with the path, e.g. open it and unzip it to somewhere.
        ...
    }
}
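If the goal is, say, to gunzip each file, the mapper could look like the sketch below. Everything beyond the FileSetInputFormat contract is an assumption here: the UnzipMapper name, the Text/NullWritable output types, and the choice to write the decompressed copy alongside the original.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class UnzipMapper extends Mapper<Path, NullWritable, Text, NullWritable> {
    @Override
    protected void map(Path key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = key.getFileSystem(context.getConfiguration());
        // Write the decompressed copy next to the original, dropping the ".gz" suffix.
        Path target = new Path(key.getParent(), key.getName().replaceAll("\\.gz$", ""));
        InputStream in = new GZIPInputStream(fs.open(key));
        OutputStream out = fs.create(target);
        try {
            IOUtils.copyBytes(in, out, 4096);
        } finally {
            in.close();
            out.close();
        }
        // Emit the unzipped path so the job's output records what was produced.
        context.write(new Text(target.toString()), NullWritable.get());
    }
}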
The Path keys are the same paths passed to FileSetInputFormat.addPath and FileSetInputFormat.addAllPaths. Duplicates are stripped, and one InputSplit is generated per unique Path. That's it!
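For the curious, split generation along these lines can be sketched as follows. This is not the library's actual source: PathSplit and getConfiguredPaths() are hypothetical stand-ins for its real split class and path storage.

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

@Override
public List<InputSplit> getSplits(JobContext context) throws IOException {
    // A LinkedHashSet strips duplicate paths while preserving insertion order.
    Set<Path> uniquePaths = new LinkedHashSet<Path>(getConfiguredPaths(context));
    List<InputSplit> splits = new ArrayList<InputSplit>(uniquePaths.size());
    for (Path path : uniquePaths) {
        splits.add(new PathSplit(path)); // one split wrapping a single Path
    }
    return splits;
}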