
#Introduction

NativeTask is a native engine written in C++ that runs inside a Hadoop MapReduce (MR) task. It focuses on task performance optimization and leaves scheduling and communication to the MR framework.

NativeTask can be used in two modes:

- native MapOutputCollector mode
- full native mode

For the first mode, little user work is needed beyond turning on an option, and users can run their Java MapReduce jobs transparently. For the second mode, users need to write their MapReduce jobs in C/C++.

NativeTask feature list:

- transparently support existing MRv1 and MRv2 apps
- support most common key types and all value types
- support Java combiner
- support Lz4 / Snappy / Gzip
- support CRC32 and CRC32C (hardware checksum)
- support Hive / Mahout / Pig
- support MR over HBase
- support non-sort Map
- support hash join

##Motivation

We found MapReduce slow for the following reasons:

- IO bound with Compression/Decompression overhead
- Inefficient Scheduling/Shuffle/Merge
- Inefficient memory management
- Suboptimal sorting
- Inefficient Serialization & Deserialization
- Inflexible programming paradigm
- Java limitations

NativeTask addresses these issues and is faster for the following reasons:

- Optimized Compression/Decompression codecs
- Highly efficient memory management
- Highly optimized sorting
- Hardware optimization when necessary
- No Java runtime side effects

##Performance overview

Here is a diagram of the NativeTask performance improvement (native MapOutputCollector mode) over stock Hadoop.

In full native mode, NativeTask is a further 2x faster.

##How to use

###Native MapOutputCollector mode

In MRv1, set mapreduce.map.output.collector.delegator.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator in JobConf. For example, to run Pi with the native MapOutputCollector:

hadoop jar hadoop-examples.jar pi -D mapreduce.map.output.collector.delegator.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator 10 10

MRv2 supports a pluggable MapOutputCollector. Set mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator in JobConf. The Pi example can then be run with the native MapOutputCollector as:

hadoop jar hadoop-mapreduce-examples.jar pi -D mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator 10 10
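Since the two MapReduce generations use different property names, a small wrapper can keep the choice in one place. The following is a minimal sketch, not part of NativeTask or Hadoop: `collector_opt` and `NATIVE_COLLECTOR` are hypothetical names introduced for illustration, and only the property names and class name come from the instructions above.

```shell
#!/bin/sh
# Class name taken from the instructions above.
NATIVE_COLLECTOR=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator

# collector_opt is a hypothetical helper (not part of NativeTask or Hadoop).
# It prints the -D option for the given MapReduce generation: mrv1 or mrv2.
collector_opt() {
  case "$1" in
    mrv1) printf '%s\n' "-D mapreduce.map.output.collector.delegator.class=$NATIVE_COLLECTOR" ;;
    mrv2) printf '%s\n' "-D mapreduce.job.map.output.collector.class=$NATIVE_COLLECTOR" ;;
    *)    echo "usage: collector_opt mrv1|mrv2" >&2; return 1 ;;
  esac
}

# Example (requires a Hadoop installation, so it is left commented out):
# hadoop jar hadoop-mapreduce-examples.jar pi $(collector_opt mrv2) 10 10
```

The command substitution is left unquoted on purpose so that the shell splits it into `-D` and the `name=value` pair.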

In both MRv1 and MRv2, check the task log. If it contains the line

INFO org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator: Native output collector can be successfully enabled!

then NativeTask has been enabled successfully.
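The log check can also be scripted. The sketch below assumes the task log has been saved to a local file. `native_collector_enabled` is a hypothetical helper name and the log path in the usage comment is a placeholder. Only the message text comes from the log line quoted above.

```shell
#!/bin/sh
# Message text copied from the task log line quoted above.
NATIVE_OK_MSG='Native output collector can be successfully enabled!'

# native_collector_enabled is a hypothetical helper: it succeeds (exit 0)
# if the given task log file contains the success message.
native_collector_enabled() {
  grep -qF -- "$NATIVE_OK_MSG" "$1"
}

# Example; the log path is an assumption and depends on the cluster setup:
# native_collector_enabled /path/to/task/syslog && echo "NativeTask enabled"
```

`grep -F` matches the message as a fixed string, so the `!` and spaces need no escaping.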

###Full native mode

##Related work

MAPREDUCE-2841 discusses some initial experiments in "task level native optimization", while our implementation comes with far more advanced features (e.g. support for more key types, Java combiner support) and has been used and verified in production environments.

