Hadoop MapReduce, Java implementation questions


Currently I'm getting into Apache Hadoop (with the Java implementation of MapReduce jobs). I looked at some examples (like the WordCount example) and I've had success playing around with writing custom MapReduce apps (I'm using the Cloudera Hadoop Demo VM). My question is about some implementation and runtime details.

The prototype of the job class is as follows:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class WordCount {

        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                // mapping
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                // reducing
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            // setting the map and reduce classes, and various configs
            JobClient.runJob(conf);
        }
    }

I have some questions. I tried to google them, but I must say that the documentation on Hadoop is very formal (like a big reference book) and not suitable for beginners.

My questions:

  • Do the Map and Reduce classes have to be static inner classes of the main class, or can they be anywhere (just visible from main)?
  • Can I use anything that Java SE and the available libraries have to offer, as in an ordinary Java SE app? I mean things like JAXB, Guava, Jackson for JSON, etc.
  • What is the best practice for writing generic solutions? I mean: we want to process big amounts of log files in different (but similar) ways. The last token of each log row is a JSON map with some entries. One processing could be: count and group the log rows on (keyA, keyB from the map), and another could be: count and group the log rows on (keyX, keyY from the map). (I'm thinking of a config-file-based solution, where you provide the necessary entries to the program; if you need a new resolution, you just have to provide the config and run the app.)
  • This may be relevant: in the WordCount example the Map and Reduce classes are static inner classes and main() has zero influence on them; it just provides these classes to the framework. Can you make these classes non-static, and give them fields and a constructor to alter the runtime with some current values (like the config parameters I mentioned)?

Maybe I'm digging into the details unnecessarily. The overall question is: is a Hadoop MapReduce program still a normal Java SE app of the kind we are used to?

Here are the answers.

  1. The mapper and reducer classes can be separate Java classes, anywhere in the package structure, or even in separate JAR files, as long as the class loader of the MapTask/ReduceTask is able to load them. The example you have shown is for quick testing by Hadoop beginners.

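  2. Yes, you can use any Java libraries. These third-party JARs should be made available to the MapTask/ReduceTask, either through the generic options of the hadoop jar command (-libjars is the one intended for JARs; -files ships arbitrary files) or by using the Hadoop API. Look at the link here for more information on adding third-party libraries to the Map/Reduce classpath.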

  3. Yes, you can configure and pass configurations to the Map/Reduce jobs using either of these approaches.

    3.1 Use the org.apache.hadoop.conf.Configuration object, as below, to set the configurations in the client program (the Java class with the main() method):

        Configuration conf = new Configuration();
        conf.set("config1", "value1");
        // new Job(conf, name) is deprecated in Hadoop 2.x; Job.getInstance(conf, name) replaces it
        Job job = new Job(conf, "whole file input");

    The Map/Reduce programs have access to the Configuration object and can read the values set for the properties using the get() method. This approach is advisable if the configuration settings are small.

    3.2 Use the distributed cache to load the configurations and make them available in the Map/Reduce programs. Click here for details on the distributed cache. This approach is more advisable for larger configurations.

  4. The main() client program should be responsible for configuring and submitting the Hadoop job. If any of the configurations are not set, the default settings are used. This covers configurations such as the mapper class, reducer class, input path, output path, input format class, number of reducers, etc.

     Additionally, look at the documentation here on job configuration.
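
For point 3.1, one way to picture the config-driven grouping from the question is to keep the key extraction in a plain Java helper that the mapper calls. The sketch below is only an illustration under assumptions: the class name GroupKeySketch, its methods, the "|" separator and the comma-separated key list are all made up here, and the regex is a naive stand-in for a real JSON parser such as Jackson (flat objects with string values only). In a real job the key list would be read from the job configuration (for example with context.getConfiguration().get(...) in the new API) instead of being passed in directly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: builds a grouping key from a log row whose last
// token is a flat JSON object of string values. All names are illustrative.
class GroupKeySketch {

    // Matches "name":"value" pairs. Deliberately naive: no nesting,
    // no escaped quotes, string values only.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"");

    // Parses the trailing {...} of a log row into a name -> value map.
    static Map<String, String> parseTrailingMap(String logRow) {
        Map<String, String> fields = new LinkedHashMap<>();
        int start = logRow.lastIndexOf('{');
        if (start < 0) {
            return fields; // no JSON token on this row
        }
        Matcher m = PAIR.matcher(logRow.substring(start));
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    // groupKeys is a comma-separated list such as "keyA,keyB"; in a job it
    // would come from the job configuration rather than from the caller.
    static String groupKey(String logRow, String groupKeys) {
        Map<String, String> fields = parseTrailingMap(logRow);
        List<String> parts = new ArrayList<>();
        for (String name : groupKeys.split(",")) {
            // A missing field becomes an empty string so the key keeps its shape.
            parts.add(fields.getOrDefault(name.trim(), ""));
        }
        return String.join("|", parts);
    }

    public static void main(String[] args) {
        String row = "2013-01-01 INFO {\"keyA\":\"a1\",\"keyB\":\"b1\",\"keyX\":\"x1\"}";
        System.out.println(groupKey(row, "keyA,keyB")); // prints a1|b1
        System.out.println(groupKey(row, "keyX,keyY")); // prints x1|
    }
}
```

With this shape, switching from (keyA, keyB) to (keyX, keyY) only changes the configured string, and the helper can be unit-tested without a cluster. Since the framework creates the mapper instances itself, per-run values normally reach the mapper through the configuration rather than through constructor arguments.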

Yes, Map/Reduce programs are still Java SE programs; however, they are distributed across the machines in the Hadoop cluster. Let's say the Hadoop cluster has 100 nodes and you submit the word count example. The Hadoop framework creates a Java process for each of the map and reduce tasks and calls the callback methods such as map()/reduce() on the subset of machines where the data exists. Essentially, your mapper/reducer code gets executed on the machine where the data exists. I would recommend that you read Chapter 6 of Hadoop: The Definitive Guide.
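
To make the map()/reduce() callbacks a bit more concrete, here is a single-process imitation of the word count flow in plain Java. It is not Hadoop code and uses no Hadoop classes; LocalWordCountSketch and its run() method are hypothetical stand-ins for what the framework does across many task processes (call map per record, group the values by key, call reduce per key).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of the map -> group-by-key -> reduce flow.
// This is NOT Hadoop code; class and method names are hypothetical.
class LocalWordCountSketch {

    // "Map" step: one input line in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // "Reduce" step: all values collected for one key in, a single sum out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: map every line, group by key, reduce per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello hadoop", "hello world")));
    }
}
```

The map and reduce methods here are ordinary Java; on a cluster the same kind of logic sits inside the Mapper/Reducer classes, and the grouping loop in run() is the part that Hadoop performs between the map and reduce tasks.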

I hope this helps.

