InputSplit customization in Hadoop -


I understand that in Hadoop, a large input file is split into smaller chunks that get processed by map functions on different nodes. I also learned that InputSplits can be customized. What I would like to know is whether the following type of customization is possible with an InputSplit:

I have a large input file coming into Hadoop, and I want a subset of the file, i.e. a fixed set of lines, to go along with every input split. In other words, every data chunk of the large file should contain this set of lines, irrespective of how the file is split.

To make the question clearer: if I need to compare one part of the input file (say part A) against the rest of the file's content, then every InputSplit going to a map function needs to have part A available for the comparison. Kindly guide me on this.

It is theoretically possible to split a big file (A, B, C, D, ...) into splits (A, B), (A, C), (A, D), ..., but you would have to write a lot of custom classes for that purpose. FileSplit, which extends InputSplit, only says that a split of a file begins at position start and has a fixed length. The actual access to the file is done by a RecordReader, e.g. LineRecordReader. So you would have to implement code that reads not just the actual split, but the header (part A) of the file as well.
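The mechanics such a reader would need can be sketched in plain Java. This is not the actual Hadoop RecordReader API, just an illustration of the idea: read a fixed-size header region (part A) first, then only the lines that fall inside this split's (start, start + length) byte range. Boundary handling (splits starting mid-line) is deliberately simplified here.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch of a header-aware split reader. A real implementation would
// subclass org.apache.hadoop.mapreduce.RecordReader and take its split
// boundaries from a FileSplit; here they are plain method parameters.
public class HeaderAwareSplitReader {

    public static List<String> readSplitWithHeader(
            Path file, long headerLength, long start, long length) throws IOException {
        List<String> records = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            // 1. Always read the common header region (part A),
            //    regardless of which split this reader owns.
            byte[] header = new byte[(int) headerLength];
            raf.readFully(header);
            for (String line : new String(header, StandardCharsets.UTF_8).split("\n")) {
                if (!line.isEmpty()) records.add(line);
            }
            // 2. Then read only the lines belonging to this split,
            //    skipping the header if the split overlaps it.
            raf.seek(Math.max(start, headerLength));
            long end = start + length;
            String line;
            while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                records.add(line);
            }
        }
        return records;
    }
}
```

Every call thus yields part A plus the split's own lines, which is exactly the (A, B), (A, C), ... pairing described above.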

However, I would argue that the approach you are looking for is impractical. The reason a record reader accesses only the positions (start, start + length) is data locality. In a very big file, parts A and Z may well reside on two different nodes.

Depending on the size of part A, a better idea is storing the common part in the DistributedCache. That way you can access the common data in each of the mappers in an efficient way. Refer to the Javadoc and http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata for further information.
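The pattern looks roughly like the following plain-Java simulation. In real Hadoop code you would call job.addCacheFile(...) in the driver and read the cached file in Mapper.setup(); the class and method names below are illustrative stand-ins, not Hadoop API:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simulation of the DistributedCache pattern: the small common part (A)
// is loaded once per mapper, then every record of the mapper's own
// split is compared against it.
public class CachedComparisonMapper {
    private Set<String> commonPart;  // in-memory copy of the cached part A

    // Corresponds to Mapper.setup(): read part A from the local cache copy.
    public void setup(List<String> cachedLines) {
        commonPart = new HashSet<>(cachedLines);
    }

    // Corresponds to Mapper.map(): emit whether the record occurs in part A.
    public String map(String record) {
        return record + "\t" + (commonPart.contains(record) ? "IN_A" : "NOT_IN_A");
    }
}
```

Because the cache file is copied to every node's local disk before the tasks start, each mapper pays the cost of loading part A exactly once, instead of every split dragging A across the network.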

