InputSplit customization in Hadoop
I understand that in Hadoop, a large input file is split into smaller chunks that get processed by map functions on different nodes. I also got to know that InputSplits can be customized. What I'd like to know is whether the following type of customization is possible with an InputSplit:

I have a large input file coming into Hadoop, and I want a subset of the file, i.e. a fixed set of lines, to go along with every input split. I mean that every data chunk of the large file should contain this set of lines, irrespective of how the file is split.

To make the question clearer: if I need to compare a part of the input file (say part a) against the rest of the file's content, then every InputSplit going to a map function needs to have part a available for the comparison. Kindly guide me on this.
It is theoretically possible to split a big file (a, b, c, d, ...) into splits (a, b), (a, c), (a, d), .... However, you'd have to write a lot of custom classes for this purpose. A FileSplit, which extends InputSplit, says that a split of a file begins at position start and has a fixed length. The actual access to the file is done by a RecordReader, e.g. a LineRecordReader. You would have to implement code that reads not only the actual split but also the header (part a) of the file.
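To illustrate, here is a minimal, dependency-free sketch of the reading logic such a custom RecordReader would implement (in Hadoop you would do this inside `RecordReader.initialize()`/`nextKeyValue()`; the class and method names below are illustrative, not Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: every split first yields the shared header lines (part "a"),
// then its own lines. HeaderAwareReader/readSplit are illustrative names.
public class HeaderAwareReader {

    /**
     * Returns the records a mapper would see for one split:
     * the shared header lines followed by the split's own lines.
     *
     * @param allLines    the whole file, line by line
     * @param headerCount number of leading lines that form part "a"
     * @param splitStart  first line index of this split (inclusive)
     * @param splitEnd    last line index of this split (exclusive)
     */
    public static List<String> readSplit(List<String> allLines,
                                         int headerCount,
                                         int splitStart,
                                         int splitEnd) {
        List<String> records = new ArrayList<>();
        // 1) Seek back to the start of the file and read the header,
        //    as the custom reader would do once, on initialization.
        records.addAll(allLines.subList(0, headerCount));
        // 2) Then read the split's own range, skipping any header lines
        //    that already fall inside it (i.e. in the first split).
        int from = Math.max(splitStart, headerCount);
        records.addAll(allLines.subList(from, splitEnd));
        return records;
    }

    public static void main(String[] args) {
        List<String> file = List.of("a1", "a2", "b", "c", "d");
        // Header = first two lines; split covering lines 2..4
        System.out.println(readSplit(file, 2, 2, 4)); // prints [a1, a2, b, c]
    }
}
```

Note that step 1 is exactly what breaks data locality: the header bytes usually live on a different node than the split being read.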
I'd argue, though, that the approach you're looking for is impractical. The reason the record reader accesses only the positions (start, start+length) is data locality. For a big file, parts a and z may sit on two different nodes.

Depending on the size of part a, a better idea is to store the common part in the DistributedCache. That way you can access the common data in each of your mappers in an efficient way. Refer to the Javadoc and http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata for further information.