hadoop - How to run an external program within a mapper or reducer, giving HDFS files as input and storing output files in HDFS?


I have an external program that takes a file as input and produces an output file.

    // for example
    //   input file:  in_file
    //   output file: out_file
    // run the external program
    ./vx < ${in_file} > ${out_file}

I want both the input and output files to be in HDFS.

I have a cluster of 8 nodes, and 8 input files, each containing 1 line:

    // 1st input file: 1.txt      1:0,0,0
    // 2nd input file: 2.txt      2:0,0,128
    // 3rd input file: 3.txt      3:0,128,0
    // 4th input file: 4.txt      4:0,128,128
    // 5th input file: 5.txt      5:128,0,0
    // 6th input file: 6.txt      6:128,0,128
    // 7th input file: 7.txt      7:128,128,0
    // 8th input file: 8.txt      8:128,128,128

I am using KeyValueTextInputFormat:

    key:   file name
    value: initial coordinates

For example, for the 5th file:

    key:   5
    value: 128,0,0
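A minimal driver sketch for this setup is shown below; the mapper class name, job name, and input/output paths are placeholders, and the separator property name is the one used by the Hadoop 2.x mapreduce API (older releases use key.value.separator.in.input.line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Driver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // split each line on ':' so the key is the file number and the value is the coordinates
            conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ":");

            Job job = Job.getInstance(conf, "external-program");
            job.setJarByClass(Driver.class);
            job.setMapperClass(MyMapper.class);              // placeholder mapper class
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(0);                        // map-only job, no reducers

            FileInputFormat.addInputPath(job, new Path("/input"));      // placeholder paths
            FileOutputFormat.setOutputPath(job, new Path("/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }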

Each map task generates a huge amount of data according to its initial coordinates.

Now I want to run the external program inside each map task and generate the output file.

But I am confused about how to do this when the files are in HDFS.

I can use 0 reducers and create the file in HDFS myself:

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path outFile = new Path(input_file_name);
    FSDataOutputStream out = fs.create(outFile);

    // generating data ......... and writing to HDFS
    out.writeUTF(lon + ";" + lat + ";" + depth + ";");

I am confused about how to run the external program against an HDFS file without first pulling the file to a local directory with:

   dfs -get  

Without using MapReduce, I get the results with the following shell script:

    #!/bin/bash

    if [ $# -lt 2 ]; then
        printf "usage: %s: <infile> <outfile> \n" $(basename $0) >&2
        exit 1
    fi

    in_file=/users/x34/data/$1
    out_file=/users/x34/data/$2

    cd "/users/x34/projects/externalprogram/model/"

    ./vx < ${in_file} > ${out_file}

    paste ${in_file} ${out_file} | awk '{print $1,"\t",$2,"\t",$3,"\t",$4,"\t",$5,"\t",$22,"\t",$23,"\t",$24}' > /users/x34/data/combined
    if [ $? -ne 0 ]; then
        exit 1
    fi

    exit 0

And I run it with:

    ProcessBuilder pb = new ProcessBuilder("shell_script", "in", "out");
    Process p = pb.start();
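When launching the script this way, the process output should be drained and the exit status checked before using the result; a small sketch (the script path is a placeholder, java.io imports assumed):

    ProcessBuilder pb = new ProcessBuilder("/users/x34/shell_script", "in", "out"); // placeholder path
    pb.redirectErrorStream(true);            // merge stderr into stdout so errors are not lost
    Process p = pb.start();

    // drain stdout so the pipe buffer cannot fill up and block the script
    BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
    String line;
    while ((line = r.readLine()) != null) {
        System.out.println(line);
    }

    int exitCode = p.waitFor();
    if (exitCode != 0) {
        throw new IOException("shell_script failed with exit code " + exitCode);
    }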

I would appreciate any idea of how to use Hadoop Streaming, or any other way, to run the external program. I want both the input and output files in HDFS for further processing.

Please advise.

So, assuming your external program doesn't know how to recognize or read from HDFS, you will want to load the file from Java and pass it directly as input to the program:

    Path path = new Path("hdfs/path/to/input/file");
    FileSystem fs = FileSystem.get(configuration);
    FSDataInputStream fin = fs.open(path);

    ProcessBuilder pb = new ProcessBuilder("shell_script");
    Process p = pb.start();
    OutputStream os = p.getOutputStream();

    BufferedReader br = new BufferedReader(new InputStreamReader(fin));
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os));

    String line = null;
    while ((line = br.readLine()) != null) {
        writer.write(line);
        writer.newLine();          // keep the line boundaries the program expects
    }
    writer.close();                // flush and close stdin so the program sees EOF
    br.close();

The output can be done in the reverse manner: get the InputStream from the process and write it out through an FSDataOutputStream to HDFS.
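A sketch of that reverse direction, reusing fs and p from the snippet above (the output path is a placeholder):

    Path outPath = new Path("hdfs/path/to/output/file");   // placeholder path
    FSDataOutputStream fout = fs.create(outPath);
    BufferedReader pout = new BufferedReader(new InputStreamReader(p.getInputStream()));
    BufferedWriter hdfsWriter = new BufferedWriter(new OutputStreamWriter(fout));

    String outLine;
    while ((outLine = pout.readLine()) != null) {
        hdfsWriter.write(outLine);
        hdfsWriter.newLine();
    }
    hdfsWriter.close();            // flushes and closes the HDFS file
    pout.close();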

Essentially, your program with these two pieces becomes an adapter that converts HDFS to input and output back to HDFS.
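Putting it together inside a map task, such an adapter mapper might look roughly like the sketch below; the output directory, the working directory of ./vx, and the assumption that the program is available on every node (preinstalled or shipped via the distributed cache) are all placeholders/assumptions:

    import java.io.*;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: pipes each input record through ./vx and writes the program's
    // stdout to an HDFS file named after the key.
    public class ExternalProgramMapper extends Mapper<Text, Text, Text, Text> {

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {

            Configuration conf = context.getConfiguration();
            FileSystem fs = FileSystem.get(conf);

            // start the external program; working directory is a placeholder
            ProcessBuilder pb = new ProcessBuilder("./vx");
            pb.directory(new File("/users/x34/projects/externalprogram/model"));
            pb.redirectErrorStream(true);
            Process p = pb.start();

            // reconstruct the original line ("5:128,0,0") and feed it to the program's stdin
            BufferedWriter toProcess =
                    new BufferedWriter(new OutputStreamWriter(p.getOutputStream()));
            toProcess.write(key.toString() + ":" + value.toString());
            toProcess.newLine();
            toProcess.close();           // close stdin so ./vx sees EOF

            // copy the program's stdout into an HDFS file, e.g. /output/5.out (placeholder)
            Path outFile = new Path("/output/" + key.toString() + ".out");
            FSDataOutputStream out = fs.create(outFile);
            BufferedReader fromProcess =
                    new BufferedReader(new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = fromProcess.readLine()) != null) {
                out.writeBytes(line + "\n");
            }
            fromProcess.close();
            out.close();

            if (p.waitFor() != 0) {
                throw new IOException("./vx exited with non-zero status");
            }
        }
    }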

