java - The loss of key-value pairs of Map output
I wrote a Hadoop program. The input of the mapper is the text file hdfs://192.168.1.8:7000/export/hadoop-1.0.1/bin/input/paths.txt, into which the program ./readwritepaths writes paths of the local file system (identical on all computers of the cluster) in one line, separated by the character |. First, the mapper reads the number of slave nodes of the cluster from the file /usr/countcomputers.txt; it equals 2 and is read correctly, judging by the program execution. Then the contents of the input file arrive as the value on the input of the mapper, are converted to a string, split by the separator |, and the obtained paths are added to ArrayList<String> paths.
package org.myorg;

import java.io.*;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ParallelIndexation {
    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, LongWritable> {
        private final static LongWritable zero = new LongWritable(0);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            int countComputers;
            // read the number of slave nodes from the local file /usr/countcomputers.txt
            FileInputStream fstream = new FileInputStream("/usr/countcomputers.txt");
            BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
            String result = br.readLine();
            countComputers = Integer.parseInt(result);
            br.close();
            fstream.close();
            System.out.println("countComputers=" + countComputers);
            // split the input line into separate paths by the '|' separator
            ArrayList<String> paths = new ArrayList<String>();
            StringTokenizer tokenizer = new StringTokenizer(line, "|");
            while (tokenizer.hasMoreTokens()) {
                paths.add(tokenizer.nextToken());
            }
Then, as a check, I write out the values of the elements of ArrayList<String> paths to the file /export/hadoop-1.0.1/bin/readpathsfromdatabase.txt, whose contents are given below and confirm that ArrayList<String> paths was filled correctly.
PrintWriter zzz = null;
try {
    zzz = new PrintWriter(new FileOutputStream("/export/hadoop-1.0.1/bin/readpathsfromdatabase.txt"));
} catch (FileNotFoundException e) {
    System.out.println("Error");
    System.exit(0);
}
// debug output: dump the parsed paths to a local file
for (int i = 0; i < paths.size(); i++) {
    zzz.println("paths[" + i + "]=" + paths.get(i) + "\n");
}
zzz.close();
Then these paths are concatenated with the character \n, and the joined results are written into the array String[] concatPaths = new String[countComputers]. A stand-alone sketch of this splitting arithmetic follows the code below.
String[] concatPaths = new String[countComputers];
int numberOfElementConcatPaths = 0;
if (paths.size() % countComputers == 0) {
    for (int i = 0; i < countComputers; i++) {
        concatPaths[i] = paths.get(numberOfElementConcatPaths);
        numberOfElementConcatPaths += paths.size() / countComputers;
        for (int j = 1; j < paths.size() / countComputers; j++) {
            concatPaths[i] += "\n" + paths.get(i * paths.size() / countComputers + j);
        }
    }
} else {
    numberOfElementConcatPaths = 0;
    for (int i = 0; i < paths.size() % countComputers; i++) {
        concatPaths[i] = paths.get(numberOfElementConcatPaths);
        numberOfElementConcatPaths += paths.size() / countComputers + 1;
        for (int j = 1; j < paths.size() / countComputers + 1; j++) {
            concatPaths[i] += "\n" + paths.get(i * (paths.size() / countComputers + 1) + j);
        }
    }
    for (int k = paths.size() % countComputers; k < countComputers; k++) {
        concatPaths[k] = paths.get(numberOfElementConcatPaths);
        numberOfElementConcatPaths += paths.size() / countComputers;
        for (int j = 1; j < paths.size() / countComputers; j++) {
            concatPaths[k] += "\n" + paths.get((k - paths.size() % countComputers) * paths.size() / countComputers
                    + paths.size() % countComputers * (paths.size() / countComputers + 1) + j);
        }
    }
}
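For clarity, the following stand-alone sketch (not part of the job, written only to illustrate the arithmetic above) runs the even-split branch on the 6 sample paths from paths.txt with countComputers = 2; it produces the same two groups that appear later in concatpaths.txt:

import java.util.ArrayList;
import java.util.Arrays;

// Illustration only: reproduces the "paths.size() % countComputers == 0" branch above.
public class SplitSketch {
    public static void main(String[] args) {
        ArrayList<String> paths = new ArrayList<String>(Arrays.asList(
                "/export/hadoop-1.0.1/bin/error.txt",
                "/root/nexenta_search/nsindexer.conf",
                "/root/nexenta_search/traverser.c",
                "/root/nexenta_search/buf_read.c",
                "/root/nexenta_search/main.c",
                "/root/nexenta_search/avl_tree.c"));
        int countComputers = 2;                       // 6 % 2 == 0, so the first branch applies
        int perNode = paths.size() / countComputers;  // 3 paths per group
        String[] concatPaths = new String[countComputers];
        for (int i = 0; i < countComputers; i++) {
            concatPaths[i] = paths.get(i * perNode);
            for (int j = 1; j < perNode; j++) {
                concatPaths[i] += "\n" + paths.get(i * perNode + j);
            }
        }
        for (int i = 0; i < concatPaths.length; i++) {
            System.out.println("concatPaths[" + i + "]=" + concatPaths[i] + "\n");
        }
    }
}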
I write out the cells of the array String[] concatPaths to the file /export/hadoop-1.0.1/bin/concatpaths.txt to check the correctness of the concatenation. The text of this file, given below, confirms the correctness of the previous stages.
PrintWriter zzz1 = null;
try {
    zzz1 = new PrintWriter(new FileOutputStream("/export/hadoop-1.0.1/bin/concatpaths.txt"));
} catch (FileNotFoundException e) {
    System.out.println("Error");
    System.exit(0);
}
// debug output: dump the concatenated path groups to a local file
for (int i = 0; i < concatPaths.length; i++) {
    zzz1.println("concatPaths[" + i + "]=" + concatPaths[i] + "\n");
}
zzz1.close();
The cells of the array String[] concatPaths, i.e. the joined paths, are emitted as the output of the mapper.
// emit each group of joined paths as a key, with a constant zero value
for (int i = 0; i < concatPaths.length; i++) {
    word.set(concatPaths[i]);
    output.collect(word, zero);
}
In the reducer, the input keys are split into parts by the separator \n, and the resulting paths are written into ArrayList<String> processedPaths.
public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, LongWritable> {
    public native long traveser(String path);
    public native void configure(String path);

    public void reduce(Text key, Iterator<IntWritable> value,
            OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        long count = 0;
        String line = key.toString();
        // split the joined key back into individual paths by '\n'
        ArrayList<String> processedPaths = new ArrayList<String>();
        StringTokenizer tokenizer = new StringTokenizer(line, "\n");
        while (tokenizer.hasMoreTokens()) {
            processedPaths.add(tokenizer.nextToken());
        }
Further, to validate the splitting of the joined keys into separate paths, I write the elements of ArrayList<String> processedPaths out to the file /export/hadoop-1.0.1/bin/processedpaths.txt. The contents of this file turned out to be identical on both slave nodes and contained the separate paths of the second joined key, despite the fact that 2 different joined keys arrived from the output of the mapper. The most surprising thing is that, as a result of the subsequent lines of the reducer, which index the files at the received paths and insert the words of those files into a database table, only 1 file - /export/hadoop-1.0.1/bin/error.txt, which belongs to the first joined key - was indexed.
PrintWriter zzz2 = null;
try {
    zzz2 = new PrintWriter(new FileOutputStream("/export/hadoop-1.0.1/bin/processedpaths.txt"));
} catch (FileNotFoundException e) {
    System.out.println("Error");
    System.exit(0);
}
// debug output: dump the separated paths to a local file
for (int i = 0; i < processedPaths.size(); i++) {
    zzz2.println("processedPaths[" + i + "]=" + processedPaths.get(i) + "\n");
}
zzz2.close();
// index the files at the received paths via the native methods
configure("/etc/nsindexer.conf");
for (int i = 0; i < processedPaths.size(); i++) {
    count = traveser(processedPaths.get(i));
}
output.collect(key, new LongWritable(count));
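As an aside, traveser and configure are declared native, so they come from a JNI library; the class presumably also loads that library in a static initializer somewhere (not shown in the question). A minimal sketch of such an initializer, where the library name "nsindexer" is only a guess:

static {
    // Hypothetical library name: would load libnsindexer.so from java.library.path
    System.loadLibrary("nsindexer");
}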
The program was run with the following bash script:
#!/bin/bash
cd /export/hadoop-1.0.1/bin
./hadoop namenode -format
./start-all.sh
./hadoop fs -rmr hdfs://192.168.1.8:7000/export/hadoop-1.0.1/bin/output
./hadoop fs -rmr hdfs://192.168.1.8:7000/export/hadoop-1.0.1/bin/input
./hadoop fs -mkdir hdfs://192.168.1.8:7000/export/hadoop-1.0.1/input
./readwritepaths
sleep 120
./hadoop fs -put /export/hadoop-1.0.1/bin/input/paths.txt hdfs://192.168.1.8:7000/export/hadoop-1.0.1/bin/input/paths.txt 1> copyinhdfs.txt 2>&1
./hadoop jar /export/hadoop-1.0.1/bin/parallelindexation.jar org.myorg.ParallelIndexation /export/hadoop-1.0.1/bin/input /export/hadoop-1.0.1/bin/output -D mapred.map.tasks=1 -D mapred.reduce.tasks=2 1> resultofexecute.txt 2>&1
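The driver class is not shown in the question. For the -D mapred.map.tasks=1 -D mapred.reduce.tasks=2 options of the last command to be picked up at all, the driver has to pass its arguments through GenericOptionsParser, typically by implementing Tool and running through ToolRunner. Below is a minimal sketch of such a driver for the old mapred API; the class name and structure are an assumption for illustration, not the actual code of org.myorg.ParallelIndexation:

package org.myorg;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver sketch: shows how -D options reach the JobConf via ToolRunner.
public class ParallelIndexationDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already contains the -D options parsed by GenericOptionsParser
        JobConf conf = new JobConf(getConf(), ParallelIndexation.class);
        conf.setJobName("parallelindexation");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(ParallelIndexation.Map.class);
        conf.setReducerClass(ParallelIndexation.Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ParallelIndexationDriver(), args));
    }
}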
According to the last command there should be only one mapper. Despite this, the files /export/hadoop-1.0.1/bin/readpathsfromdatabase.txt and /export/hadoop-1.0.1/bin/concatpaths.txt appeared on both slave nodes. Here are the contents of the above-mentioned files:

hdfs://192.168.1.8:7000/export/hadoop-1.0.1/bin/input/paths.txt
/export/hadoop-1.0.1/bin/error.txt|/root/nexenta_search/nsindexer.conf|/root/nexenta_search/traverser.c|/root/nexenta_search/buf_read.c|/root/nexenta_search/main.c|/root/nexenta_search/avl_tree.c|
/export/hadoop-1.0.1/bin/readpathsfromdatabase.txt
paths[0]=/export/hadoop-1.0.1/bin/error.txt
paths[1]=/root/nexenta_search/nsindexer.conf
paths[2]=/root/nexenta_search/traverser.c
paths[3]=/root/nexenta_search/buf_read.c
paths[4]=/root/nexenta_search/main.c
paths[5]=/root/nexenta_search/avl_tree.c
/export/hadoop-1.0.1/bin/concatpaths.txt
concatPaths[0]=/export/hadoop-1.0.1/bin/error.txt
/root/nexenta_search/nsindexer.conf
/root/nexenta_search/traverser.c

concatPaths[1]=/root/nexenta_search/buf_read.c
/root/nexenta_search/main.c
/root/nexenta_search/avl_tree.c
/export/hadoop-1.0.1/bin/processedpaths.txt
processedPaths[0]=/root/nexenta_search/buf_read.c
processedPaths[1]=/root/nexenta_search/main.c
processedPaths[2]=/root/nexenta_search/avl_tree.c
In connection with this I want to ask 3 questions:

- Why are the contents of the /export/hadoop-1.0.1/bin/processedpaths.txt files identical on both nodes and equal to the one given here?
- Why was only one file - /export/hadoop-1.0.1/bin/error.txt - indexed as a result?
- Why was the mapper executed on both slave nodes?