hadoop - Pig Join is returning no results
I have been stuck on this problem for over twelve hours now. I have a Pig script running on Amazon Web Services. Currently, I am running the script in interactive mode. I am trying to get averages on a large data set of climate readings from weather stations; however, this data doesn't have country or state information, so it has to be joined with a table that does.
State table:
719990 99999 lillooet cn ca bc wkf +50683 -121933 +02780
719994 99999 sedco 710 cn ca cwqj +46500 -048500 +00000
720000 99999 bogus american us -99999 -999999 -99999
720001 99999 peason ridge/range us la k02r +31400 -093283 +01410
720002 99999 hallock(aws) us mn k03y +48783 -096950 +02500
720003 99999 deer park(aws) us wa k07s +47967 -117433 +06720
720004 99999 mason us mi k09g +42567 -084417 +02800
720005 99999 gastonia us nc k0a6 +35200 -081150 +02440
Climate table: (I realize this sample doesn't contain anything that would satisfy the join condition, but the full data set does.)
stn--- wban yearmoda temp dewp slp stp visib wdsp mxspd gust max min prcp sndp frshtt
010010 99999 20090101 23.3 24 15.6 24 1033.2 24 1032.0 24 13.5 6 9.6 24 17.5 999.9 27.9* 16.7 0.00g 999.9 001000
010010 99999 20090102 27.3 24 20.5 24 1026.1 24 1024.9 24 13.7 5 14.6 24 23.3 999.9 28.9 25.3* 0.00g 999.9 001000
010010 99999 20090103 25.2 24 18.4 24 1028.3 24 1027.1 24 15.5 6 4.2 24 9.7 999.9 26.2* 23.9* 0.00g 999.9 001000
010010 99999 20090104 27.7 24 23.2 24 1019.3 24 1018.1 24 6.7 6 8.6 24 13.6 999.9 29.8 24.8 0.00g 999.9 011000
010010 99999 20090105 19.3 24 13.0 24 1015.5 24 1014.3 24 5.6 6 17.5 24 25.3 999.9 26.2* 10.2* 0.05g 999.9 001000
010010 99999 20090106 12.9 24 2.9 24 1019.6 24 1018.3 24 8.2 6 15.5 24 25.3 999.9 19.0* 8.8 0.02g 999.9 001000
010010 99999 20090107 26.2 23 20.7 23 998.6 23 997.4 23 6.6 6 12.1 22 21.4 999.9 31.5 19.2* 0.00g 999.9 011000
010010 99999 20090108 21.5 24 15.2 24 995.3 24 994.1 24 12.4 5 12.8 24 25.3 999.9 24.6* 19.2* 0.05g 999.9 011000
010010 99999 20090109 27.5 23 24.5 23 982.5 23 981.3 23 7.9 5 20.2 22 33.0 999.9 34.2 20.1* 0.00g 999.9 011000
010010 99999 20090110 22.5 23 16.7 23 977.2 23 976.1 23 11.9 6 15.5 23 35.0 999.9 28.9* 17.2 0.09g 999.9 000000
I load in the climate data using TextLoader, apply a regular expression to obtain the fields, and filter out the nulls from the result set. I then do the same with the state data, but I filter it for the country being US.
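As a rough way to check this parse-then-filter step outside of Pig, the sketch below applies a Python translation of the climate pattern to the header line and the first data row of the sample above. It assumes Python's `re` module treats these constructs the same way Java's regex engine does; `parse_climate` and the other names are made up for this example and are not part of the script.

```python
import re

# Python translation of the climate pattern from the post (single backslashes).
CLIMATE_PATTERN = re.compile(
    r'^(\d{6})\s+(\d{5})\s+(\d{4})(\d{2})(\d{2})\s+(\d{1,3}\.\d{1})'
)

def parse_climate(line):
    """Return (station, wban, year, month, day, temp), or None if the line does not match."""
    m = CLIMATE_PATTERN.search(line)
    if m is None:
        return None
    station, wban, year, month, day, temp = m.groups()
    return (int(station), int(wban), int(year), int(month), int(day), float(temp))

lines = [
    "stn--- wban yearmoda temp dewp slp stp visib wdsp mxspd gust max min prcp sndp frshtt",
    "010010 99999 20090101 23.3 24 15.6 24 1033.2 24 1032.0 24 13.5 6 9.6 24 17.5 999.9 27.9* 16.7 0.00g 999.9 001000",
]

# Drop lines the pattern could not parse, mirroring the null filter in the script.
records = [r for r in map(parse_climate, lines) if r is not None]
print(records)  # [(10010, 99999, 2009, 1, 1, 23.3)]
```

The header line does not start with six digits, so it is dropped, and the leading zero of the station id disappears once it is cast to an integer.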
The bags have the following schema:
climate_remove_empty: {station: int,wban: int,year: int,month: int,day: int,temp: double}
states_filter_us: {station: int,wban: int,name: chararray,wmo: chararray,fips: chararray,state: chararray}
I need to perform a join operation on (station, wban) so I can get a resulting bag with the station, wban, year, month, and temps. When I perform a dump on the resulting bag, it says that it was successful; however, the dump returns 0 results. This is the output.
hadoopversion pigversion userid startedat finishedat features
1.0.3 0.9.2-amzn hadoop 2013-05-03 00:10:51 2013-05-03 00:12:42 hash_join,filter

success!

job stats (time in seconds):
jobid maps reduces maxmaptime minmaptime avgmaptime maxreducetime minreducetime avgreducetime alias feature outputs
job_201305030005_0001 2 1 36 15 25 33 33 33 climate,climate_remove_null,raw_climate,raw_states,states,states_filter_us,state_climate_join hash_join hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203,

input(s):
read 30587 records from: "hiddenbucket"
read 21027 records from: "hiddenbucket"

output(s):
stored 0 records in: "hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203"

counters:
total records written : 0
total bytes written : 0
spillable memory manager spill count : 0
total bags proactively spilled: 0
total records proactively spilled: 0
I have no idea why this contains 0 results. My data extraction seems correct, and the job is successful. This leads me to believe that the join condition is never satisfied. I know the input files have some data that should satisfy the join condition, but it returns absolutely nothing.
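One way to test that suspicion is to compare the join keys on each side directly. Here is a small Python sketch of the idea, using the station ids copied from the two samples above; `shared_keys` is a made-up helper for this example, not something from the script.

```python
def shared_keys(left_keys, right_keys):
    """Return the join keys that appear on both sides of an inner join."""
    return set(left_keys) & set(right_keys)

# Station ids taken from the sample rows shown earlier in the post.
climate_stations = [10010]
state_stations = [719990, 719994, 720000, 720001, 720002, 720003, 720004, 720005]

# An empty result means an inner join on station cannot produce any rows.
print(shared_keys(climate_stations, state_stations))
```

For these samples the intersection is empty, which agrees with the note that the excerpts do not satisfy the join condition. Running the same comparison on keys pulled from the parsed relations would show whether the full data set overlaps at all.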
The only thing that looks suspicious is a warning that states: Encountered Warning ACCESSING_NON_EXISTENT_FIELD 26001 time(s).
I'm not exactly sure where to go from here. Since the job isn't failing, I can't see any errors or anything in debug.
I'm not sure if these mean anything, but here are other things that stand out: When I try to illustrate state_climate_join, I get a NullPointerException - ERROR 2997: Encountered IOException. Exception : null
When I try to illustrate states, I get a java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
Here is my full code:
--Piggy Bank functions
REGISTER file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();

--Load climate data
raw_climate = LOAD 'hiddenbucket' USING TextLoader AS (line:chararray);
raw_states = LOAD 'hiddenbucket' USING TextLoader AS (line:chararray);

climate = FOREACH raw_climate GENERATE
    FLATTEN((tuple(int,int,int,int,int,double))
        EXTRACT(line, '^(\\d{6})\\s+(\\d{5})\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d{1,3}\\.\\d{1})'))
    AS (station: int, wban: int, year: int, month: int, day: int, temp: double);

states = FOREACH raw_states GENERATE
    FLATTEN((tuple(int,int,chararray,chararray,chararray,chararray))
        EXTRACT(line, '^(\\d{6})\\s+(\\d{5})\\s+(\\S+)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})'))
    AS (station: int, wban: int, name: chararray, wmo: chararray, fips: chararray, state: chararray);

climate_remove_null = FILTER climate BY station IS NOT NULL;
states_filter_us = FILTER states BY (fips == 'us');
state_climate_join = JOIN climate_remove_null BY (station), states_filter_us BY (station);
Thanks in advance. I'm at a loss here.
--EDIT-- I finally got it to work! My regular expression for parsing the state data was invalid.
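The corrected expression is not shown here. As a rough illustration of how a parsing regex can silently empty a join, the sketch below runs a Python translation of the state pattern over the sample state rows exactly as they appear above. It rests on several assumptions: that the name group was `(\S+)`, that Python's `re` behaves like Java's regex engine for these constructs, and that EXTRACT does a find-style match rather than a whole-line match. `parse_state` and the variable names are made up for this example.

```python
import re

# Python translation of the state pattern (single backslashes).
# Assumption: the name group is (\S+), a single run of non-whitespace.
STATE_PATTERN = re.compile(
    r'^(\d{6})\s+(\d{5})\s+(\S+)\s+(\w{2})\s+(\w{2})\s+(\w{2})'
)

SAMPLE_ROWS = [
    "719990 99999 lillooet cn ca bc wkf +50683 -121933 +02780",
    "719994 99999 sedco 710 cn ca cwqj +46500 -048500 +00000",
    "720000 99999 bogus american us -99999 -999999 -99999",
    "720001 99999 peason ridge/range us la k02r +31400 -093283 +01410",
    "720002 99999 hallock(aws) us mn k03y +48783 -096950 +02500",
    "720003 99999 deer park(aws) us wa k07s +47967 -117433 +06720",
    "720004 99999 mason us mi k09g +42567 -084417 +02800",
    "720005 99999 gastonia us nc k0a6 +35200 -081150 +02440",
]

def parse_state(line):
    """Return the six captured fields, or None when the pattern does not match."""
    m = STATE_PATTERN.search(line)
    return m.groups() if m else None

parsed = [parse_state(row) for row in SAMPLE_ROWS]

# Rows the pattern could not parse at all.
unmatched = [row for row, fields in zip(SAMPLE_ROWS, parsed) if fields is None]

# Rows that would survive the fips == 'us' filter (fips is the fifth group).
us_rows = [fields for fields in parsed if fields is not None and fields[4] == 'us']

print(len(unmatched), "rows did not match")
print(len(us_rows), "rows have fips == 'us'")
```

On these sample rows, every station name that contains a space fails to match, and in the rows that do match the captured columns do not line up so that fips equals 'us'. The filter therefore keeps nothing, and a join against an empty relation returns no records even though the job itself reports success.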