hadoop - Pig Join is returning no results


I have been stuck on this problem for over twelve hours now. I have a Pig script running on Amazon Web Services. Currently, I am running the script in interactive mode. I am trying to get averages on a large data set of climate readings from weather stations; however, this data doesn't have country or state information, so it has to be joined with a table that does.

State table:

719990 99999 lillooet                      cn ca bc wkf   +50683 -121933 +02780
719994 99999 sedco 710                     cn ca    cwqj  +46500 -048500 +00000
720000 99999 bogus american                us          -99999 -999999 -99999
720001 99999 peason ridge/range            us la k02r  +31400 -093283 +01410
720002 99999 hallock(aws)                  us mn k03y  +48783 -096950 +02500
720003 99999 deer park(aws)                us wa k07s  +47967 -117433 +06720
720004 99999 mason                         us mi k09g  +42567 -084417 +02800
720005 99999 gastonia                      us nc k0a6  +35200 -081150 +02440

Climate table: (I realize this sample doesn't contain anything that satisfies the join condition, but the full data set does.)

stn--- wban   yearmoda    temp       dewp      slp        stp       visib      wdsp     mxspd   gust    max     min   prcp   sndp   frshtt
010010 99999  20090101    23.3 24    15.6 24  1033.2 24  1032.0 24   13.5  6    9.6 24   17.5  999.9    27.9*   16.7   0.00g 999.9  001000
010010 99999  20090102    27.3 24    20.5 24  1026.1 24  1024.9 24   13.7  5   14.6 24   23.3  999.9    28.9    25.3*  0.00g 999.9  001000
010010 99999  20090103    25.2 24    18.4 24  1028.3 24  1027.1 24   15.5  6    4.2 24    9.7  999.9    26.2*   23.9*  0.00g 999.9  001000
010010 99999  20090104    27.7 24    23.2 24  1019.3 24  1018.1 24    6.7  6    8.6 24   13.6  999.9    29.8    24.8   0.00g 999.9  011000
010010 99999  20090105    19.3 24    13.0 24  1015.5 24  1014.3 24    5.6  6   17.5 24   25.3  999.9    26.2*   10.2*  0.05g 999.9  001000
010010 99999  20090106    12.9 24     2.9 24  1019.6 24  1018.3 24    8.2  6   15.5 24   25.3  999.9    19.0*    8.8   0.02g 999.9  001000
010010 99999  20090107    26.2 23    20.7 23   998.6 23   997.4 23    6.6  6   12.1 22   21.4  999.9    31.5    19.2*  0.00g 999.9  011000
010010 99999  20090108    21.5 24    15.2 24   995.3 24   994.1 24   12.4  5   12.8 24   25.3  999.9    24.6*   19.2*  0.05g 999.9  011000
010010 99999  20090109    27.5 23    24.5 23   982.5 23   981.3 23    7.9  5   20.2 22   33.0  999.9    34.2    20.1*  0.00g 999.9  011000
010010 99999  20090110    22.5 23    16.7 23   977.2 23   976.1 23   11.9  6   15.5 23   35.0  999.9    28.9*   17.2   0.09g 999.9  000000

I load in the climate data using TextLoader, apply a regular expression to obtain the fields, and filter out the nulls from the result set. I then do the same with the state data, but I filter it for the country being US.
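The load, extract, and filter steps can be tried outside Pig on a couple of sample lines. The sketch below is only an approximation: it uses Python's `re` module as a stand-in for the Java regex engine behind the piggybank function, and the names `parse_climate` and `sample_lines` are made up for illustration. The pattern is the climate pattern from the script with Pig's doubled backslashes undone.

```python
import re

# Climate pattern from the script, with Pig's doubled backslashes undone.
CLIMATE_PATTERN = re.compile(
    r"^(\d{6})\s+(\d{5})\s+(\d{4})(\d{2})(\d{2})\s+(\d{1,3}\.\d{1})"
)


def parse_climate(line):
    """Return (station, wban, year, month, day, temp), or None if the line does not match."""
    m = CLIMATE_PATTERN.match(line)
    if m is None:
        return None
    station, wban, year, month, day, temp = m.groups()
    return (int(station), int(wban), int(year), int(month), int(day), float(temp))


# Two lines taken from the climate sample: the header row and the first reading
# (both shortened on the right, which does not affect the match).
sample_lines = [
    "stn--- wban   yearmoda    temp       dewp",
    "010010 99999  20090101    23.3 24    15.6 24  1033.2 24",
]

parsed = [parse_climate(line) for line in sample_lines]

# Rough equivalent of filtering out the nulls: keep only lines that parsed.
climate_remove_null = [row for row in parsed if row is not None]
print(climate_remove_null)
```

Note that this pattern has no provision for a leading minus sign, so a reading with a negative temperature would not match and would be filtered out along with the header.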

The bags have the following schemas:

climate_remove_empty: {station: int,wban: int,year: int,month: int,day: int,temp: double}
states_filter_us: {station: int,wban: int,name: chararray,wmo: chararray,fips: chararray,state: chararray}

I need to perform a join operation on (station, wban) so that I can get a resulting bag with station, wban, year, month, and temps. When I perform a dump on the resulting bag, it says that it was successful; however, the dump returns 0 results. This is the output.

hadoopversion   pigversion      userid  startedat       finishedat      features
1.0.3   0.9.2-amzn      hadoop  2013-05-03 00:10:51     2013-05-03 00:12:42     hash_join,filter

success!

job stats (time in seconds):
jobid   maps    reduces maxmaptime      minmaptime      avgmaptime      maxreducetime   minreducetime   avgreducetime   alias   feature outputs
job_201305030005_0001   2       1       36      15      25      33      33      33      climate,climate_remove_null,raw_climate,raw_states,states,states_filter_us,state_climate_join   hash_join       hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203,

input(s):
read 30587 records from: "hiddenbucket"
read 21027 records from: "hiddenbucket"

output(s):
stored 0 records in: "hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203"

counters:
total records written : 0
total bytes written : 0
spillable memory manager spill count : 0
total bags proactively spilled: 0
total records proactively spilled: 0

I have no idea why this contains 0 results. My data extraction seems correct, and the job is successful. That leads me to believe that the join condition is never satisfied. I know the input files have some data that should satisfy the join condition, but it returns absolutely nothing.
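One way to test the suspicion that the join condition is never satisfied is to compare the distinct join keys on each side before joining. The sketch below does this in Python on a few tuples taken from the samples in the post; `join_keys` and the row lists are hypothetical names used only for illustration. Rows with a missing key are skipped, on the assumption that they would not take part in an inner join anyway.

```python
def join_keys(rows):
    """Collect the distinct (station, wban) pairs, skipping rows with a missing key."""
    return {
        (row[0], row[1])
        for row in rows
        if row[0] is not None and row[1] is not None
    }


# (station, wban, ...) tuples built from the sample rows shown in the post.
climate_rows = [
    (10010, 99999, 2009, 1, 1, 23.3),
    (10010, 99999, 2009, 1, 2, 27.3),
]
state_rows = [
    (720001, 99999, "peason ridge/range"),
    (720002, 99999, "hallock(aws)"),
]

climate_keys = join_keys(climate_rows)
state_keys = join_keys(state_rows)
shared = climate_keys & state_keys

# If one side has no keys at all, the parsing step is the likely culprit;
# if both sides have keys but none are shared, the data or key choice is.
print(len(climate_keys), len(state_keys), len(shared))
```

For these sample rows the intersection is empty, which matches the note that the samples themselves do not satisfy the join condition.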

The only thing that looks suspicious is a warning that states: Encountered Warning ACCESSING_NON_EXISTENT_FIELD 26001 time(s).

I'm not sure where to go from here. Since the job isn't failing, I can't see any errors or anything else useful in debug.

I'm not sure if these mean anything, but here are some other things that stand out. When I try to illustrate state_climate_join, I get a NullPointerException - ERROR 2997: Encountered IOException. Exception : null

When I try to illustrate states, I get java.lang.IndexOutOfBoundsException: Index: 1, Size: 1

Here is the full code:

-- piggy bank functions
REGISTER file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();

-- load climate data and state data, one chararray per input line
raw_climate = LOAD 'hiddenbucket' USING TextLoader AS (line:chararray);
raw_states = LOAD 'hiddenbucket' USING TextLoader AS (line:chararray);

-- split each climate line into typed fields
climate =
  FOREACH raw_climate
  GENERATE
    FLATTEN ((tuple(int,int,int,int,int,double))
      EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d{1,3}\\.\\d{1})')
    )
    AS (
      station: int,
      wban: int,
      year: int,
      month: int,
      day: int,
      temp: double
    );

-- split each state line into typed fields
states =
  FOREACH raw_states
  GENERATE
    FLATTEN ((tuple(int,int,chararray,chararray,chararray,chararray))
      EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\s+)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})')
    )
    AS (
      station: int,
      wban: int,
      name: chararray,
      wmo: chararray,
      fips: chararray,
      state: chararray
    );

climate_remove_null = FILTER climate BY station IS NOT NULL;
states_filter_us = FILTER states BY (fips == 'us');
state_climate_join = JOIN climate_remove_null BY (station), states_filter_us BY (station);

Thanks in advance. I'm at a loss here.

--edit-- I finally got it to work! The regular expression parsing the state_data was invalid.
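The edit does not say what was wrong with the pattern or how it was corrected. A quick way to catch this kind of problem is to run the pattern against a handful of sample rows before using it in the script. The sketch below does that in Python, with `re` standing in for the Java regex engine that Pig uses; the pattern is copied as it appears in the script above (doubled backslashes undone), and `unmatched` is a hypothetical helper name.

```python
import re

# States pattern as it appears in the script, with Pig's doubled backslashes undone.
STATES_PATTERN = re.compile(
    r"^(\d{6})\s+(\d{5})\s+(\s+)\s+(\w{2})\s+(\w{2})\s+(\w{2})"
)

# A few rows copied from the state table sample.
sample_rows = [
    "719990 99999 lillooet                      cn ca bc wkf   +50683 -121933 +02780",
    "719994 99999 sedco 710                     cn ca    cwqj  +46500 -048500 +00000",
    "720001 99999 peason ridge/range            us la k02r  +31400 -093283 +01410",
]


def unmatched(pattern, rows):
    """Return the rows that the pattern fails to match."""
    return [row for row in rows if pattern.match(row) is None]


failures = unmatched(STATES_PATTERN, sample_rows)
print(f"{len(failures)} of {len(sample_rows)} sample rows do not match")
```

Any row reported here produces no usable fields on the states side, so nothing from it can reach the join.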

