linux - Grouping common elements using awk
The following table illustrates a brief snapshot of the data I wish to manipulate. I am looking for an awk script that will group similar elements into one group. E.g., if you look at the table below:
- The numbers (1,2,3,4,6) should all belong to one group. So row1, row2, row4 and row8 will be in group "1".
- The number 9 is unique and does not have any common elements, so it will reside alone in a separate group, say group 2.
- Similarly, the numbers 5 and 7 will reside in one group, say group 3, and so on...
The file:
heading1   heading2   numberlist   group
name1      text       1,2,3        1
name2      text       2            1
name3      text       9            2
name4      text       1,4          1
name5      text       5,7          3
name6      text       7            3
name7      text       8            4
name8      text       6,2          1
I was searching for queries similar to mine and found this link: grouping lists with common elements. The solution there is in C++ and not in awk, which is my primary requirement.
Incidentally, I also found an awk solution to a related query, but it does not handle comma-separated values: awk script grouping array.
Only numberlist, i.e. $3, is under consideration for grouping.
This problem seemed the same as one of my own problems, and I had used one column in my example to solve that problem :) So...
[[bash_prompt$]]$ cat log ; echo "########"; \
> cat test.sh ;echo "########"; awk -f test.sh log
heading1 heading2 numberlist group
name1 text 1,2,3
name2 text 2
name3 text 9
name4 text 1,4
name5 text 5,7
name6 text 7
name7 text 8
name8 text 6,2
########
/^name/{
    i=0;
    j=0;
    split($3,a,",");
    for(var in a) {
        for(var1 in q) {
            split(q[var1],r,",");
            for(var2 in r) {
                if(r[var2] == a[var]) {
                    i=1;
                    j=((var1+1));
                }
            }
        }
    }
    if(i == 0) {
        q[length(q)] = $3;
        j=length(q);
    }
    print $1 "\t\t" $2 " \t\t" $3 "\t\t" j;
}
########
name1    text    1,2,3    1
name2    text    2        1
name3    text    9        2
name4    text    1,4      1
name5    text    5,7      3
name6    text    7        3
name7    text    8        4
name8    text    6,2      1
[[bash_prompt$]]$
update:
split splits its first argument on the delimiter passed as the third argument and puts the pieces into the array named by the second argument. Here the main array is q, which holds the members of each group. Conceptually it is an array of arrays: the index of an element is the group id, and the element itself is the collection of members of that group, stored as a comma-separated string. So q[0]="1,2,3" indicates that the 0th group contains the members 1, 2 and 3.

In the awk script, the first line processed is the first one that starts with name (/^name/). Its 3rd field (1,2,3) is broken down into the array a. For each element in a, we go through each group stored in q (for(var1 in q)) and, inside each group, split it into a temporary array r (split(q[var1],r,",")), i.e. "1,2,3" is split into the array r. Each element in r is then compared with the element in a. If a match is found, that group's index becomes the index of the row (array indexes start at 0 but groups start at 1, so ((var1+1)) is used). If no match is found, a new group is added to q, and the last index + 1, i.e. the length of the array, becomes the index of the row.
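As a quick illustration of the split call described above, here is a minimal sketch. The helper name show_split is hypothetical and not part of the original answer; it only prints the pieces that split produces for a comma-separated list, using the count that split returns.

```shell
# Hypothetical helper (not from the original answer): show what split() yields.
show_split() {
    printf '%s\n' "$1" | awk '{
        n = split($0, a, ",")      # n is the number of pieces; a[1]..a[n] hold them
        for (i = 1; i <= n; i++)
            print i ": " a[i]
    }'
}

show_split "1,2,3"
```

Note that split numbers the pieces from 1, whereas the script above starts its group array q at index 0, which is why it adds 1 when turning a q index into a group number.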
update:
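/^name/{
    j=0;
    split($3,a,",");
    # reuse the group of an element that was already seen
    for(var in a) {
        if(q[a[var]] != 0) {
            j=q[a[var]];
            i=1;
            break;
        }
    }
    # otherwise start a new group
    j = (j == 0) ? ++k : j;
    # tag the elements that have no group yet
    for(var1 in a) {
        if(q[a[var1]] == 0) {
            q[a[var1]] = j;
        }
    }
    print $1 "\t\t" $2 " \t\t" $3 "\t\t" j;
}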
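To try the idea of the script above against the sample rows, here is a minimal sketch of a variant. It is not the original script: the wrapper name group_rows is an assumption, it prints only three columns, it tests membership with awk's in operator instead of comparing against 0, and it walks the pieces by numeric index because awk leaves the order of for (var in a) unspecified.

```shell
# Hypothetical wrapper (the name group_rows is an assumption, not from the answer).
group_rows() {
    awk '/^name/ {
        j = 0
        n = split($3, a, ",")
        # Walk a[1]..a[n] in order so that "first element wins" is deterministic.
        for (i = 1; i <= n; i++)
            if (a[i] in q) { j = q[a[i]]; break }
        if (j == 0) j = ++k                  # no known element: open a new group
        for (i = 1; i <= n; i++)
            if (!(a[i] in q)) q[a[i]] = j    # claim elements that have no group yet
        print $1, $3, j
    }'
}

printf '%s\n' 'name1 text 1,2,3' 'name2 text 2' 'name3 text 9' 'name4 text 1,4' \
              'name5 text 5,7' 'name6 text 7' 'name7 text 8' 'name8 text 6,2' | group_rows
```

On the sample rows this should give the groups 1, 1, 2, 1, 3, 3, 4, 1, matching the group column in the question. Like the script above, it does not merge two existing groups when a later row links them.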
update:
Basically, awk has associative arrays, and each element can be accessed by a string key. The earlier approach was to store each group in an array whose key is the index of the group. So when reading a column, we had to read each group, split the group into individual elements, and compare each of those elements with each element of the column. Instead of storing the groups, what if we store the elements in an array where the key is the element itself and the value at that key is the index of the group the element belongs to?

Then, when we read a column, we split the column into individual elements (split($3,a,",");) and check for each element whether the array holds a group index under that element's key, in if(q[a[var]] != 0) (in awk, if an element is not there, it is created with an empty default value that compares equal to 0, hence the check q[a[var]] != 0). If an element is found, take that element's group index as the index of the column and break. Otherwise j remains 0. If j remains 0, ++k gives the next new group index.

Now we have found the group index for the column's elements. We need to carry that index over to all the elements that are not yet part of any other group (there are cases where multiple elements in the same column belong to different groups; here we take a first-come, first-served approach and do not overwrite the group index of elements that already belong to a group). So for each element in the column (for(var1 in a)), if it does not belong to a group (if(q[a[var1]] == 0)), we give it the group index with q[a[var1]] = j;.

Here the accesses are linear because we access the array directly, using the elements as keys. There is no breaking down of a group again and again for every element, and hence it takes less time. The first approach was based on one of my own problems (as mentioned in the first line), which had more complex processing but a shorter data set. This one required simpler, more straightforward logic.
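The check against 0 described above depends on how awk treats array elements that were never assigned. The following minimal sketch shows that behaviour in isolation; the function name demo_default and the keys 7 and 9 are made up for illustration and are not part of the original answer.

```shell
# Hypothetical demo (names and keys are assumptions, not from the answer).
demo_default() {
    awk 'BEGIN {
        # The "in" operator only tests membership; it does not create q["7"].
        s = ("7" in q) ? "7 present" : "7 absent"
        print s
        # Merely reading q["9"] creates it with an empty value that equals 0.
        if (q["9"] == 0) print "q[9] compares equal to 0"
        s = ("9" in q) ? "9 present" : "9 absent"
        print s
    }'
}

demo_default
```

For the grouping script, this means the q[...] != 0 test leaves empty entries behind for elements it has only looked up. That is harmless here, because such entries still compare equal to 0 in the later q[...] == 0 check, and group indexes start at 1.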