python - search sequence in genome with mismatches -
i have fastq file more 100 million reads in , genome sequence of 10000 in length
i want take out sequences fastq file , search in genome sequence allowing 3 mismatches
i tried in way using awk got sequences fastq file:
1.fq(few lines)
@dh1dqqn1:269:c1ukcacxx:1:1101:1207:2171 1:n:0:ttaggc natccccatcctctgcttgcttttcgggatatgttgtaggattctcagc
+
1=adbddhd;f>gf@ffefgggiaeei?d9ddhhigaaf:bg39?bb
@dh1dqqn1:269:c1ukcacxx:1:1101:1095:2217 1:n:0:ttaggc taggatttcaaatgggtcgaggtggtccgttaggtataggggcaacagg
+
??aabdd4c:dddi+c:c3@:c):1?*):?)?################
$ awk 'nr%4==2' 1.fq
natccccatcctctgcttgcttttcgggatatgttgtaggattctcagc taggatttcaaatgggtcgaggtggtccgttaggtataggggcaacagg
i have sequences in file,now want take each line of sequence , search in genome sequence allowing 3 mismatches , if finds print sequences
example:
genome sequence file:
ggggaggaatatgatttacagtttatttttcaactgtgcaaaataaccttaactgcagacgttatgacatacatacattctatgaattccactattttggaggactggaattttggtctacaacctcccccaggaggcacactagaagatacttataggtttgtaacccaggcaattgcttgtcaaaaacataca
search sequence file:
ggggaggaatatgat
ggggaggaatatgaa
ggggaggaatatgcc
tcaaaaacatagg
tcaaaaacatggg
output file:
ggggaggaatatgat 0 # 0 mismatch exact sequence
ggggaggaatatgaa 1 # 1 mismatch
ggggaggaatatgcc 2 # 2 mismatch
tcaaaaacatagg 2 # 2 mismatch
tcaaaaacatggg 3 # 3 mismatch
something like?
use 5.012; use strict; use warnings; use string::approx qw(aslice); use file::slurp; use data::dumper; $genseq = "gseq.txt"; #the long sequence $_ = read_file($genseq); #read small patterns stdin while(my $patt = <>) { chomp $patt; $len = length($patt); my($index, $size, $distance) = aslice($patt, ["3d0s3", "minimal_distance"]); "$patt matched approx. @ $index mismatch $distance" if $distance <= 3; }
for input produces:
ggggaggaatatgat matched approx. @ 0 mismatch 0 ggggaggaatatgaa matched approx. @ 0 mismatch 1 ggggaggaatatgcc matched approx. @ 0 mismatch 2 tcaaaaacatagg matched approx. @ 179 mismatch 2 tcaaaaacatggg matched approx. @ 179 mismatch 3
honestly, haven't idea how work 10000 chars long genseq...
Comments
Post a Comment