Quantcast
Viewing all articles
Browse latest Browse all 41826

Errors in a bioperl script, please help me to have a check, thanks in advance!

Dear friends:

I have a blast report, and want to extract following information for each result(query): the Query_name, hit_number, name and description of the hit(HSP) with the highest identity.

my bioperl script is:

#!/usr/bin/perl -w
use strict;
use warnings;
use Bio::SearchIO;

if (@ARGV != 2){die "Usage: $0 <BLAST-report-file> <output-file>\n"};

my ($infile,$outfile) = @ARGV;
print "Parsing the BLAST result ...";
my $blast = new Bio::SearchIO(
-format => 'blast',
-file   => $infile);
open (OUT,">$outfile") or die "Cannot open $outfile: $!";

print OUT "query\tNumber of hits\tGreatest identity %\tAccession (identity %)\tDescription (identity %)\n";

while (my $result = $blast->next_result){
   print OUT $result->query_name . "\t";
   print OUT $result->num_hits. "\t";
        # don't need the '&':   if (my $hit = sort_hit($result))   
        if (my $hit = &sort_hit($result)){

                         if (my $hsp = $hit->hsp){
                                        print OUT $hsp->percent_identity. "\t"; 
                                        print OUT $hit->accession. "\t";
                                        print OUT $hit->description. "\n";
                            }                   
           }
    }
 close OUT;
 print " DONE!!!\n";
 # the problem starts here:
 # no matter what, this function will always return the first hit
 sub sort_hit{
                  my $result = shift;
                  my @hits;

                  # The following while loop is very clumsy, ResultI has a hits method:
                  # better:  my @hits = $result->hits();
                  while (my $hit = $result->next_hit) {
                        push @hits, $hit;
                        }
                  # here you sort the @hits, assign the result and do not use the result ever again.
                  # sorting the array has no effect on the original hits, that is also why I recommended to
                  # use $result->sort_hits instead
                  # minor: you don't need to multiply percent_identity by 100
                  my @hit_sort = sort { $b-> hsp ->percent_identity * 100 <=> $a-> hsp ->percent_identity * 100 } @hits;

                  # the following code is possibly meant to return the first element of the array, but
                  # is totally without use. You do not need to iterate again.
                  # Correct would be: 
                  return shift @hit_sort; # you do know shift, unshift, push and pop for array operations?
                  # rest of code doesn't do anything sensible:
                  $result->rewind;
                  my $flag=0;
                  my $temp_hit;
                  # At the end of this while loop, $temp_hit will always be the first hit!

                  while (my $hit=$result->next_hit){
                        if($flag==0){ # this is false already in the second iteration
                           $temp_hit=$hit;
                     $flag++;
                           }
                  return $temp_hit
                        }
         }

error information:

-------------EXCEPTION: Bio::Root::Exception -------------
MSG: Can't get HSPs: data not collected.
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/xiongtl/Bioperl/dag/lib/perl5/Bio/Root/Root.pm:368
STACK: Bio::Search::Hit::GenericHit::hsp /home/xiongtl/Bioperl/dag/lib/perl5/Bio/Search/Hit/GenericHit.pm:688
STACK: main::sort_hit xtl_new.pl:37
STACK: xtl_new.pl:20
-----------------------------------------------------------
Parsing the BLAST result ...

please give me some clues, all your words are welcome!

NOTE: My bioperl version is 1.6.1, and I use blastx 2.2.27+. My blast report(-outfmt 0) have many queries and each query have many hits(against nr database), so each hit maybe also have several HSPs.

Some more information: I tried the $hit->next_hsp before, but running the script still informed me the same error. However, today I have just reruned the script, both $hit->next_hsp and $hit->hsp are ok! They both outputed the first hit for each result. I don't know why. So I tried different sort rules as follows:

# my @hit_sort = sort { $b-> hsp ->hsp_length <=> $a-> hsp ->hsp_length } @hits;
# my @hit_sort = sort { $b-> hsp ->percent_identity * 100 <=> $a-> hsp ->percent_identity * 100 } @hits;
# my @hit_sort = sort { $b-> significance <=> $a-> significance } @hits;
# my @hit_sort = sort { $a-> accession cmp $b-> accession } @hits;

but different sort functions still generate the same output (still the first hit)! It seems that the sub-function does not work at all! oh, it really troubles me...

Here is my data structure:

Query= scaffold28:2776872-2777385(-)

Length=513
                                                                     Score     E
Sequences producing significant alignments:                          (Bits)  Value

gb|AEA49877.1|  RNA-dependent RNA polymerase [Drosophila ananassa...  90.5    3e-30
...
...
...
gb|AAA48442.1|  L protein [Vesicular stomatitis virus]                 115    2e-26


>gb|AEA49877.1| RNA-dependent RNA polymerase [Drosophila ananassae sigmavirus]
Length=615

Score = 90.5 bits (223),  Expect(2) = 3e-30, Method: Composition-based stats.
Identities = 40/57 (70%), Positives = 43/57 (75%), Gaps = 0/57 (0%)
Frame = +2

Query  173  ISIANHIDYEKWNAHQRGTGNDPCFKVMGQFLGLPNSILRTNEFFSKITFYYNQRPD  343
            I+IANHIDYEKWN HQRG  N P FKVMGQFLG PN I RT+EFF K   YY+ R D
Sbjct  554  ITIANHIDYEKWNNHQRGDANGPVFKVMGQFLGYPNLIARTHEFFEKSLIYYSDRAD  610


Score = 67.0 bits (162),  Expect(2) = 3e-30, Method: Compositional matrix adjust.
Identities = 39/54 (72%), Positives = 46/54 (85%), Gaps = 0/54 (0%)
Frame = +1

Query  1    FSLMSWRLREYFVITEYLIKLHFVPLFKGITMAddlttlikklldtSSGQALTD  162
            FSLMSW LR+YFV TEYLIK HFVPLF+G+TMADDLTT+ +K+L +SSGQ   D
Sbjct  497  FSLMSWELRDYFVFTEYLIKTHFVPLFEGLTMADDLTTVTQKMLSSSSGQGNDD  550

>....
>....
>....

>gb|AAA48442.1| L protein [Vesicular stomatitis virus]
Length=2109

Score =  115 bits (289),  Expect = 2e-26, Method: Compositional matrix adjust.
Identities = 81/172 (47%), Positives = 104/172 (60%), Gaps = 3/172 (2%)
Frame = +1

Query  1    FSLMSWRLREYFVITEYLIKLHFVPLFKGITMAddlttlikklldtSSGQALTDyrsyfy  180
            FSLMSWRLREYFVITEYLIK ++VPLFKG+TMADDLT++IKK++D+SSGQ L DY S   
Sbjct  541  FSLMSWRLREYFVITEYLIKTYYVPLFKGLTMADDLTSVIKKMMDSSSGQGLDDYSSVCL  600

Query  181  s*syrL*KMECSPTRNRE*SMFQGHGSIFRSAQLNTQN**IF**NYFLL*STTRPD-MIM  357
            +      K      +     +F+  G       L  +    F     L+    RPD M +
Sbjct  601  ANHIDYEKWNNHQRKESNGPIFRVMGQFLGYPSLIERTHEFF--EKSLIYYNGRPDLMTI  658

Query  358  NNGVIENLTDQRVCWNGQLGGQEGIGQRGCCVVNALAIERECQDQNIKVSLL  513
             NG + N T  RVCWNGQ GG EG+ Q+G  +VN L I+RE + +N  V +L
Sbjct  659  RNGTLCNSTKHRVCWNGQKGGLEGLRQKGWSIVNLLVIQREAKIRNTAVKVL  710



Lambda      K        H
   0.318    0.134    0.401 

Gapped
Lambda      K        H
0.267   0.0410    0.140 

Effective search space used: 304649109255

Viewing all articles
Browse latest Browse all 41826

Trending Articles