Quantcast
Channel: Post Feed
Viewing all articles
Browse latest Browse all 41826

Strange behaviour of BioPerl's Bio::SearchIO when parsing xml BLAST output

$
0
0

Hello, I've noticed some strange behaviour when parsing BLAST .xml output files (-oufmt 5) using BioPerl's Bio::SearchIO library.

I have a simple parser script that looks something like:

#!/usr/bin/perl -w

use strict;
use Bio::SearchIO;

my $in = Bio::SearchIO -> new (-format => 'blastxml', -file => "consensusSeqs.BLASTp.xml");
open (OUT, ">consensusSeqs.parse.OUT");

my $query_count = 1;

while( my $result = $in->next_result ) {
    print "Query count: ".$query_count."\n";
    print "Query name: ".$result->query_description."\n";
    print "Number of hits: ".$result->num_hits."\n";
    my $hit_count = 1;

    if (!defined $result->next_hit) {
        print OUT $result->query_description."\tNo hit\n";
    }

    while ( my $hit = $result->next_hit ) {
        print "\tHit count: ".$hit_count."\n";
        while ( my $hsp = $hit->next_hsp ) {
            my @a = split /\|/, $hit->name;
            my $hit_accession = $a[3];
            # print "Hit name: ".$hit->name."\n";
            print "\tHit accession: ".$hit_accession."\n";

             ## get some stats of hit
            my $percent_id = sprintf("%.2f", $hsp->percent_identity);
            my $percent_q_coverage = sprintf("%.2f", ((($hsp->length('query')/($result->query_length)))*100)); 
            my @b = split /[\[\]]/, $hit->description;
            my $organism = $b[1];

            (my $short_desc = $hit->description) =~ s/\[.*//;

            print "\t\tE-value: ".$hsp->evalue."\n";
            print "\t\t% identity: ".$percent_id."\n";
            print "\t\t% coverage: ".$percent_q_coverage."\n";
            print "\t\tDescription: ".$hit->description."\n";
            if (!defined $short_desc) {
                $short_desc = "No hits found";
            }
            print "\t\tShort description: ".$short_desc."\n";
            ## if there is nothing under [organism] 
            if (!defined $organism) {
                $organism = "not defined";
            }
            print "\t\tOrganism: ".$organism."\n";
            print "\t~~~\n";

            ## print the hit to OUT
            print OUT $result->query_description."\t".$percent_id."\t".$percent_q_coverage."\t".$organism."\t".$short_desc."\n"; 
        }
        $hit_count++;
    }
    print "\n";
    print OUT "~~~\n";
    $query_count++;
}

print "Finished\n\n";

With something like the above, I hope to capture some information for all query sequences, even if there is no hit. But I noticed that some queries were not in my output file. Looking more closely at the XML file itself, I think the issue only occurs when there is exactly one hit for a given query. In these cases, it seems to skip the while ( my $hit = $result->next_hit ) and while ( my $hsp = $hit->next_hsp ) loops and iterates directly to the next $result.

Has anyone else noticed this behaviour? Is it a bug, or is it something in the way I've set up my script? Any suggestions for a workaround would be much appreciated!

Thanks!


Viewing all articles
Browse latest Browse all 41826

Trending Articles