Quantcast
Channel: Post Feed
Viewing all articles
Browse latest Browse all 41826

How to extract desire genes blast XML result from a big blast XML file

$
0
0

Dear community,

I have a XML file contained 50,000 genes Blast result with 10 hits for each gene. I want to mine desire genes Blast result from that XML file and the output file is still in XML format. The output file should have a full Blast result in XML form of desire genes.

I tired with the Python script shared by juefish on this posts, however, I only get partial of one gene Blast XML result parsed instead a list of my desire genes. I quote juefish's Python script here to make the discussion easier.

#!/usr/bin/env python

import sys
import os
import sets
import Bio

from sets import Set
from Bio.Blast import NCBIXML

# Usage.
if len(sys.argv) < 2:
        print ""
        print "This program extracts blast results from an xml file given a list of query sequences"
        print "Usage: %s -list file1 -xml file2 > outfile"
        print "-list: list of sequence names"
        print "-xml: blast xml output file"
        print ""
        sys.exit()

# Parse args.
for i in range(len(sys.argv)):
        if sys.argv[i] == "-list":
                infile1 = sys.argv[i+1]
        elif sys.argv[i] == "-xml":
                infile2 = sys.argv[i+1]

fls = [infile1,infile2]
results_handle = open(fls[1], "r")
fin1 = open(fls[0],"r")
geneContigs = Set([])

#establish list of names of queries to extract from xml file
for line in fin1:
    temp=line.lstrip('>').split()
    geneContigs.add(temp[0])
fin1.close()

genes2 = []
marker = False
header = True

#cycle through xml file and printing results that match to query list
for line in results_handle:
        line = line.rstrip('\r\n')
        if(header==True):
                if(line.count("<Iteration>")>0):
                        header=False
                        print "%s" % (line)
                        continue
                print "%s" % (line)
        else:
                    if(line.count("<Iteration>")>0):
                            genes2=[]
                    if(marker==False):
                            genes2.append(line)
                    if(marker==True):
                            if(line.count("</Iteration>")<1):
                                    print "%s" % (line)
                            elif(line.count("</Iteration>")>0):
                                    print "%s" % (line)
                                    marker=False
                    if(line.count("<Iteration_query-def>")>0):
                            split1 = line.split(">")
                            split2 = split1[1].split("<")
                            if(split2[0] in geneContigs):
                                    for i in range(len(genes2)):
                                            print "%s" % (genes2[i])
                                    marker=True
                            genes2=[]

I'm not good in Python nor BioPython, it would be great if the community could share with me your experience or suggestion.

Thanks for your time.


Viewing all articles
Browse latest Browse all 41826

Trending Articles