Hi there!
This is my first post on your forum and I hope you'll be able to help me.
This is my first intership in bioinformatics and I'm kind of learning everything by myself. I learnt how to use python 2 days ago.
I'm actually working with a large dataset (200Go) of metagenomics sequences. My boss asked me to blast them against the 16SMicrobial NCBI database and then "copy/paste" in a new file the sequences with hits, assuming they are 16S related. Easier to say than to do when you blast about 413000 sequences.
So I decided to try to make a script with Python that would do it. I also installed Biopython because it seemed to be a very good tool.
My first idea is to use a script i found on the Biopython website. this script is used to retrieve sequences with no blast hits. This would at least help to separate sequences i don't from sequences i need.
from Bio import SeqIO
from Bio.Blast import NCBIXML
q_dict = SeqIO.to_dict(SeqIO.parse(open("queries.fasta"), "fasta"))
hits = []
for record in NCBIXML.parse(open("BLAST_RESULTS.xml")):
# As of BLAST 2.2.19 the xml output for multiple-query blast searches
# skips queries with no hits so we could just append the ID of every blast
# record to our 'hit list'. It's possible that the NCBI will change this
# behaviour in the future so let's make the appending conditional on there
# being some hits (ie, a non-empty alignments list) recorded in the blast record
if record.alignments:
# The blast record's ...