I have a set of ~200 known sequences of a particular domain that I want to use to fish out similar sequences from NCBI-NT. As you can imagine, my BLAST results for this dataset are redundant, in the sense that multiple sequences from my starting set will pick up some of the same sequences. Furthermore, I was intending to use the GI or accession number as the FASTA name, but multiple domains may be found for a single GI or accession, so simply reducing by GI number will not work. Therefore, I want to develop a parsing strategy that returns a FASTA file with all the significant sequences labelled according to 1) accession number, 2) start and 3) end sites, with no redundant sequences, and where I keep the longest alignment when the same sequence is returned by multiple BLAST hits.
In strategizing the best approach, I can think of a few ways to achieve this result:
- Iterate through the BLAST XML and build a list/dict of accession numbers with start/end locations. Use it to find the longest hit for each domain within each accession number. Then iterate through the BLAST output again with the reduced list to pull out the sequences to keep. (This is what I was planning on doing.)
- Write out a FASTA file for each accession number, align the sequences, and keep only the longest sequence for each domain. Then recombine. (Also not too difficult.)
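
The reduction step of the first approach can be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the hits have already been pulled out of the BLAST XML as `(accession, start, end)` tuples (for example with a parser such as Biopython's `Bio.Blast.NCBIXML`), two hits on the same accession count as "the same domain" if their subject coordinates overlap, and minus-strand hits may report start > end. The function name `longest_nonredundant_hits` and the accession numbers in the example are made up for illustration.

```python
from collections import defaultdict


def longest_nonredundant_hits(hits):
    """Keep the longest hit from each group of overlapping hits per accession.

    hits: iterable of (accession, start, end) tuples in subject coordinates.
    Returns a list of (accession, start, end) with start <= end.
    """
    # Group hits by accession, normalising coordinates so that start <= end
    # (assumption: minus-strand hits may be reported with start > end).
    by_accession = defaultdict(list)
    for accession, start, end in hits:
        lo, hi = sorted((start, end))
        by_accession[accession].append((lo, hi))

    kept = []
    for accession in sorted(by_accession):
        intervals = sorted(by_accession[accession])
        best = intervals[0]       # longest hit seen in the current cluster
        cluster_end = best[1]     # right-most coordinate covered by the cluster
        for lo, hi in intervals[1:]:
            if lo <= cluster_end:
                # Overlaps the current cluster: treat it as the same domain.
                cluster_end = max(cluster_end, hi)
                if hi - lo > best[1] - best[0]:
                    best = (lo, hi)
            else:
                # No overlap: close the cluster and start a new one.
                kept.append((accession, best[0], best[1]))
                best = (lo, hi)
                cluster_end = hi
        kept.append((accession, best[0], best[1]))
    return kept


# Hypothetical example hits (accession numbers are invented).
example_hits = [
    ("AB000001", 100, 400),
    ("AB000001", 120, 380),    # redundant with the hit above
    ("AB000001", 900, 1200),   # second domain on the same accession
    ("AB000001", 1250, 950),   # minus-strand hit overlapping the second domain
    ("XY999999", 10, 200),
]

for acc, start, end in longest_nonredundant_hits(example_hits):
    # FASTA-style name built from accession, start and end.
    print(f">{acc}_{start}_{end}")
```

Note that overlapping hits are chained together, so a long run of partially overlapping hits collapses into a single cluster. If that is too aggressive, a minimum-overlap threshold could be used instead of the plain `lo <= cluster_end` test.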