Hi,
I have a single file that contains 10,000 query sequences, each 300 bp long. The subject is a single chromosome that I have imported into a nucleotide database ("makeblastdb -dbtype nucl"). Due to the nature of the data, I expect virtually all (>99%) of the query sequences to find a strong match, but I only want to return the TOP match per query sequence. I assume that "-max_hsps" and "-max_target_seqs" are the way to achieve that, but I am not getting the expected results.
If I use "-max_hsps 1 -max_target_seqs 1", I get 322 (unique) hits. If I use "-max_hsps 1" by itself, I get the same 322 hits. If I use "-max_target_seqs 1" by itself, I get an enormous number of hits (which I could reduce by filtering by evalue, but that's not really the point - I just want the top hit). If I use neither option, I get a similarly enormous number of results.
It feels as though there is an error in blast where it is simply not blasting the vast majority of the sequences. I know there was a bugfix a couple of versions back that fixed something similar.
Has anyone come across something similar? Can anyone think of anything obvious that I might be doing wrong?
I'm using blast 2.2.29, on a Mac Mini running Darwin 13.3.0.
EDIT: Just in case it's not clear, I am hoping to have up to (but probably slightly fewer than) 10,000 results in my output file (one per query sequence). I am using "-outfmt 10", outputting to CSV.
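
For reference, a minimal sketch of how the CSV output could be reduced to one row per query after the fact, which also gives a quick count of how many queries produced any hit at all. It assumes the default "-outfmt 10" column order (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and takes "top" to mean the highest bit score. The function name and the sample rows are made up for illustration.

```python
import csv
import io

# Default columns written by "-outfmt 10" when no custom field list is given.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def top_hit_per_query(handle):
    """Return {qseqid: row dict}, keeping the highest-bitscore row per query.

    Assumption: "top hit" means highest bit score; ties keep the first row seen.
    """
    best = {}
    for fields in csv.reader(handle):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


if __name__ == "__main__":
    # Made-up rows standing in for real BLAST output.
    sample = io.StringIO(
        "q1,chr1,99.0,300,3,0,1,300,1000,1299,1e-150,540\n"
        "q1,chr1,90.0,100,10,0,1,100,5000,5099,1e-30,130\n"
        "q2,chr1,100.0,300,0,0,1,300,2000,2299,1e-160,555\n"
    )
    best = top_hit_per_query(sample)
    print(len(best), "queries with at least one hit")
```

With a real output file, the same function could be called on an open file handle, e.g. open("hits.csv", newline=""), where the file name is hypothetical.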
...