Short query sequences sometimes produce several hits to the same (often large) subject sequence. This is a problem if you ask for the 20 best hits, because blastall actually returns every hit to the 20 best subject sequences. For example, searching with the following query sequence:
IVLFGGSAVNGLLADTWQFDGTRWQQVAAVVANTAPNTAPNHALAYDARRELMVLYGGFDPLAPTGARSDTWEFDGHTWTARTDVSDANTRYNHAMTFDAVAGRVLLVGGHANGQMPLADYFQYDGVTWTRLLLAEPPPAGG
against RefSeq yields 84 hits with the following command:
> blastall -d complete.nonredundant_protein.faa -i test.faa -m 8 -b 20 -p blastp -o test.output
As stated, test.output contains 84 hits, but to only 20 subject sequences. How do I get blastall to report only one hit per subject?
Furthermore, blastp from the BLAST+ package has the same problem. Note that -num_descriptions cannot be used with tabular output, so -max_target_seqs is used instead:
> blastp -db complete.nonredundant_protein.faa -out test.blastplus -max_target_seqs 20 -outfmt 6 -query test.faa
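One possible workaround, until the search itself can be restricted, is to filter the tabular output afterwards. The sketch below is not a blastall or blastp option: it is a small post-processing helper, and the function name best_hit_per_subject is made up for illustration. It assumes the default tabular layout (column 1 is the query ID, column 2 is the subject ID) and that the first line listed for a given subject is its best-scoring hit.

```shell
# Hypothetical helper: keep only the first line seen for each
# (query ID, subject ID) pair in tab-separated BLAST output (-m 8 / -outfmt 6).
# Assumption: hits to the same subject are listed best-first, so the
# first line per pair is the one worth keeping.
best_hit_per_subject() {
    awk -F'\t' '!seen[$1 FS $2]++' "$1"
}

# Example usage (file names taken from the commands above):
#   best_hit_per_subject test.output > test.besthits
```

If the ordering assumption is in doubt, the file can be sorted by query, subject and descending bit score (column 12 in the default tabular format) before filtering.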