I want to create a custom BLAST database containing all viruses available in RefSeq, but I don't know which source files to use. My starting point is: ftp://ftp.ncbi.nih.gov/genomes/Viruses/
From my research I concluded that I need the all.fna.tar.gz file, since it supposedly contains the nucleotide sequences for all viruses in RefSeq. However, it turned out that, for example, Bluetongue_virus_uid14938 has no entry in this archive, yet it does have a directory with the corresponding files in the all.gbk.tar.gz archive.
So my question is: which archive (file type) should I use to create the most complete database of the viruses in RefSeq? Should I use the fna/ffn files, concatenate them, and pass the result to makeblastdb, or should I parse the .gbk files myself and create FASTA files from them, which basically means extracting the sequence from each .gbk record and rebuilding the header?
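For reference, the concatenation option I have in mind would look roughly like the sketch below. It is only a minimal illustration using Python's standard tarfile module. The function name concat_fasta_from_tar and the output file name viruses.fna are placeholders I made up, and I am assuming the archive simply contains one .fna file per sequence inside per-virus directories.

```python
import tarfile


def concat_fasta_from_tar(tar_path, out_path, suffixes=(".fna",)):
    """Write every archive member ending in one of `suffixes` into out_path.

    Returns the number of FASTA header lines (records) written.
    Assumption: members are plain-text FASTA files.
    """
    n_records = 0
    # "r:*" lets tarfile detect the compression (gzip here) automatically.
    with tarfile.open(tar_path, "r:*") as tar, open(out_path, "wb") as out:
        for member in tar:
            if not member.isfile() or not member.name.endswith(suffixes):
                continue
            handle = tar.extractfile(member)
            for line in handle:
                if line.startswith(b">"):
                    n_records += 1
                # Make sure a file without a trailing newline does not
                # run into the next record's header.
                out.write(line if line.endswith(b"\n") else line + b"\n")
    return n_records


# Hypothetical usage:
# concat_fasta_from_tar("all.fna.tar.gz", "viruses.fna")
```

The resulting file would then go to something like makeblastdb -in viruses.fna -dbtype nucl -out refseq_viruses. This of course does not solve the missing-entries problem: a virus that is absent from the fna archive would still be absent from the database.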