
Influence of BLAST database size on the number of significant hits


I have a set of gene sequences and one specific sequence to search for.

cat genes.fa
>Gene_1_chr1_1000_1200  
ACGT...
>Gene_2_chr2_3000_3400
TTAT...  

cat sequence.fa
>Searchable_sequence
ACGG...

I want to search for this specific sequence in two sets: 1. the gene sequences; 2. the gene flanking sequences.

A gene's flanking sequences are the two regions immediately upstream and downstream of the gene locus, each as long as the gene itself (gene coordinates +/- gene size). The flanking-sequence FASTA file is therefore twice as large as the original gene FASTA file.

cat gene_flanks.fa
>Gene_1_Flank5_chr1_800_1000  
CAGT...
>Gene_1_Flank3_chr1_1200_1400  
AAGT...
>Gene_2_Flank5_chr2_2600_3000
TTAT...  
>Gene_2_Flank3_chr2_3400_3800
ACAT...  
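
As a minimal sketch of the flank definition, the flank headers could be derived from the gene headers like this. The function name `flank_headers` is hypothetical, and the sketch assumes headers of the form `>Name_chrom_start_end`, chromosome names without underscores, and genes on the forward strand (so the upstream flank is labelled Flank5). It only computes coordinates; extracting the actual sequences from a genome is not shown.

```shell
# Hypothetical helper: derive flank coordinates from FASTA headers of the
# form >Name_chrom_start_end (the last three "_"-separated fields).
flank_headers() {
  awk -F'_' '/^>/ {
    n = NF
    chrom = $(n - 2); gstart = $(n - 1) + 0; gend = $n + 0
    size = gend - gstart
    # Rebuild the gene name from the leading fields, dropping the ">".
    name = substr($1, 2)
    for (i = 2; i <= n - 3; i++) name = name "_" $i
    up = gstart - size
    if (up < 0) up = 0   # assumption: clamp at the chromosome start
    printf ">%s_Flank5_%s_%d_%d\n", name, chrom, up, gstart
    printf ">%s_Flank3_%s_%d_%d\n", name, chrom, gend, gend + size
  }'
}

# The two gene headers from genes.fa reproduce the four flank headers.
printf '>Gene_1_chr1_1000_1200\n>Gene_2_chr2_3000_3400\n' | flank_headers
```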

Gene database size: 2 sequences - 600 nucleotides
Flanks database size: 4 sequences - 1200 nucleotides

I run the search with BLAST:

blastn  -task blastn -db genes -query $Sequence -outfmt 6 -out - | wc -l   
6
blastn  -task blastn -db flanks -query $Sequence -outfmt 6 -out - | wc -l   
2
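
To put the two raw counts on a common scale, one simple option is hits per 1000 nucleotides of database, using the database sizes listed above. This is only a sketch of one possible normalization, not an established enrichment statistic, and `hits_per_kb` is a hypothetical helper name.

```shell
# Hypothetical helper: hits per 1000 nucleotides of database.
# Arguments: hit count, database length in nucleotides.
hits_per_kb() {
  awk -v hits="$1" -v len="$2" 'BEGIN { printf "%.2f\n", hits * 1000 / len }'
}

hits_per_kb 6 600    # genes database  -> 10.00
hits_per_kb 2 1200   # flanks database -> 1.67
```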

The number of hits differs between the original gene set and the flanks set. My questions are:
1. I am going to use the number of hits for enrichment analysis. How valid is it to compare hit counts between databases of different sizes? (The E-value depends on database size, so a smaller or bigger database might bias the counts.)
2. I want to filter BLAST hits by E-value. Can I use the same E-value threshold for databases of different sizes?
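
To illustrate the concern behind both questions: under the usual BLAST statistics the E-value of an alignment grows roughly in proportion to the size of the search space, so the same alignment would get about twice the E-value in the flanks database (1200 nt) as in the genes database (600 nt). The sketch below rescales the E-value column of tabular output (`-outfmt 6`, column 11) from one database size to another. It assumes strict proportionality and ignores effective-length corrections; `rescale_evalue` is a hypothetical helper, and the input line is made up for illustration, not real BLAST output.

```shell
# Hypothetical helper: rescale E-values (column 11 of -outfmt 6 output)
# from a database of $1 nucleotides to one of $2 nucleotides.
# Assumption: E-value is directly proportional to database length.
rescale_evalue() {
  awk -v from="$1" -v to="$2" 'BEGIN { FS = OFS = "\t" }
    { $11 = sprintf("%.2e", $11 * to / from); print }'
}

# Made-up tabular hit with E-value 1e-05, rescaled from 600 nt to 1200 nt.
printf 'q\ts\t95.0\t40\t2\t0\t1\t40\t10\t49\t1e-05\t50.0\n' |
  rescale_evalue 600 1200
```

The BLAST+ command-line help also lists a `-dbsize` option for setting the effective database length. Giving both searches the same value might make their E-values directly comparable, but that is untested here.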

