I have a set of gene sequences and specific sequence.
cat genes.fa
>Gene_1_chr1_1000_1200
ACGT...
>Gene_2_chr2_3000_3400
TTAT...
cat sequence.fa
>Searchable_sequence
ACGG...
I want to search for this specific sequence in 1. gene sequences; 2. gene flanking sequences.
Gene flanking sequence = Gene coordinates +/- gene size
-- Gene flanking sequence is gene locus plus/minus gene locus size (flanking sequence fasta file is two times bigger than original gene sequences fasta).
cat gene_flanks.fa
>Gene_1_Flank5_chr1_800_1000
CAGT...
>Gene_1_Flank3_chr1_1200_1400
AAGT...
>Gene_2_Flank5_chr2_2600_3000
TTAT...
>Gene_2_Flank3_chr2_3400_3800
ACAT...
Gene database size: 2 sequences - 600 nucleotides
Flanks database size: 4 sequences - 1200 nucleotides
I use BLAST for search:
blastn -task blastn -db genes -query $Sequence -outfmt 6 -out - | wc -l
6
blastn -task blastn -db flanks -query $Sequence -outfmt 6 -out - | wc -l
2
Number of hits between original gene set and flanks set differ. My questions are:
1. Am going to use number of hits for enrichment analysis - how accurate is it to compare number of hits between databases that have different size? (evalue depends of database size and I might be getting bias because of smaller/bigger database size).
2. I want to filter BLAST hits using evalue
- can I use same evalue
for databases that have different size?