I have a set of gene sequences and a specific sequence.
cat genes.fa
>Gene_1_chr1_1000_1200
ACGT...
>Gene_2_chr2_3000_3400
TTAT...
cat sequence.fa
>Searchable_sequence
ACGG...
I want to search for this specific sequence in (1) the gene sequences and (2) the gene flanking sequences.
Gene flanking sequence = gene coordinates +/- gene size
That is, each flank is the region immediately upstream or downstream of the gene locus, with the same length as the gene itself (so the flanking-sequence FASTA file is twice the size of the original gene FASTA file).
cat gene_flanks.fa
>Gene_1_Flank5_chr1_800_1000
CAGT...
>Gene_1_Flank3_chr1_1200_1400
AAGT...
>Gene_2_Flank5_chr2_2600_3000
TTAT...
>Gene_2_Flank3_chr2_3400_3800
ACAT...
Gene database size: 2 sequences - 600 nucleotides
Flanks database size: 4 sequences - 1200 nucleotides
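To make the flank definition concrete, here is a minimal Python sketch that derives the two flank headers from a gene header. It assumes headers always look like `Name_chrom_start_end`, ignores strand (Flank5 is simply the lower-coordinate side), and clamps at 0 near the chromosome start; the function name `flank_headers` is hypothetical and only for illustration.

```python
def flank_headers(gene_header):
    """Return (5' flank header, 3' flank header) for a gene header.

    Assumes the header format Name_chrom_start_end, e.g. Gene_1_chr1_1000_1200.
    Strand is ignored: Flank5 is the lower-coordinate side (an assumption).
    """
    # Split from the right so underscores inside the gene name are preserved.
    name, chrom, start, end = gene_header.rsplit("_", 3)
    start, end = int(start), int(end)
    size = end - start  # each flank is as long as the gene itself

    # Clamp at 0 so a flank never runs off the chromosome start (assumption).
    flank5 = f"{name}_Flank5_{chrom}_{max(start - size, 0)}_{start}"
    flank3 = f"{name}_Flank3_{chrom}_{end}_{end + size}"
    return flank5, flank3


for header in ("Gene_1_chr1_1000_1200", "Gene_2_chr2_3000_3400"):
    print(flank_headers(header))
```

For the two genes above this reproduces the four headers in gene_flanks.fa.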
I use BLAST for search:
blastn -task blastn -db genes -query $Sequence -outfmt 6 -out - | wc -l
6
blastn -task blastn -db flanks -query $Sequence -outfmt 6 -out - | wc -l
2
The number of hits differs between the original gene set and the flanks set. My questions are:
1. I am going to use the number of hits for enrichment analysis. How accurate is it to compare hit counts between databases of different sizes? (The e-value depends on database size, so I might be introducing bias because of the smaller/bigger database.)
2. I want to filter BLAST hits by e-value: can I use the same e-value
for databases that have differen ...
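To illustrate the concern about database size, here is a rough Python sketch based on the standard relation E = m * n * 2^(-S'), where m is the query length, n is the database length and S' is the bit score. It ignores the effective-length corrections BLAST applies, and the query length and bit score are hypothetical values; only the database sizes (600 and 1200 nt) and the hit counts (6 and 2) come from the example above. The per-kb hit density is one simple way to put the counts on a common scale, shown as an illustration rather than a recommendation.

```python
def expected_hits(query_len, db_len, bit_score):
    """Simplified e-value: E = m * n * 2^(-S'), without effective-length corrections."""
    return query_len * db_len * 2.0 ** (-bit_score)


def hits_per_kb(n_hits, db_len):
    """Hit count normalized to hits per 1000 nt of database."""
    return 1000.0 * n_hits / db_len


GENES_DB_LEN = 600    # from the post: 2 sequences, 600 nt
FLANKS_DB_LEN = 1200  # from the post: 4 sequences, 1200 nt
QUERY_LEN = 100       # hypothetical query length
BIT_SCORE = 30.0      # hypothetical bit score of one alignment

e_genes = expected_hits(QUERY_LEN, GENES_DB_LEN, BIT_SCORE)
e_flanks = expected_hits(QUERY_LEN, FLANKS_DB_LEN, BIT_SCORE)

# The same alignment gets an e-value twice as large in the twice-as-large database.
ratio = e_flanks / e_genes
print(f"e-value ratio (flanks / genes): {ratio}")

# Raw counts 6 vs 2 become densities per kb of database sequence.
print(f"genes:  {hits_per_kb(6, GENES_DB_LEN):.2f} hits/kb")
print(f"flanks: {hits_per_kb(2, FLANKS_DB_LEN):.2f} hits/kb")
```

Under this simplified model, a fixed e-value cutoff is effectively stricter on the larger database, because an alignment with the same bit score receives a proportionally larger e-value there.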