Building a (customized) local blast database

Have you ever encountered this problem: there are a bunch of genes you’d like to identify their closest homolog (say, 50 genes), but submitting all of them to NCBI blast and going through the results is tedious. Or you are looking for the homolog of a specific gene in your species, but this is not a model species and relying on software with blast function takes a long time. If the answer is yes, building a local blast database might be helpful for you.

Here are a few things you can try when a local blast database in hand:

Batch process gene function annotation (blast against public record);
Search a gene homolog in an organism that has no public records (i.e, cannot be found through NCBI blast) (blast against a transcript database or a genome);
Screen through the genome to find all the copies of your target gene (blast against a genome)
Look for genes that may have been missed in transcript record (i.e, it is present in the genome but was somehow missed by transcript annotation program) (blast against a genome);
Check the uniqueness of designed primers (blast against a genome)

This tutorial will demonstrate how to build and use a local blast database with two cases. Case 1 will show how to batch process gene function annotation, and case 2 will perform primer-blast against a genome.

Case 1: Annotate gene function of Solanum lycopersicum transcirpts

1.1 Download the Refseq blast database

Search “NCBI Refseq” online. This website stores a collection of “comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins”. This sequence collection is essentially the same database behind a standard webpage NCBI blast, except once it is downloaded, you will need to update it manually. The frequency of Refseq record update depends on the species you work on. For plant and microbes, updates come with the release of every new annotated genome or submitted updates for annotated genomes. If you would like to choose a specific version of Refseq database to work on, you may need to download everything and build a version-specific one. If your work requires you to compare the blast results between different versions of the public database, it may be a good practice to label the date with every database.

When you enter the index page, find “RefSeq FTP” under “RefSeq Access” section, click into the ftp site and select “release”.

You will see a list of refseq database available under the /refseq/release directory. As I am a plant person, I will choose the plant database here for demo. When getting into /refseq/release/plant directory, there is a long list of files available to download. For our purposes, we are looking for fasta files that contain either RNA (for building a nucleotide database), genome (for a species-specific genomic database) or protein (for an amino acid database) sequences. The files to download are either ending with rna.fna.gz (.rna.fna.gz), genomic.fna.gz (.genomic.fna.gz) or protein.faa.gz (*.protein.faa.gz).

*There is a very detailed explanation on “What is the file content within each specific assembly directory“

To batch download everything, I use wget command. If you are using a cloud service or high-performance computer that is connected to the internet, wget can download the data directly from internet. However the HPC on NBI is more of an intranet, so I always download the data to my local disk and transfer them onto my HPC account manually. I still use wget, but I type them in the command prompt in windows system (search “command prompt” or “cmd” in windows and that command line interface is what you need).

*wget doesn’t seem to be a default installation on windows, but you can install it following the tutorial here

#download all RNA sequence files from the ftp site. This can take a while.
wget -nd -np -r 1 -A *.rna.fna.gz ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/

With all the files downloaded and transferred to HPC, we can start to build the database!

1.2 Building a Refseq database

First, let’s combine all the *.rna.fna.gz files into one.

#make a new directory for the database
mkdir -p ~/Refseq_nucl 

#concatenate all the files
zcat ~/*.rna.fna.gz > ~/RefSeq_nucl/plant.rna.fna

The package you are looking for is called blast. Find blast, load it into the environment.

#catalogue --search blast
#source package {ID for blast}

Making a blast database calls the function makeblastdb to build an index:

makeblastdb -in ~/RefSeq_nucl/plant.rna.fna -dbtype nucl -parse_seqids -out ~/RefSeq_nucl/plant.rna

#-in: input file
#-dbtype: molecule type, can be 'nucl' or 'prot', depending on the data
#-parse_seqids: flag to link the sequence id with taxid
#-out: name of the blast database to be created

Although the demonstration only shows how to build a Refseq RNA database, the same method can also be applied to build a Refseq protein database, a UniProt protein database, a RNA or protein database customized to your organism or even a genome database. There are many more possibilities to explore, but the fundamentals are essentialy the same.

1.3 Batch process gene function annotation

Using the Refseq database we built, we can annotate the the gene function of S. lycopersicum transcripts. The data used for demonstration come from Ensemble and you can download the transcript data here.

#unzip the file
gunzip ~/Solanum_lycopersicum.SL3.0.cds.all.fa.gz

#check the header
head ~/Solanum_lycopersicum.SL3.0.cds.all.fa

It has a very long header, which sometimes can confuse programs. We can truncate it to keep only the gene id.

#replace the contents after "cds" with blank (cut off the string after "cds"）
sed 's/ cds.*//g' ~/Solanum_lycopersicum.SL3.0.cds.all.fa > ~/S_lycopersicum.cds

Sometimes processing the blast as a linear, single task can take very long time. By parsing one large fasta file into multiple smaller ones, and perform blast in parallel, it can speed up the process. I use seqkit to perform the parsing and gnu-parallel to run multiple blast jobs in parallel.

#catalogue --search seqkit
#source package {ID for seqkit}

#split the large file to 40 parts
seqkit split -p 40 S_lycopersicum.cds

#catalogue --search parallel
#source package {ID for gnu parallel}
ls ~/S_lycopersicum.cds.split/*.cds | parallel -j 8 --pipe 'blastn -query *.cds -out *.tab -db ~/RefSeq_nucl/plant.rna -evalue 1e-5 -outfmt "6 qseqid stitle"  -max_target_seqs 1 -num_threads 12' >> ~/S_lycopersicum.tab

The actual blastn command part in the above code is ” blastn -query *.cds -out *.tab -db ~/RefSeq_nucl/plant.rna -evalue 1e-5 -outfmt ‘6 qseqid stitle’ -max_target_seqs 1 -num_threads 12″. You can simply run the blastn command on the non-splitted transcript file without parsing the sequence, it will still work, Here is a more detailed explanation on what is each flag doing.

#-query: input sequence to be blasted against the database
#-out: blast output file
#-db: database used for blasting
#-evalue: threshold for returning a match result
#-outfmt: blastn output format In output format 6, the results is a tabular separated list. Standard full set of output include the following specifiers: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore. Here we only take qseqid (query or source (gene) sequence id) and stitle (another specifier, subject title) in the output.
#-max_target_seqs: maximum number of aligned sequence to keep
#-num_threads: Number of threads (CPUs) to use in the BLAST search

Case 2: Blast qRT-PCR primers against S. lycopersicum genome

Performing primer-blast against genome is very similar to blasting a gene against public database: first build a blast database using genome sequence file, then blast search the primer. Let’s start with setting up a genome blast database (download the genome fasta file here).

mkdir ~/Soly_genome
gunzip ~/Solanum_lycopersicum.SL3.0.dna_sm.toplevel.fa.gz 
mv ~/Solanum_lycopersicum.SL3.0.dna_sm.toplevel.fa ~/Soly_genome/Soly_genome.fa

#build a genome blast database
makeblastdb -in ~/Soly_genome/Soly_genome.fa -dbtype nucl -parse_seqids -out ~/Soly_genome/Soly.genome

The primer sequences we use come from this paper and they are flanking the qRT-PCR house-keeping reference gene SlActin (although, the cultivar we used to build the genome database is a different one..)

#cat ~/primer.fa
>ACT-q_F
GGGGGCTATGAATGCACGGT
>ACT-q_R
GGCAATGCATCAGGCACCTC

The blastn command is pretty much the same as case 1, except that -evalue flag was removed and there is a new flag, -task. The -task flag is the most important one for primer blast search. The default blastn program is called megablast, which only return search results whose minimal length of an identical match is greater than 28. This won’t work with primers, so we here specify we use blastn (minimal exact match length = 11), blastn-short (minimal exact match length = 7) or dc-megablast (minimal exact match length = 11, but allow non-consecutive match).

blastn -task blastn -query ~/primer.fa -db ~/Soly_genome/Soly.genome -out ~/primer_search_result

The result looks like this, which has listed the coordinates of blast hits, the strand and the sequence. Well done!