Extract nucleotide sequence from a fasta file

For anyone who works with nucleotide sequences, we have probably all faced this problem at one point: there is a particular sequence we would like to extract from the reference fasta file, but we don’t know how. It can be a gene structure that’s missed by the gene annotation algorithms, or a list of genes whose coordinates are available in gff form but no transcripts were given. Fortunately, there are command-line-based tools that can help.

Case study 1: Annotate un-identififed gene structure from genome

In a recent work from my group, we found a region on chromosome 6 in herbal plant Scutellaria baicalensis where there are RNA-seq read data that seemingly correspond to real gene structures, but there are no annotated genes.

Screenshot from IGV. The top track shows the leaf RNA expression data, and the bottom track displays the annotated gene structures (gff). We have four regions with a high level of RNA-seq reads mapping, but the gene annotations are missing.

We wanted to check if there are actually genes present here and characterize their functions if they exist. So, we used bedtools to extract the genomic sequence of the circled regions and predicted the transcript structure with Augustus. The reference genome sequence can be downloaded from here.

#mkdir ~/S_baicalensis
#mv ~/GWHAOTC00000000.genome.fasta.gz ~/S_baicalensis/baicalensis.fasta.gz 
#cd ~/S_baicalensis
#gunzip baicalensis.fasta.gz

#check the header of the genome file
head baicalensis.fasta

The header of the genome file is much longer than “ChrX”:

Welp, that’s still manageable. Based on the numbering, the header of chromosome 6 should start with “GWHAOTC00000006 …”. We can grep that string to check if it’s true:

grep GWHAOTC baicalensis.fasta

Great! That means chromosome 6 in this genome fasta file is called “GWHAOTC00000006”. We can specify the location of genomic region we are interested in a tab-delimited bed file.

But what is a bed file?

The term “bed” stands for “Browser Extensible Data.” A bed file typically contains a tab-delimited table with columns representing various genomic features, such as genome location, start position, end position, and additional information about the feature. The simplest bed file only contain three columns which reports the coordinate of the sequence. In this bed file, the coordinates are not precisely where the RNA-seq reads aligns, but it covers a slighly broader range.

#vi target.bed
#from left to right: genomic location, start position, end position
GWHAOTC00000006	23268000	23271000
GWHAOTC00000006	23279000	23281000
GWHAOTC00000006	23310000	23312200
GWHAOTC00000006	23316800	23318800

The sequence extraction work is performed by bedtools. It has many other cool features, but for this tutorial we only need to call the getfasta function.

#catalogue --search bedtools
#source package {ID for bedtools}

bedtools getfasta -fi baicalensis.fasta -bed target.bed > target.fa

#-fi: Input FASTA file
#-bed: BED/GFF/VCF file of ranges to extract from -fi

Alternatively, as I recently learned, samtools can also be used to extract sequences. It doesn’t work so good as bedtools when extracting multiple sequences at once, but it certainly get the job done.

#catalogue --search samtools
#source package {ID for samtools}

samtools faidx ~/baicalensis.fasta GWHAOTC00000006:23268000-23271000 >> target.fa
samtools faidx ~/baicalensis.fasta GWHAOTC00000006:23279000-23281000 >> target.fa
samtools faidx ~/baicalensis.fasta GWHAOTC00000006:23310000-23312200 >> target.fa
samtools faidx ~/baicalensis.fasta GWHAOTC00000006:23316800-23318800 >> target.fa

That’s it! The genome sequences are now stored in target.fa. As we were trying to identify the possibly present gene structures within these regions, we fed these extracted sequences into Augustus, a de novo gene annotation webpage.

Scroll down, there are also advanced features..

Submit the job, and you’ll see a detailed result display that specifies the exon locations (in relation to the submitted input file), coding sequences, and protein sequences. To obtain the predicted coding sequence, click on “graphical and text results” -> “predicted coding sequences,” and they will be displayed in fasta format. In our work, we blasted these sequences against NCBI, confirming that they are what we were looking for, and characterized their functions!

Case study 2: Extract the transcripts from gff file and genome fasta file

Still using the Scutellaria baicalensis genome as an example (download gff and reference genome sequence here), how can we convert gene coordinates, stored in a gff file, into a transcript fasta file?

We can use another handy tool called AGAT!

AGAT is another very powerful sequence analysis tool which specializes in performing analysis with gff/gtf files. In this tutorial, we will cover one specific function: agat_sp_extract_sequences.pl

#gunzip ~/GWHAOTC00000000.gff.gz
#mv ~/gunzip GWHAOTC00000000.gff ~/S_baicalensis/baicalensis.gff | cd ~/S_baicalensis
#catalogue --search AGAT
#source package {ID for AGAT}

agat_sp_extract_sequences.pl -g baicalensis.gff -f baicalensis.fasta -o S_baicalensis.cds

#-g: Input gff/gtf file
#-f: Input fasta file

This is it, super simple. There are a few other opions can be called, depending on what’s the desired output file. Two options I personally find most useful are -t and -p.

-t  or --type: Define the feature you want to extract the sequence from. Default 'cds'. Most common choice are: gene,mrna,exon,cds,trna,three_prime_utr,five_prime_utr.

-p, --protein or --aa: Will translate the extracted sequence in Amino acid. By default the codon table used is the 1 (Standard).

When in doubt, always call –help to check the manual for full list options. I hope this short tutorial was helpful for you!