Loading GFF3

Load the landmarks

First, load the landmark scaffolds. The repo includes a FASTA file with scaffold names only, no sequences, for this purpose. Use the FASTA loader as described in the mRNA section: you do not need to define a parent relationship. You can use the SO term contig for the type.

Load the GFF file

Consider the example GFF file below.

##gff-version 3
Contig0 FRAEX38873_v2   gene    16315   44054   .   +   .   ID=FRAEX38873_v2_000000010;Name=FRAEX38873_v2_000000010;biotype=protein_coding
Contig0 FRAEX38873_v2   mRNA    16315   44054   .   +   .   ID=FRAEX38873_v2_000000010.1;Parent=FRAEX38873_v2_000000010;Name=FRAEX38873_v2_000000010.1;biotype=protein_coding;AED=0.05
Contig0 FRAEX38873_v2   five_prime_UTR  16315   16557   .   +   .   ID=FRAEX38873_v2_000000010.1.5utr1;Parent=FRAEX38873_v2_000000010.1
Contig0 FRAEX38873_v2   exon    16315   16967   .   +   .   ID=FRAEX38873_v2_000000010.1.exon1;Parent=FRAEX38873_v2_000000010.1
Contig0 FRAEX38873_v2   CDS 16558   16967   .   +   0   ID=FRAEX38873_v2_000000010.1.cds1;Parent=FRAEX38873_v2_

The below table explains each column.

column ID explanation example value
1 seqid Name of the landmark chromosome or scaffold (not the feature itself). Contig0
2 source Program name, data source, etc FRAEX38873_v2
3 type Sequence ontology term for type_id of feature gene
4 start start of the feature. 16315
5 end end of the feature. 44054
6 score Float value or . The score, because the feature was computationally predicted. ignore. .
7 strand Can be = or -. Refers to the strand of DNA: ignore +
8 phase Can be 0, 1, 2, or . Refers to the open reading frame, you can ignore. .
9 attributes This includes the actual name for the feature that will be created (in this case FRAEX38873_v2_000000010). It also includes the Parent= tag. ID=FRAEX38873_v2_000000010;Name=FRAEX38873_v2_000000010;biotype=protein_coding

Preprocessing

Every line of the GFF file will result in a new feature. The above example will create gene, mRNA, five_prime_UTR, exon, CDS, and protein features (see below for how to skip protein creation). If you’d like to not load five_prime_UTR features, for example, delete them from the file beforehand.

The GFF Importer

First, upload the file. In order to use the GUI uploader, the file extension should be .gff or .gff3. See below for information on GFF types.

Landmark Type

The landmark is the Chado feature on which the individual features are being mapped. This is typically a scaffold, contig, or chromosome (we chose contig above). If your landmarks are not uniquely named for this organism, you can specify the type here.

Protein names

As before, you may need to specify a regexp so that your proteins are correctly linked to your mRNA. Note that if you dont specify a protein regexp, it will look for proteins that are [mrna_name]-protein. This could result in new proteins being inserted accidentally! I’ve submitted a change that will allow you to skip creating proteins in this manner, look for it soon.

A note on GFF versions

GFF files are not the most uniform files around. There are GFF, GFF2, GTF, and GFF3. The Tripal GFF loader does its best, but it was designed to work withGFF3.