GFFUtils: utilities for handling GFF and GTF files¶
Overview¶
The GFFUtils
package provides a small set of utility programs
for working with GFF and GTF files, specifically:
gff_cleaner
: perform “cleaning” operations on a GFF filegff_annotation_extractor
: combine and annotate feature counts (e.g. fromhtseq-count
) with data from a GFF filegtf_extract
: extract selected data items from a GTF filegtf2bed
: convert GTF file to BED format
Warning
The old names for the utilities (GFFcleaner
,
GFF3_Annotation_Extractor
and GTF_extract
) are
still supported, but will be removed in a future release.
Installation¶
To install the latest version of GFFUtils
, download the latest
release from as a tar.gz
file from:
https://github.com/fls-bioinformatics-core/GFFUtils/releases
For example for version 0.10.3, download GFFUtils-0.10.3.tar.gz
.
Unpack the code using:
tar xzf GFFUtils-0.10.3.tar.gz
which will unpack into a new directory called e.g.
GFFUtils-0.10.3
.
It is recommended to install the code into a Python virtual environment, which you can create by doing:
virtualenv venv
source venv/bin/activate
pip install -r ./GFFUtils-0.10.3/requirements.txt
pip install ./GFFUtils-0.10.3/
To install the developmental code directly from GitHub:
pip install -r https://raw.githubusercontent.com/fls-bioinformatics-core/GFFUtils/devel/requirements.txt
pip install git+https://github.com/fls-bioinformatics-core/GFFUtils.git@devel
Note
GFFUtils
is currently supported under the following Python
versions:
- 2.7
- 3.5
- 3.6
- 3.7
- 3.8
but support for Python 2.7 is likely to be dropped in the near future.
Contents¶
gff_cleaner
: clean up GFF files¶
Overview¶
The gff_cleaner
utility performs various manipulations on a GFF
file to “clean” it.
Usage and options¶
General usage syntax:
gff_cleaner [OPTIONS] <file>.gff
Options:
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show the help message and exit
-
-o
OUTPUT_GFF
¶ Name of output GFF file (default is
<file>_clean.gff
)
-
--prepend
=PREPEND_STR
¶ String to prepend to seqname in first column
-
--clean
¶
Perform all the ‘cleaning’ manipulations on the input data (equivalent to specifying all of
--clean-score
,--clean-replace-attributes
,--clean-exclude-attributes
and--clean-group-sgds
)
-
--clean-score
¶
Replace
Anc_*
and blanks in thescore
field with zeroes
-
--clean-replace-attributes
¶
Replace
ID
,Gene
,Parent
andName
attributes with the value of theSGD
attribute, if present
-
--clean-exclude-attributes
¶
Remove the
kaks
,kaks2
andncbi
attributes (to remove arbitrary attributes, see the--remove-attribute=...
option)
-
--clean-group-sgds
¶
Group features with the same
SGD
by adding unique numbers to theID
attributes;ID``s will have the form ``CDS:<SGD>:<n>
(wheren
is a unique number for a given SGD)
-
--report-duplicates
¶
Report duplicate
SGD
names and write list to<file>_duplicates.gff
with line numbers, chromosome, start coordinate and strand.
-
--resolve-duplicates
=MAPPING_FILE
¶ Resolve duplicate
SGD``s by matching against 'best' genes in the supplied mapping file; other non-matching genes are discarded and written to ``<file>_discarded.gff
.
-
--discard-unresolved
¶
Discard any unresolved duplicates, which are written to
<file>_unresolved.gff
.
-
--insert-missing
=GENE_FILE
¶ Insert genes from gene file with
SGD
names that don’t appear in the input GFF. IfGENE_FILE
is blank (‘=’s must still be present) then the mapping file supplied with the--resolve-duplicates
option will be used instead.
-
--add-exon-ids
¶
For exon features without an
ID
attribute, construct and insert an ID of the formexon:<Parent>:<n>
(wheren
is a unique number).
-
--add-missing-ids
¶
For features without an
ID
attribute, construct and insert a generated ID of the form<feature>:<Parent>:<n>
(wheren
is a unique number).
-
--no-percent-encoding
¶
Convert encoded attributes to the correct characters in the output GFF.
This may result in a non-cannonical GFF that can’t be read correctly by this or other programs.
-
--remove-attribute
=RM_ATTR
¶ Remove attribute
RM_ATTR
from the list of attributes for all records in the GFF file (can be specified multiple times)
-
--strict-attributes
¶
Remove attributes that don’t conform to the
KEY=VALUE
format
-
--debug
¶
Print debugging information
Output files¶
<file>_clean.gff
: ‘cleaned’ version of input<file>_duplicates.txt
: list of duplicatedSGD
names and the lines they appear on in the input file, along with chromosome, start coordinate and strand<file>_discarded.gff
: genes rejected by--resolve-duplicates
<file>_unresolved.gff
: unresolved duplicates rejected by--discard-unresolved
Usage recipe¶
The following steps outline the procedure for using the program, with each step being run on the output from the previous one:
Clean the chromosome names in the file by adding a prefix (`–prepend` option)
Creates a copy of the input file with the chromosome names updated with a specified prefix.
E.g.
--prepend=chr
will addchr
to the start of each chromosome name in the file, which is useful if the chromosome is denoted by a number and needs the prefix for consistency with a mapping file.Clean the GFF score and attribute data (`–clean` options)
The “clean” options perform the following operations:
--clean-score
: the data in the score column is cleaned up by replacingAnc_*
and blanks with ‘0’s.
The attribute field of the GFF can contain various semicolon-separated key-value pairs:
--clean-replace-attributes
: if one of these is a non-blankSGD
then theGene
,Parent
andName
values are updated to be the same as theSGD
name.--clean-exclude-attributes
: attributes calledkaks
,kaks2
andncbi
are removed (n.b. to remove arbitrary attributes, use the more general--remove-attribute=...
option).
If multiple features share the same SGD name then
--clean-replace-attributes
can result in them also sharing the same ID; to deal with this:--clean-group-sgds
: update the ID attribute to group neighbouring lines that have the sameSGD
(see SGD grouping below).A single –clean can be specified which performs all these operations automatically.
Detect duplicate SGDs (`–report-duplicates` option)
Report duplicate SGD names found in the input file.
This option writes a list of the duplicates to a ‘duplicates’ file.
It also reports the number of ‘trivial’ duplicates, i.e. lines having the same
SGD
because they are part of the same gene.Resolve duplicate SGDs using a mapping file (`–resolve-duplicates` option)
Attempt to resolve duplicates by referring to a list of “best” genes given in a mapping file. For each duplicated name the resolution procedure is:
- Find mapping gene(s) with the same name
- For each mapping gene, keep duplicates which match chromosome, strand
and which overlap with the start and end of the gene (see
Overlap criteria below). For
SGD
groups the mapping gene must overlap the whole group for it to match; mapping genes and duplicates which don’t have matches are removed from the process. - At the end of the matching procedure the duplication is resolved if
there is one
SGD
(orSGD
group) matched to one mapping gene. Otherwise the duplication remains unresolved.
When duplicates are resolved, the non-matching duplicates are discarded; otherwise by default all unresolved duplicates are kept. However if the
--discard-unresolved
option is also specified then all unresolved duplicates are removed before output; the--insert-missing
option can then be used to add them back in.Note that the
--discard-unresolved
option cannot get rid of ‘trivial’ duplicates (i.e. lines having the same SGD because they are part of the same gene).Add missing genes (`–insert-missing` option)
Adds genes from a list of “best” genes given in a mapping file which have names not found in the input GFF.
SGD grouping¶
As part of setting the ID
attribute of GFF lines, the “clean” option
also attempts to group neighbouring lines which have the same SGD
name.
The ID
attribute is updated to the form:
ID=CDS:<sgd_name>:<i>
where <sgd_name>
is a gene or transcript name (e.g. YEL0W
) and
<i>
is an integer index which starts from 1. Groupings are indicated
by subsequent lines having the same <sgd_name>
but monotonically
increasing indices, for example:
chr1 Test CDS 34525 35262 0 - 0 ID=CDS:YEL0W:1;SGD=YEL0W
chr1 Test CDS 35823 37004 0 - 0 ID=CDS:YEL0W:2;SGD=YEL0W
chr1 Test CDS 38050 38120 0 - 0 ID=CDS:YEL0W:3;SGD=YEL0W
chr1 Test CDS 39195 39569 0 - 0 ID=CDS:YEL0W:4;SGD=YEL0W
When determining a grouping the program looks ahead from each line for subsequent lines (up to five) which have the same SGD value. So groupings can also accommodate “breaks”, for example:
chr1 Test CDS 34525 35262 0 - 0 ID=CDS:YEL0W:1;SGD=YEL0W
chr1 Test CDS 35823 37004 0 - 0 ID=CDS:YEL0W:2;SGD=YEL0W
chr1 Test CDS 38050 38120 0 - 0 ID=CDS:YEL0X:1;SGD=YEL0X
chr1 Test CDS 39195 39569 0 - 0 ID=CDS:YEL0W:3;SGD=YEL0W
Mapping file format¶
The mapping file is a tab-delimited text file with lines of the form:
name chr start end strand
<name>
is used to match against the SGD
names in the input GFF file.
Overlap criteria¶
Aside from matching chromosome and strand, one of the criteria for a mapping gene to match a duplicate from the GFF file is that the two must overlap.
An overlap is counted as the duplicate from the GFF having start/end
positions such that it lies inside the start/end positions of the mapping
gene extended by 1kb i.e. between start - 1000
and end + 1000
.
gff_annotation_extractor
: annotate gene feature data¶
Overview¶
gff_annotation_extractor
takes gene feature data (for example the
output from one or more runs of the HTSeq-count program) and combines
it with data about each feature’s parent gene from a GFF file.
By default the program takes feature data from a single tab-delimited input file where the first column contains feature IDs, and outputs an updated copy of the file with data about the feature’s parent feature and parent gene appended to each line.
In ‘htseq-count’ mode, one or more htseq-count
output files should
be provided as input; the program will write out the data about the
feature’s parent feature and parent gene appended with the counts from
each input file.
By default feature IDs from the feature data files are matched to
the first record in the input GFF where the ID
attribute of that
record is the same (a different attribute can be specified using the
-i
option). All records are considered regardless of the feature
type, unless the -t
option is used to restrict the records to
just those with the specified feature type (this may be required in
‘htseq-count’ mode).
The parent gene is located by recursively looking up records where
the ID
attribute matches the Parent
attribute, until a
gene record is found.
Note
gff_annotation_extractor
can also be used with GTF input,
in which case the feature IDs are matched using the gene_id
attribute by default. Only gene
feature types are considered
when using GTF data.
Usage and options¶
General usage syntax:
gff_annotation_extractor OPTIONS <file>.gff FEATURE_DATA
Usage in ‘htseq-count’ mode:
gff_annotation_extractor --htseq-count OPTIONS <file>.gff FEATURE_COUNTS [FEATURE_COUNTS2 ...]
Options:
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show the help message and exit
-
-o
OUT_FILE
¶ specify output file name
-
-t
FEATURE_TYPE
,
--type
=FEATURE_TYPE
¶ restrict feature records to this type when matching features from input count files; if used in conjunction with
--htseq-count
then should be the same as that specified when running htseq-count (default: include all feature records)
-
-i
ID_ATTRIBUTE
,
--id-attribute
=ID_ATTRIBUTE
¶ explicitly specify the name of the attribute to get the feature IDs from (defaults to
ID
for GFF input,gene_id
for GTF input)
-
--htseq-count
¶
htseq-count mode: input is one or more output
FEATURE_COUNT
files from thehtseq-count
program
‘htseq-count’ mode¶
To generate the feature count files using htseq-count
do e.g.:
htseq-count --type=exon -i Parent <file>.gff <file>.sam
which returns counts of each exon against the name of that exon’s parent.
gff_annotation_extractor
should then be run using the same
value for the --type
option:
gff_annotation_extractor --htseq-count --type=exon <file>.gff <counts>.out
Output files¶
gff_annotation_extractor
always produces a copy of the feature
data annotated with data for each parent gene. By default this will
be called <basename>_annot.txt
; use the -o
option to specify
a different name.
The annotation consists of the following fields:
exon_parent
: ID for the parent featurefeature_type_exon_parent
: type for the parent featuregene_ID
: ID for the gene the feature belongs togene_name
: name of the gene (from theName
attribute for GFF, orgene_name
attribute for GTF)chr
: chromosome of the genestart
: start position of the geneend
: end position of the genestrand
: strand for the genegene_length
: gene lengthlocus
: string consisting of<chr>:<start>-<end>
description
: text from the gene’sdescription
attribute
In the default mode these fields are appended to each line from
the input feature file; in ‘htseq-count’ mode each line in the
annotation file consists of these fields, with the counts from
each htseq-count
file appended.
If a parent gene cannot be located for a feature then the annotation for that feature will be empty.
In ‘htseq-count’ mode an additonal file called
<basename>_annot_stats.txt
is also produced with the counts
of “ambiguous”, “two_low_aQual” etc from each log.
Warnings and errors¶
The following is a non-exhaustive list of the warnings and errors
that gff_annotation_extractor
can produce, along with a brief
description and possible cause:
Unable to locate parent data for feature '...'
: indicates IDs in the feature files for which no matching records can be located in the input GFF. In this case the output annotation will be blank. Check that the input feature file consists of tab-delimited data.Multiple parents found on line ...
: indicates that a record matching a feature ID has aParent
attribute which contains multiple comma-separated IDs. In this case it may not be possible to locate the parent gene for the feature.No identifier attribute (...) on line ...
: indicates a record from the input GFF with noID
attribute (or custom attribute supplied via-i
option).No '...' attribute found on line ...
: indicates a record from the input GTF with nogene_id
attribute (or custom attribute supplied via-i
option).
gtf_extract
: extract data items from GTF/GFF¶
Overview¶
The gtf_extract
utility extracts selected data items from a
GTF file and output in tab-delimited format.
Note
The program can also operate on GFF files provided the --gff
option is specified.
Usage and options¶
General usage syntax:
gtf_extract OPTIONS <gft_file>
Options:
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show the help message and exit
-
-f
FEATURE_TYPE
,
--feature
=FEATURE_TYPE
¶ only extract data for lines where feature is FEATURE_TYPE
-
--fields
=FIELD_LIST
¶ comma-separated list of fields to output in tab-delimited format for each line in the GTF, e.g.
chrom,start,end
.Fields can either be a GTF field name (i.e.
chrom
,source
,feature
,start
,end
,score
,strand
andframe
), or the name of an attribute (e.g.gene_name
,gene_id
etc).Data items are output in the order they appear in
FIELD_LIST
. If a field doesn’t exist for a line then'.'
will be output as the value.
-
-o
OUTFILE
¶ write output to OUTFILE (default is to write to stdout)
-
--gff
¶
specify that the input file is GFF rather than GTF format
Output¶
The program outputs a tab-delimited line of data for each matching line
found in the input GTF file; the data items in the line are those
specified by the --fields
option (or else all data items, if no fields
were specified).
For example, for --fields=chrom,start,end,strand
, the GTF line:
chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4" ...
will produce the output:
chr1 11869 14412 +
By default the output of the program is written to stdout; use the
-o
option to direct the output to a named file instead.
gtf2bed
: convert GTF contents to BED format¶
Overview¶
gtf2bed
converts the contents of a GTF file to BED format, printing
a single line for each gene
entry in the input GTF.
Output¶
The program prints the BED file contents directly to stdout, for example:
Gnai3 3 108107280 108146146 - gene
Pbsn X 77837901 77853623 - gene
Cdc45 16 18780447 18811987 - gene
H19 7 142575529 142578143 - gene
Scml2 X 161117193 161258213 + gene
To sent this to a file use:
gtf2bed FILE.gtf > FILE.bed
If any problems are encountered then the program will print warning messages to stderr.
Extra scripts and utilities¶
Additional information¶
See http://www.sanger.ac.uk/resources/software/gff/spec.html for details of the GFF format.
Credits¶
These utilities have been developed by Peter Briggs with input from Leo Zeef, to support the activities of the Bioinformatics Core Facility (BCF) in the Faculty of Biology Medicine and Health (FBMH) at the University of Manchester (UoM).