GFFcleaner
: clean up GFF files¶
Overview¶
The GFFcleaner
utility performs various manipulations on a GFF
file to “clean” it.
Usage and options¶
General usage syntax:
GFFcleaner [OPTIONS] <file>.gff
Options:
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show the help message and exit
-
-o
OUTPUT_GFF
¶ Name of output GFF file (default is
<file>_clean.gff
)
-
--prepend
=PREPEND_STR
¶ String to prepend to seqname in first column
-
--clean
¶
Perform all the ‘cleaning’ manipulations on the input data (equivalent to specifying all of
--clean-score
,--clean-replace-attributes
,--clean-exclude-attributes
and--clean-group-sgds
)
-
--clean-score
¶
Replace
Anc_*
and blanks in thescore
field with zeroes
-
--clean-replace-attributes
¶
Replace
ID
,Gene
,Parent
andName
attributes with the value of theSGD
attribute, if present
-
--clean-exclude-attributes
¶
Remove the
kaks
,kaks2
andncbi
attributes (to remove arbitrary attributes, see the--remove-attribute=...
option)
-
--clean-group-sgds
¶
Group features with the same
SGD
by adding unique numbers to theID
attributes;ID``s will have the form ``CDS:<SGD>:<n>
(wheren
is a unique number for a given SGD)
-
--report-duplicates
¶
Report duplicate
SGD
names and write list to<file>_duplicates.gff
with line numbers, chromosome, start coordinate and strand.
-
--resolve-duplicates
=MAPPING_FILE
¶ Resolve duplicate
SGD``s by matching against 'best' genes in the supplied mapping file; other non-matching genes are discarded and written to ``<file>_discarded.gff
.
-
--discard-unresolved
¶
Discard any unresolved duplicates, which are written to
<file>_unresolved.gff
.
-
--insert-missing
=GENE_FILE
¶ Insert genes from gene file with
SGD
names that don’t appear in the input GFF. IfGENE_FILE
is blank (‘=’s must still be present) then the mapping file supplied with the--resolve-duplicates
option will be used instead.
-
--add-exon-ids
¶
For exon features without an
ID
attribute, construct and insert an ID of the formexon:<Parent>:<n>
(wheren
is a unique number).
-
--add-missing-ids
¶
For features without an
ID
attribute, construct and insert a generated ID of the form<feature>:<Parent>:<n>
(wheren
is a unique number).
-
--no-percent-encoding
¶
Convert encoded attributes to the correct characters in the output GFF.
This may result in a non-cannonical GFF that can’t be read correctly by this or other programs.
-
--remove-attribute
=RM_ATTR
¶ Remove attribute
RM_ATTR
from the list of attributes for all records in the GFF file (can be specified multiple times)
-
--strict-attributes
¶
Remove attributes that don’t conform to the
KEY=VALUE
format
-
--debug
¶
Print debugging information
Output files¶
<file>_clean.gff
: ‘cleaned’ version of input<file>_duplicates.txt
: list of duplicatedSGD
names and the lines they appear on in the input file, along with chromosome, start coordinate and strand<file>_discarded.gff
: genes rejected by--resolve-duplicates
<file>_unresolved.gff
: unresolved duplicates rejected by--discard-unresolved
Usage recipe¶
The following steps outline the procedure for using the program, with each step being run on the output from the previous one:
Clean the chromosome names in the file by adding a prefix (`–prepend` option)
Creates a copy of the input file with the chromosome names updated with a specified prefix.
E.g.
--prepend=chr
will addchr
to the start of each chromosome name in the file, which is useful if the chromosome is denoted by a number and needs the prefix for consistency with a mapping file.Clean the GFF score and attribute data (`–clean` options)
The “clean” options perform the following operations:
--clean-score
: the data in the score column is cleaned up by replacingAnc_*
and blanks with ‘0’s.
The attribute field of the GFF can contain various semicolon-separated key-value pairs:
--clean-replace-attributes
: if one of these is a non-blankSGD
then theGene
,Parent
andName
values are updated to be the same as theSGD
name.--clean-exclude-attributes
: attributes calledkaks
,kaks2
andncbi
are removed (n.b. to remove arbitrary attributes, use the more general--remove-attribute=...
option).
If multiple features share the same SGD name then
--clean-replace-attributes
can result in them also sharing the same ID; to deal with this:--clean-group-sgds
: update the ID attribute to group neighbouring lines that have the sameSGD
(see SGD grouping below).A single –clean can be specified which performs all these operations automatically.
Detect duplicate SGDs (`–report-duplicates` option)
Report duplicate SGD names found in the input file.
This option writes a list of the duplicates to a ‘duplicates’ file.
It also reports the number of ‘trivial’ duplicates, i.e. lines having the same
SGD
because they are part of the same gene.Resolve duplicate SGDs using a mapping file (`–resolve-duplicates` option)
Attempt to resolve duplicates by referring to a list of “best” genes given in a mapping file. For each duplicated name the resolution procedure is:
- Find mapping gene(s) with the same name
- For each mapping gene, keep duplicates which match chromosome, strand
and which overlap with the start and end of the gene (see
Overlap criteria below). For
SGD
groups the mapping gene must overlap the whole group for it to match; mapping genes and duplicates which don’t have matches are removed from the process. - At the end of the matching procedure the duplication is resolved if
there is one
SGD
(orSGD
group) matched to one mapping gene. Otherwise the duplication remains unresolved.
When duplicates are resolved, the non-matching duplicates are discarded; otherwise by default all unresolved duplicates are kept. However if the
--discard-unresolved
option is also specified then all unresolved duplicates are removed before output; the--insert-missing
option can then be used to add them back in.Note that the
--discard-unresolved
option cannot get rid of ‘trivial’ duplicates (i.e. lines having the same SGD because they are part of the same gene).Add missing genes (`–insert-missing` option)
Adds genes from a list of “best” genes given in a mapping file which have names not found in the input GFF.
SGD grouping¶
As part of setting the ID
attribute of GFF lines, the “clean” option
also attempts to group neighbouring lines which have the same SGD
name.
The ID
attribute is updated to the form:
ID=CDS:<sgd_name>:<i>
where <sgd_name>
is a gene or transcript name (e.g. YEL0W
) and
<i>
is an integer index which starts from 1. Groupings are indicated
by subsequent lines having the same <sgd_name>
but monotonically
increasing indices, for example:
chr1 Test CDS 34525 35262 0 - 0 ID=CDS:YEL0W:1;SGD=YEL0W
chr1 Test CDS 35823 37004 0 - 0 ID=CDS:YEL0W:2;SGD=YEL0W
chr1 Test CDS 38050 38120 0 - 0 ID=CDS:YEL0W:3;SGD=YEL0W
chr1 Test CDS 39195 39569 0 - 0 ID=CDS:YEL0W:4;SGD=YEL0W
When determining a grouping the program looks ahead from each line for subsequent lines (up to five) which have the same SGD value. So groupings can also accommodate “breaks”, for example:
chr1 Test CDS 34525 35262 0 - 0 ID=CDS:YEL0W:1;SGD=YEL0W
chr1 Test CDS 35823 37004 0 - 0 ID=CDS:YEL0W:2;SGD=YEL0W
chr1 Test CDS 38050 38120 0 - 0 ID=CDS:YEL0X:1;SGD=YEL0X
chr1 Test CDS 39195 39569 0 - 0 ID=CDS:YEL0W:3;SGD=YEL0W
Mapping file format¶
The mapping file is a tab-delimited text file with lines of the form:
name chr start end strand
<name>
is used to match against the SGD
names in the input GFF file.
Overlap criteria¶
Aside from matching chromosome and strand, one of the criteria for a mapping gene to match a duplicate from the GFF file is that the two must overlap.
An overlap is counted as the duplicate from the GFF having start/end
positions such that it lies inside the start/end positions of the mapping
gene extended by 1kb i.e. between start - 1000
and end + 1000
.