gff_cleaner: clean up GFF files
Overview
The gff_cleaner utility performs various manipulations on a GFF
file to “clean” it.
Usage and options
General usage syntax:
gff_cleaner [OPTIONS] <file>.gff
Options:
--version
    Show program’s version number and exit
-h, --help
    Show the help message and exit
-o OUTPUT_GFF
    Name of output GFF file (default is <file>_clean.gff)
--prepend=PREPEND_STR
    String to prepend to seqname in first column
--clean
    Perform all the ‘cleaning’ manipulations on the input data (equivalent to specifying all of --clean-score, --clean-replace-attributes, --clean-exclude-attributes and --clean-group-sgds)
--clean-score
    Replace Anc_* and blanks in the score field with zeroes
--clean-replace-attributes
    Replace ID, Gene, Parent and Name attributes with the value of the SGD attribute, if present
--clean-exclude-attributes
    Remove the kaks, kaks2 and ncbi attributes (to remove arbitrary attributes, see the --remove-attribute=... option)
--clean-group-sgds
    Group features with the same SGD by adding unique numbers to the ID attributes; IDs will have the form CDS:<SGD>:<n> (where n is a unique number for a given SGD)
--report-duplicates
    Report duplicate SGD names and write list to <file>_duplicates.gff with line numbers, chromosome, start coordinate and strand.
--resolve-duplicates=MAPPING_FILE
    Resolve duplicate SGDs by matching against ‘best’ genes in the supplied mapping file; non-matching genes are discarded and written to <file>_discarded.gff.
--discard-unresolved
    Discard any unresolved duplicates, which are written to <file>_unresolved.gff.
--insert-missing=GENE_FILE
    Insert genes from the gene file with SGD names that don’t appear in the input GFF. If GENE_FILE is blank (the ‘=’ must still be present) then the mapping file supplied with the --resolve-duplicates option will be used instead.
--add-exon-ids
    For exon features without an ID attribute, construct and insert an ID of the form exon:<Parent>:<n> (where n is a unique number).
--add-missing-ids
    For features without an ID attribute, construct and insert a generated ID of the form <feature>:<Parent>:<n> (where n is a unique number).
--no-percent-encoding
    Convert encoded attributes to the correct characters in the output GFF. This may result in a non-canonical GFF that can’t be read correctly by this or other programs.
--remove-attribute=RM_ATTR
    Remove attribute RM_ATTR from the list of attributes for all records in the GFF file (can be specified multiple times)
--strict-attributes
    Remove attributes that don’t conform to the KEY=VALUE format
--debug
    Print debugging information
Output files
- <file>_clean.gff: ‘cleaned’ version of input
- <file>_duplicates.txt: list of duplicated SGD names and the lines they appear on in the input file, along with chromosome, start coordinate and strand
- <file>_discarded.gff: genes rejected by --resolve-duplicates
- <file>_unresolved.gff: unresolved duplicates rejected by --discard-unresolved
Usage recipe
The following steps outline the procedure for using the program, with each step being run on the output from the previous one:
Clean the chromosome names in the file by adding a prefix (`--prepend` option)
Creates a copy of the input file with the chromosome names updated with a specified prefix.
E.g. --prepend=chr will add chr to the start of each chromosome name in the file, which is useful if the chromosome is denoted by a number and needs the prefix for consistency with a mapping file.

Clean the GFF score and attribute data (`--clean` options)
The “clean” options perform the following operations:
--clean-score: the data in the score column is cleaned up by replacing Anc_* and blanks with ‘0’s.
The attribute field of the GFF can contain various semicolon-separated key-value pairs:
--clean-replace-attributes: if one of these is a non-blank SGD then the ID, Gene, Parent and Name values are updated to be the same as the SGD name.
--clean-exclude-attributes: attributes called kaks, kaks2 and ncbi are removed (n.b. to remove arbitrary attributes, use the more general --remove-attribute=... option).
If multiple features share the same SGD name then --clean-replace-attributes can result in them also sharing the same ID; to deal with this:
--clean-group-sgds: update the ID attribute to group neighbouring lines that have the same SGD (see SGD grouping below).
A single --clean can be specified which performs all these operations automatically.
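As a rough illustration of what the score and attribute cleaning amount to, the following Python sketch applies them to a single score value and a single set of attributes. The function names and the use of a dict for the attributes are assumptions made for this example and are not taken from gff_cleaner itself. In particular, the sketch only overwrites ID, Gene, Parent and Name when they are already present, which may differ from the program’s behaviour.

```python
# Illustrative sketch only: names and data structures are hypothetical,
# not gff_cleaner's actual implementation.

EXCLUDED = ("kaks", "kaks2", "ncbi")
REPLACED = ("ID", "Gene", "Parent", "Name")


def clean_score(score):
    """Return '0' for a blank score or one starting with 'Anc_'."""
    score = score.strip()
    if not score or score.startswith("Anc_"):
        return "0"
    return score


def clean_attributes(attributes):
    """Drop the excluded attributes, then copy a non-blank SGD value
    over the ID, Gene, Parent and Name attributes."""
    cleaned = {k: v for k, v in attributes.items() if k not in EXCLUDED}
    sgd = cleaned.get("SGD", "")
    if sgd:
        for key in REPLACED:
            if key in cleaned:  # assumption: only existing keys are updated
                cleaned[key] = sgd
    return cleaned
```

Because every feature with the same SGD ends up with the same ID after this step, the grouping step is needed afterwards to make the IDs unique again.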
Detect duplicate SGDs (`--report-duplicates` option)
Report duplicate SGD names found in the input file.
This option writes a list of the duplicates to a ‘duplicates’ file.
It also reports the number of ‘trivial’ duplicates, i.e. lines having the same SGD because they are part of the same gene.

Resolve duplicate SGDs using a mapping file (`--resolve-duplicates` option)
Attempt to resolve duplicates by referring to a list of “best” genes given in a mapping file. For each duplicated name the resolution procedure is:
- Find mapping gene(s) with the same name
- For each mapping gene, keep duplicates which match chromosome and strand, and which overlap with the start and end of the gene (see Overlap criteria below). For SGD groups the mapping gene must overlap the whole group for it to match; mapping genes and duplicates which don’t have matches are removed from the process.
- At the end of the matching procedure the duplication is resolved if there is one SGD (or SGD group) matched to one mapping gene. Otherwise the duplication remains unresolved.
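The procedure above can be sketched in Python as follows. This is a simplified reading of the description, not the program’s own code: the function name `resolve` is hypothetical, each duplicate is treated as a single feature (SGD groups are not handled), and the test for chromosome, strand and overlap is passed in as a caller-supplied `matches(dup, gene)` function.

```python
# Illustrative sketch only: `resolve` and its arguments are hypothetical.

def resolve(duplicates, mapping_genes, matches):
    """Return the single surviving duplicate, or None if unresolved.

    `duplicates` and `mapping_genes` are the features sharing one name;
    `matches(dup, gene)` returns True when they agree on chromosome and
    strand and satisfy the overlap criteria.
    """
    # Record every (duplicate, gene) pair that matches; anything that
    # never appears in a pair drops out of the process.
    pairs = [(i, j)
             for j, gene in enumerate(mapping_genes)
             for i, dup in enumerate(duplicates)
             if matches(dup, gene)]
    matched_dups = {i for i, _ in pairs}
    matched_genes = {j for _, j in pairs}
    # Resolved only when exactly one duplicate matches exactly one gene.
    if len(matched_dups) == 1 and len(matched_genes) == 1:
        return duplicates[matched_dups.pop()]
    return None
```

Returning None for every other outcome mirrors the text: whether no duplicate matches, or several do, the duplication simply stays unresolved.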
When duplicates are resolved, the non-matching duplicates are discarded; otherwise by default all unresolved duplicates are kept. However, if the --discard-unresolved option is also specified then all unresolved duplicates are removed before output; the --insert-missing option can then be used to add them back in.
Note that the --discard-unresolved option cannot get rid of ‘trivial’ duplicates (i.e. lines having the same SGD because they are part of the same gene).

Add missing genes (`--insert-missing` option)
Adds genes from a list of “best” genes given in a mapping file which have names not found in the input GFF.
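Deciding which genes count as missing is a simple membership test, sketched below in Python. The function name and arguments are hypothetical, and the sketch only identifies the candidate genes; how the program builds the GFF lines it inserts is not described here.

```python
# Illustrative sketch only: `missing_genes` is a hypothetical helper.

def missing_genes(gene_names, gff_sgd_names):
    """Return the names from the gene file that never occur as an SGD
    in the GFF, keeping the gene file's order."""
    present = set(gff_sgd_names)
    return [name for name in gene_names if name not in present]
```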
SGD grouping
As part of setting the ID attribute of GFF lines, the “clean” option
also attempts to group neighbouring lines which have the same SGD name.
The ID attribute is updated to the form:
ID=CDS:<sgd_name>:<i>
where <sgd_name> is a gene or transcript name (e.g. YEL0W) and
<i> is an integer index which starts from 1. Groupings are indicated
by subsequent lines having the same <sgd_name> but monotonically
increasing indices, for example:
chr1 Test CDS 34525 35262 0 - 0 ID=CDS:YEL0W:1;SGD=YEL0W
chr1 Test CDS 35823 37004 0 - 0 ID=CDS:YEL0W:2;SGD=YEL0W
chr1 Test CDS 38050 38120 0 - 0 ID=CDS:YEL0W:3;SGD=YEL0W
chr1 Test CDS 39195 39569 0 - 0 ID=CDS:YEL0W:4;SGD=YEL0W
When determining a grouping the program looks ahead from each line for subsequent lines (up to five) which have the same SGD value. So groupings can also accommodate “breaks”, for example:
chr1 Test CDS 34525 35262 0 - 0 ID=CDS:YEL0W:1;SGD=YEL0W
chr1 Test CDS 35823 37004 0 - 0 ID=CDS:YEL0W:2;SGD=YEL0W
chr1 Test CDS 38050 38120 0 - 0 ID=CDS:YEL0X:1;SGD=YEL0X
chr1 Test CDS 39195 39569 0 - 0 ID=CDS:YEL0W:3;SGD=YEL0W
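One way to picture the grouping is the Python sketch below, which works on the sequence of SGD names taken from successive lines. It is an interpretation of the description rather than the program’s own algorithm: the function name is hypothetical, the look-ahead is modelled as "the previous line with the same SGD is at most five lines back", and the index is assumed to keep counting per SGD even when a new group starts, so that IDs stay unique.

```python
# Illustrative sketch only: an interpretation of the grouping rule,
# not gff_cleaner's actual implementation.

def group_sgds(sgd_names, lookahead=5):
    """Return a list of (id, group_number) pairs, one per input line."""
    counters = {}    # SGD name -> how many lines have used it so far
    last_seen = {}   # SGD name -> (line index, group number)
    next_group = 0
    result = []
    for i, sgd in enumerate(sgd_names):
        counters[sgd] = counters.get(sgd, 0) + 1
        previous = last_seen.get(sgd)
        if previous is not None and i - previous[0] <= lookahead:
            group = previous[1]   # close enough: continue the group
        else:
            group = next_group    # too far away, or first occurrence
            next_group += 1
        last_seen[sgd] = (i, group)
        result.append(("CDS:%s:%d" % (sgd, counters[sgd]), group))
    return result


ids = [gff_id for gff_id, _ in group_sgds(["YEL0W", "YEL0W", "YEL0X", "YEL0W"])]
```

For the input with a “break” shown above, the intervening YEL0X line gets its own index sequence while the YEL0W lines before and after it stay in the same group.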
Mapping file format
The mapping file is a tab-delimited text file with lines of the form:
name chr start end strand
<name> is used to match against the SGD names in the input GFF file.
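A file in this format can be read with a few lines of Python, as in the sketch below. The names `MappingGene` and `read_mapping` are hypothetical, and the sketch assumes that start and end are integers, that there is no header line, and that blank, short or `#`-prefixed lines can be skipped.

```python
# Illustrative sketch only: a hypothetical reader for the mapping file.
import csv
import io
from collections import namedtuple

MappingGene = namedtuple("MappingGene", "name chrom start end strand")


def read_mapping(fp):
    """Parse tab-delimited 'name chr start end strand' lines."""
    genes = []
    for row in csv.reader(fp, delimiter="\t"):
        if len(row) < 5 or row[0].startswith("#"):
            continue  # assumption: skip blank, short or comment lines
        name, chrom, start, end, strand = row[:5]
        genes.append(MappingGene(name, chrom, int(start), int(end), strand))
    return genes


# Usage with an in-memory file; a real file would be opened with open().
example = io.StringIO("YEL0W\tchr1\t34525\t35262\t-\n")
genes = read_mapping(example)
```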
Overlap criteria
Aside from matching chromosome and strand, one of the criteria for a mapping gene to match a duplicate from the GFF file is that the two must overlap.
A duplicate from the GFF counts as overlapping if its start and end
positions lie inside the start/end positions of the mapping gene
extended by 1 kb at each end, i.e. between start - 1000 and end + 1000.
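Expressed as code, the overlap test is a pair of comparisons. The Python sketch below uses a hypothetical function name and assumes that the bounds are inclusive and that start is never greater than end; neither point is stated in the text.

```python
# Illustrative sketch only: inclusive bounds are an assumption.

def overlaps(dup_start, dup_end, gene_start, gene_end, margin=1000):
    """True if the duplicate lies entirely within the mapping gene's
    range extended by `margin` bases at each end."""
    return gene_start - margin <= dup_start and dup_end <= gene_end + margin
```

With a mapping gene at 35000 to 36000 the accepted window is 34000 to 37000, so a duplicate at 34525 to 35262 passes, while one starting at 33999 does not.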