Package 'gclink'

Title: Gene-Cluster Discovery, Annotation and Visualization
Description: Performs end-to-end analysis of gene clusters—such as photosynthesis, carbon/nitrogen/sulfur cycling, carotenoid, antibiotic, or viral marker genes (e.g., capsid, polymerase, integrase)—from genomes and metagenomes. It parses Basic Local Alignment Search Tool (BLAST) results in tab-delimited format produced by tools like NCBI BLAST+ and Diamond BLASTp, filters Open Reading Frames (ORFs) by length, detects contiguous clusters of reference genes, optionally extracts genomic coordinates, merges functional annotations, and generates publication-ready arrow plots. The package works seamlessly with or without the coding sequences input and skips plotting when no functional groups are found. For more details see Li et al. (2023) <doi:10.1038/s41467-023-42193-7>.
Authors: Liuyang Li [aut, cre] (ORCID: <https://orcid.org/0000-0001-6004-9437>)
Maintainer: Liuyang Li <[email protected]>
License: GPL-3
Version: 1.1
Built: 2026-06-05 06:37:33 UTC
Source: https://github.com/liuyanglee/gclink

Help Index


BLASTP Results for test Proteins

Description

A dataset containing BLASTP alignment results for proteins from test genomes of bacteria, including alignment metrics.

Usage

blastp_df

Format

A data frame with multiple rows and 12 variables:

qaccver

Character. Protein query identifier.

saccver

Character. Subject protein identifier from reference databases.

pident

Numeric. Percentage identity between query and subject sequences.

length

Numeric. Length of the aligned sequence region.

mismatch

Numeric. Number of mismatches in the alignment.

gapopen

Numeric. Number of gap openings in the alignment.

qstart

Numeric. Start position in query sequence.

qend

Numeric. End position in query sequence.

sstart

Numeric. Start position in subject sequence.

send

Numeric. End position in subject sequence.

evalue

Numeric. Expect value for the alignment.

bitscore

Numeric. Bit score for the alignment.


EggNOG Functional Annotation Results

Description

A dataset containing EggNOG functional annotation results, including taxonomic classification, functional categories, and domain information.

Usage

eggnog_df

Format

A data frame with multiple rows and 21 variables:

#query

Character. Protein query identifier (e.g., "Houyiibacteriaceae–LLY-WYZ-15_3—k141_356661_13").

seed_ortholog

Character. Subject protein identifier from EggNOG database (e.g., "1168065.DOK_14784").

evalue

Numeric. Expect value for the annotation.

score

Numeric. Annotation bitscore.

eggNOG_OGs

Character.

max_annot_lvl

Character.

COG_category

Character.

Description

Character. Functional description (e.g., "Bacterial regulatory proteins, tetR family").

Preferred_name

Character. Gene name if available (e.g., "cobB").

GOs

Character.

EC

Character. Enzyme Commission number if applicable (e.g., "2.7.7.49").

KEGG_ko

Character. KEGG Orthology identifier if available (e.g., "ko:K00986").

KEGG_Pathway

Character.

KEGG_Module

Character.

KEGG_Reaction

Character.

KEGG_rclass

Character.

BRITE

Character.

KEGG_TC

Character.

CAZy

Character.

BiGG_Reaction

Character.

PFAMs

Character.


Complete Gene Clusters by Adding Missing ORFs

Description

Expands gene cluster tables to include all ORFs (annotated and hypothetical) within contigs, normalizing cluster representations for downstream analysis and plotting. Ensures consistent ORF spacing/length across clusters by inserting missing rows.

Usage

gc_add(
  Data = sbgc,
  Annotation = bin_genes,
  orf_before_first = 0,
  orf_after_last = 0,
  orf_range = "All"
)

Arguments

Data

A data.frame of annotated ORFs with required columns:

  • qaccver: ORF identifier (genome—contig_orf_position format).

  • genome, contig: Genome and contig names.

  • gene: Gene symbol (NA for hypothetical ORFs).

  • orf_position: Absolute ORF position on the contig.

  • gene_cluster: Cluster identifier.

Annotation

A data.frame of full ORF annotations (e.g., from orf_extract). Must include qaccver and orf_position.

orf_before_first

Integer. Hypothetical ORFs to insert before the first annotated ORF in each cluster (bounded by contig start). Default: 0.

orf_after_last

Integer. Hypothetical ORFs to append after the last annotated ORF (bounded by contig end). Default: 0.

orf_range

Character. Controls ORF inclusion and annotation merging:

  • "All": Include every ORF in the contig range and merge all annotations (default).

  • "OnlyAnnotated": Keep only ORFs present in Annotation and merge their annotations.

  • "IgnoreAnnotated": Include all ORFs but skip merging with Annotation.

Details

  • Hypothetical ORFs are inserted as rows with gene = NA.

  • Output is always sorted by gene_cluster and orf_position.

  • Progress messages are printed to console with timestamps.

  • Contig bounds are respected—insertions never exceed actual ORF positions in Annotation.

Value

A data.frame with one row per ORF (real/hypothetical), sorted by gene_cluster and orf_position. Added columns:

GC_orf_position

Relative position within cluster (1-indexed).

GC_present_length

Count of annotated ORFs in the cluster.

GC_absent_length

Count of inserted hypothetical ORFs.

GC_length

Total ORFs (GC_present_length + GC_absent_length).


Identify and Extract Gene Clusters from Scaled BLAST Data

Description

This function screens contigs for regions that contain a pre-defined set of “reference” genes (e.g., photosynthetic genes, viral genes) arranged in a continuous block. Contigs are first coarsely filtered by the minimum number of reference genes they carry, then finely scanned for clusters that satisfy user- defined density and contiguity criteria. Each detected cluster is returned with a unique gene_cluster identifier.

Usage

gc_cal(
  Data = bin_genes,
  in_gene_list = photosynthesis_gene_list,
  AllGeneNum = 30,
  MinConSeq = 15
)

Arguments

Data

A data frame produced by orf_extract (i.e., a scaled BLAST table). Must include the columns genome_contig, gene, and orf_position.

in_gene_list

A character vector of “reference” gene symbols (e.g., photosynthesis_gene_list) that are expected to appear in the target cluster(s).

AllGeneNum

Integer. Maximum total ORF count (annotated plus hypothetical) that the algorithm is allowed to span when defining a cluster (default: 30).

MinConSeq

Integer. Minimum number of reference genes that must be present and consecutive within the candidate cluster (default: 15). Must satisfy 1 <= MinConSeq <= AllGeneNum.

Details

  1. Coarse filter: Contigs with fewer than MinConSeq reference genes are discarded.

  2. Fine scan: For each remaining contig, the algorithm slides a window that can encompass up to AllGeneNum consecutive ORFs and retains windows that contain at least MinConSeq reference genes in uninterrupted order.

  3. Cluster labelling: Each valid cluster receives a unique ID (genome_contig---1, genome_contig---2, …).

Value

A data frame identical in structure to Data but filtered to contain only those rows that belong to valid clusters. An extra column gene_cluster (format: genome_contig---N) is added to uniquely label every cluster.


Identify Breakpoints of Gene Clusters within a Contig

Description

Internal helper used by gc_cal. Given the ordered positions of reference genes on a contig, this function returns the genomic coordinates that mark the boundary of each candidate cluster. A boundary is declared whenever the gap between two successive reference genes exceeds the maximum spacing allowed by the cluster definition (AllGeneNum - MinConSeq).

Usage

gc_cluster(Data = orf_position.tmp, AllGeneNum = 30, MinConSeq = 15)

Arguments

Data

A numeric vector (ascending order) of ORF positions that carry one of the reference genes of interest. Usually the vector orf_position.tmp created inside gc_cal.

AllGeneNum

Integer. Maximum genomic span (in ORF count) that the algorithm is allowed to cover when defining a single cluster. Passed unchanged from gc_cal.

MinConSeq

Integer. Minimum number of consecutive reference genes required to form a cluster. Passed unchanged from gc_cal.

Details

  • If the gap between two consecutive reference genes is larger than AllGeneNum - MinConSeq, the left-hand gene is recorded as the last member of the preceding cluster.

  • The final reference gene is always appended to the vector as the last boundary, ensuring the rightmost cluster is not lost.

Value

A numeric vector containing every position that marks the end of a putative gene cluster. These values are subsequently used as breakpoints to slice the contig into candidate regions in the downstream functions gc_position() and gc_range().


Plot Scaled Gene Clusters with Arrows

Description

Generates a compact, publication-ready arrow plot of gene clusters previously processed with gc_scale. Each cluster is displayed on its own horizontal track, with gene directionality, labels, and user-defined colour schemes.

Usage

gc_plot(
  data = GC_meta,
  color_theme = c("#3BAA51", "#6495ED", "#DD2421", "#EF9320", "#F8EB00", "#FF0683",
    "#956548", "grey")
)

Arguments

data

A data frame produced by gc_scale, must include the columns Pstart, Pend, Pgenome, gene_group, Pdirection, and gene_label.

color_theme

Character vector of colours, in the same order as the factor levels of gene_group. Defaults to an 8-colour palette; supply fewer or more colours as needed.

Value

A ggplot object that can be further customised or printed. The plot uses gggenes::geom_gene_arrow() and gggenes::geom_gene_label() with the following elements:

  • One facet per Pgenome (cluster).

  • Genes coloured by gene_group.

  • Arrow direction encoded by Pdirection.

  • Optional gene labels inside arrows.


Extract ORF Positions for One Specific Gene Cluster

Description

Internal helper used by gc_cal. Given the ordered positions of all reference genes on a contig (Data) and the vector of cluster breakpoints produced by gc_cluster (Cluster), this function slices Data to return the positions that belong to the EachCluster-th cluster.

Usage

gc_position(
  Data = orf_position.tmp,
  Cluster = cluster.tmp,
  EachCluster = eachcluster
)

Arguments

Data

A numeric vector (ascending order) of ORF positions that carry one of the reference genes of interest. Usually the vector orf_position.tmp created inside gc_cal.

Cluster

Numeric vector returned by gc_cluster; each element marks the last position of a candidate cluster.

EachCluster

Integer index specifying which cluster to extract (1 = first cluster, 2 = second cluster, …).

Value

A numeric vector containing the ORF positions that constitute the requested gene cluster.


Determine ORF Range for a Candidate Gene Cluster

Description

Internal helper used by gc_cal. After gc_position has isolated the ORF positions belonging to a single cluster, this function validates and trims that range so that the final span (distance between the first and last retained ORF) does not exceed AllGeneNum. The goal is to retain the largest contiguous block that still satisfies the user-defined size limit.

Usage

gc_range(Norf_position = Norf_position, AllGeneNum = 20, MinConSeq = 10)

Arguments

Norf_position

Numeric vector of ORF positions (ascending) that belong to the current candidate cluster (output from gc_position).

AllGeneNum

Integer. Maximum allowed genomic span (in ORF count) for the final cluster.

MinConSeq

Integer. Minimum number of consecutive reference genes required for the cluster.

Details

  • For every reference gene in Norf_position, the function evaluates whether a window of at least MinConSeq consecutive reference genes centred on that gene can fit within AllGeneNum consecutive ORFs.

  • Genes that pass the test are collected in retain.site.

  • The minimal and maximal positions in retain.site are then used to slice the full ORF range, guaranteeing that the final cluster length <= AllGeneNum.

Value

A numeric vector containing the final ORF positions that define the validated gene cluster. If no valid block can be produced, the vector will be empty.


Scale Gene-Cluster Coordinates for Visualization

Description

Prepares a gene‐cluster annotation table for downstream plotting by converting absolute genomic coordinates into relative positions, ensuring that every cluster starts at 0 and is oriented consistently. Hypothetical ORFs (originally labeled with NA) and missing labels are replaced with placeholders, and factor levels are set as requested.

Usage

gc_scale(GC_meta = GC_meta, levels_gene_group = levels_gene_group)

Arguments

GC_meta

A data frame containing gene-cluster information. Must include the columns gene_cluster, gene_group, gene_label, start, end, and direction (numeric: 1 for forward, -1 for reverse).

levels_gene_group

Character vector specifying the desired factor levels for gene_group. Group names should appear in the order required for plotting legends.

Details

  • Absolute start/end values are not modified; scaled values are stored in new columns (Pstart, Pend).

  • Pgenome can be swapped for any unique identifier (e.g., Genome) downstream if each genome contains only one cluster.

Value

The input data frame with the following new or overwritten columns:

gene_label

Empty string ("") if originally NA.

gene_group

Set to "hypothetical ORF" if originally NA, then coerced to a factor using levels_gene_group.

Pgenome

Factor version of gene_cluster; levels follow the order of appearance in the data.

Pstart, Pend

Relative start and end coordinates (numeric) within each cluster, scaled so that the left-most gene starts at 0.

Pdirection

Logical vector: TRUE for forward, FALSE for reverse.


KEGG Orthology (KO) Group Classification

Description

A dataset mapping KEGG Orthology (KO) identifiers to their functional groups and gene symbols.

Usage

KO_group

Format

A data frame with multiple rows and 3 variables:

gene

Character. KEGG Orthology identifier (e.g., "K04035").

gene_group

Character. Functional group (e.g., "acsF").

gene_label

Character. Gene symbol (e.g., "acsF").


Remove Length Outliers from BLAST Results

Description

Filters BLAST hits by removing ORFs whose gene (protein) length is an outlier within the corresponding gene group, as defined by the inter-quartile range (IQR). Hits whose length falls outside the interval [Q1 - down_IQR * IQR, Q3 + up_IQR * IQR] are discarded.

Usage

length_filter(Data = bin_genes, down_IQR = 1.5, up_IQR = 1.5)

Arguments

Data

A data frame containing BLAST results. Must include the columns gene (gene symbol) and length (ORF length in amino acids).

down_IQR

Numeric multiplier applied to the IQR for the lower bound (default: 1.5).

up_IQR

Numeric multiplier applied to the IQR for the upper bound (default: 1.5).

Details

  • Filtering is performed within each gene group; outliers are determined independently for every gene symbol.

  • Progress messages report the number of rows before and after filtering.

  • Missing values in length are ignored when computing quantiles.

Value

The input data frame with outlier rows removed. The returned object is ungrouped regardless of the input grouping.


Extract ORF and Genome Information from BLAST or BLASTP Results

Description

This function parses BLASTP result tables to extract structured genome, contig, ORF, and gene information from the query and subject identifiers. It is designed for downstream analyses requiring explicit separation of genome, contig, and ORF identifiers from concatenated BLAST headers.

Usage

orf_extract(bin_genes = blastp_df)

Arguments

bin_genes

A data frame containing BLASTP results with at least 2 standard columns: qaccver, saccver. the column of qaccver should include both of the genome name and predicted contig name, which is concatenated by a separator "—". for example, for the qaccver "p__Myxococcota–c__Kuafubacteria–o__Kuafubacteriales–f__Kuafubacteriaceae–GCA_016703535.1—JADJBV010000001.1_150", the genome name is "p__Myxococcota–c__Kuafubacteria–o__Kuafubacteriales–f__Kuafubacteriaceae–GCA_016703535.1", the contig name is "JADJBV010000001.1", the orf name is "JADJBV010000001.1_150", and the orf_position is "150". the column of saccver must include the gene name and may include the gene information, which are concatenated by a separator "_". for example, for the saccver "bchC_Methyloversatilis_sp_RAC08_BSY238_2447_METR", the gene name is "bchC", the gene information is Methyloversatilis_sp_RAC08_BSY238_2447_METR that can help you understand the source of gene.

Value

The original data frame with six additional columns:

genome

Genome identifier extracted from qaccver.

contig

Contig identifier extracted from qaccver.

orf

Full ORF identifier extracted from qaccver.

genome_contig

Concatenated genome and contig IDs (genome—contig).

gene

Gene symbol extracted from saccver.

orf_position

Numeric ORF position extracted from the ORF identifier.


Parse ORF Coordinates from Prodigal FASTA Headers

Description

Extracts ORF identifiers, start/end positions and strand orientation directly from the FASTA headers produced by Prodigal. The resulting table is ready for downstream gene-cluster analyses.

Usage

orf_locate(in_seq_data = seq_data)

Arguments

in_seq_data

A data frame with two columns:

SeqName

ORF identifier (Prodigal format: ⁠>ORF_id # start # end # strand # ...⁠).

Sequence

ORF sequence.

Example: "Kuafubacteriaceae--GCA_016703535.1---JADJBV010000001.1_1 # 74 # 1018 # 1 # ..." Can be imported from Prodigal FASTA using:

seq_data <- Biostrings::readBStringSet("Prodigal.fasta",format="fasta", nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE) %>%
  data.frame(Sequence = .) %>%
  tibble::rownames_to_column("SeqName")

Value

A data frame


Photosynthesis Gene Classification Groups

Description

A dataset mapping photosynthesis-related genes to their functional groups and subunits.

Usage

PGC_group

Format

A data frame with multiple rows and 3 variables:

gene

Character. Gene identifier (e.g., "bchB").

gene_group

Character. Functional group (e.g., "bch").

gene_label

Character. Subunit designation (e.g., "B").


Photosynthesis Gene List

Description

A comprehensive list of genes involved in photosynthetic processes.

Usage

photosynthesis_gene_list

Format

A character vector containing gene identifiers:

Each element represents a photosynthesis-related gene symbol (e.g., "acsF", "bchB").


Genomic Sequence Data with Annotations

Description

A dataset containing DNA sequences from test bacteria with detailed annotation metadata. The first column combines multiple annotation elements separated by semicolons.

Usage

seq_data

Format

A data frame with multiple rows and 2 variables:

SeqName

Character. Combined annotation fields separated by semicolons, containing:

  • ID: Sequence identifier (e.g., "1_7")

  • partial: Completion status ("00" for complete, "01" for partial)

  • start_type: Translation initiation codon (e.g., "GTG", "ATG")

  • rbs_motif: Ribosome binding site motif (e.g., "GGAG/GAGG")

  • rbs_spacer: RBS spacer length (e.g., "5-10bp")

  • gc_cont: GC content (e.g., "0.673")

Sequence

Character. DNA sequence (when available) in FASTA format