| Title: | Gene-Cluster Discovery, Annotation and Visualization |
|---|---|
| Description: | Performs end-to-end analysis of gene clusters—such as photosynthesis, carbon/nitrogen/sulfur cycling, carotenoid, antibiotic, or viral marker genes (e.g., capsid, polymerase, integrase)—from genomes and metagenomes. It parses Basic Local Alignment Search Tool (BLAST) results in tab-delimited format produced by tools like NCBI BLAST+ and Diamond BLASTp, filters Open Reading Frames (ORFs) by length, detects contiguous clusters of reference genes, optionally extracts genomic coordinates, merges functional annotations, and generates publication-ready arrow plots. The package works seamlessly with or without the coding sequences input and skips plotting when no functional groups are found. For more details see Li et al. (2023) <doi:10.1038/s41467-023-42193-7>. |
| Authors: | Liuyang Li [aut, cre] (ORCID: <https://orcid.org/0000-0001-6004-9437>) |
| Maintainer: | Liuyang Li <[email protected]> |
| License: | GPL-3 |
| Version: | 1.1 |
| Built: | 2026-06-05 06:37:33 UTC |
| Source: | https://github.com/liuyanglee/gclink |
A dataset containing BLASTP alignment results for proteins from test genomes of bacteria, including alignment metrics.
blastp_dfblastp_df
A data frame with multiple rows and 12 variables:
Character. Protein query identifier.
Character. Subject protein identifier from reference databases.
Numeric. Percentage identity between query and subject sequences.
Numeric. Length of the aligned sequence region.
Numeric. Number of mismatches in the alignment.
Numeric. Number of gap openings in the alignment.
Numeric. Start position in query sequence.
Numeric. End position in query sequence.
Numeric. Start position in subject sequence.
Numeric. End position in subject sequence.
Numeric. Expect value for the alignment.
Numeric. Bit score for the alignment.
A dataset containing EggNOG functional annotation results, including taxonomic classification, functional categories, and domain information.
eggnog_dfeggnog_df
A data frame with multiple rows and 21 variables:
Character. Protein query identifier (e.g., "Houyiibacteriaceae–LLY-WYZ-15_3—k141_356661_13").
Character. Subject protein identifier from EggNOG database (e.g., "1168065.DOK_14784").
Numeric. Expect value for the annotation.
Numeric. Annotation bitscore.
Character.
Character.
Character.
Character. Functional description (e.g., "Bacterial regulatory proteins, tetR family").
Character. Gene name if available (e.g., "cobB").
Character.
Character. Enzyme Commission number if applicable (e.g., "2.7.7.49").
Character. KEGG Orthology identifier if available (e.g., "ko:K00986").
Character.
Character.
Character.
Character.
Character.
Character.
Character.
Character.
Character.
Expands gene cluster tables to include all ORFs (annotated and hypothetical) within contigs, normalizing cluster representations for downstream analysis and plotting. Ensures consistent ORF spacing/length across clusters by inserting missing rows.
gc_add( Data = sbgc, Annotation = bin_genes, orf_before_first = 0, orf_after_last = 0, orf_range = "All" )gc_add( Data = sbgc, Annotation = bin_genes, orf_before_first = 0, orf_after_last = 0, orf_range = "All" )
Data |
A
|
Annotation |
A |
orf_before_first |
Integer. Hypothetical ORFs to insert before the first annotated ORF
in each cluster (bounded by contig start). Default: |
orf_after_last |
Integer. Hypothetical ORFs to append after the last annotated ORF
(bounded by contig end). Default: |
orf_range |
Character. Controls ORF inclusion and annotation merging:
|
Hypothetical ORFs are inserted as rows with gene = NA.
Output is always sorted by gene_cluster and orf_position.
Progress messages are printed to console with timestamps.
Contig bounds are respected—insertions never exceed actual ORF positions in Annotation.
A data.frame with one row per ORF (real/hypothetical), sorted by gene_cluster and
orf_position. Added columns:
GC_orf_positionRelative position within cluster (1-indexed).
GC_present_lengthCount of annotated ORFs in the cluster.
GC_absent_lengthCount of inserted hypothetical ORFs.
GC_lengthTotal ORFs (GC_present_length + GC_absent_length).
This function screens contigs for regions that contain a
pre-defined set of “reference” genes (e.g., photosynthetic genes, viral genes)
arranged in a continuous block. Contigs are
first coarsely filtered by the minimum number of reference genes
they carry, then finely scanned for clusters that satisfy user-
defined density and contiguity criteria. Each detected cluster
is returned with a unique gene_cluster identifier.
gc_cal( Data = bin_genes, in_gene_list = photosynthesis_gene_list, AllGeneNum = 30, MinConSeq = 15 )gc_cal( Data = bin_genes, in_gene_list = photosynthesis_gene_list, AllGeneNum = 30, MinConSeq = 15 )
Data |
A data frame produced by |
in_gene_list |
A character vector of “reference” gene symbols (e.g.,
|
AllGeneNum |
Integer. Maximum total ORF count (annotated plus hypothetical) that the algorithm is allowed to span when defining a cluster (default: 30). |
MinConSeq |
Integer. Minimum number of reference genes that must be
present and consecutive within the candidate cluster
(default: 15). Must satisfy |
Coarse filter: Contigs with fewer than MinConSeq reference
genes are discarded.
Fine scan: For each remaining contig, the algorithm slides a
window that can encompass up to AllGeneNum consecutive ORFs
and retains windows that contain at least MinConSeq reference
genes in uninterrupted order.
Cluster labelling: Each valid cluster receives a unique ID
(genome_contig---1, genome_contig---2, …).
A data frame identical in structure to Data but filtered to
contain only those rows that belong to valid clusters. An extra
column gene_cluster (format: genome_contig---N) is added
to uniquely label every cluster.
Internal helper used by gc_cal.
Given the ordered positions of reference genes on a contig,
this function returns the genomic coordinates that mark the
boundary of each candidate cluster.
A boundary is declared whenever the gap between two successive
reference genes exceeds the maximum spacing allowed by the
cluster definition (AllGeneNum - MinConSeq).
gc_cluster(Data = orf_position.tmp, AllGeneNum = 30, MinConSeq = 15)gc_cluster(Data = orf_position.tmp, AllGeneNum = 30, MinConSeq = 15)
Data |
A numeric vector (ascending order) of ORF positions that carry
one of the reference genes of interest. Usually the vector
|
AllGeneNum |
Integer. Maximum genomic span (in ORF count) that the
algorithm is allowed to cover when defining a single
cluster. Passed unchanged from |
MinConSeq |
Integer. Minimum number of consecutive reference genes
required to form a cluster. Passed unchanged from
|
If the gap between two consecutive reference genes is larger than
AllGeneNum - MinConSeq, the left-hand gene is recorded as the
last member of the preceding cluster.
The final reference gene is always appended to the vector as the last boundary, ensuring the rightmost cluster is not lost.
A numeric vector containing every position that marks the end
of a putative gene cluster. These values are subsequently used as
breakpoints to slice the contig into candidate regions in the
downstream functions gc_position() and gc_range().
Generates a compact, publication-ready arrow plot of gene
clusters previously processed with gc_scale.
Each cluster is displayed on its own horizontal track, with
gene directionality, labels, and user-defined colour schemes.
gc_plot( data = GC_meta, color_theme = c("#3BAA51", "#6495ED", "#DD2421", "#EF9320", "#F8EB00", "#FF0683", "#956548", "grey") )gc_plot( data = GC_meta, color_theme = c("#3BAA51", "#6495ED", "#DD2421", "#EF9320", "#F8EB00", "#FF0683", "#956548", "grey") )
data |
A data frame produced by |
color_theme |
Character vector of colours, in the same order as the
factor levels of |
A ggplot object that can be further customised or printed.
The plot uses gggenes::geom_gene_arrow() and
gggenes::geom_gene_label() with the following elements:
One facet per Pgenome (cluster).
Genes coloured by gene_group.
Arrow direction encoded by Pdirection.
Optional gene labels inside arrows.
Internal helper used by gc_cal.
Given the ordered positions of all reference genes on a contig
(Data) and the vector of cluster breakpoints produced by
gc_cluster (Cluster), this function slices
Data to return the positions that belong to the
EachCluster-th cluster.
gc_position( Data = orf_position.tmp, Cluster = cluster.tmp, EachCluster = eachcluster )gc_position( Data = orf_position.tmp, Cluster = cluster.tmp, EachCluster = eachcluster )
Data |
A numeric vector (ascending order) of ORF positions that carry
one of the reference genes of interest. Usually the vector
|
Cluster |
Numeric vector returned by |
EachCluster |
Integer index specifying which cluster to extract (1 = first cluster, 2 = second cluster, …). |
A numeric vector containing the ORF positions that constitute the requested gene cluster.
Internal helper used by gc_cal.
After gc_position has isolated the ORF positions
belonging to a single cluster, this function validates and
trims that range so that the final span (distance between the
first and last retained ORF) does not exceed AllGeneNum.
The goal is to retain the largest contiguous block that still
satisfies the user-defined size limit.
gc_range(Norf_position = Norf_position, AllGeneNum = 20, MinConSeq = 10)gc_range(Norf_position = Norf_position, AllGeneNum = 20, MinConSeq = 10)
Norf_position |
Numeric vector of ORF positions (ascending) that belong
to the current candidate cluster (output from
|
AllGeneNum |
Integer. Maximum allowed genomic span (in ORF count) for the final cluster. |
MinConSeq |
Integer. Minimum number of consecutive reference genes required for the cluster. |
For every reference gene in Norf_position, the function evaluates
whether a window of at least MinConSeq consecutive reference genes
centred on that gene can fit within AllGeneNum consecutive ORFs.
Genes that pass the test are collected in retain.site.
The minimal and maximal positions in retain.site are then
used to slice the full ORF range, guaranteeing that the final cluster
length <= AllGeneNum.
A numeric vector containing the final ORF positions that define the validated gene cluster. If no valid block can be produced, the vector will be empty.
Prepares a gene‐cluster annotation table for downstream plotting by
converting absolute genomic coordinates into relative positions,
ensuring that every cluster starts at 0 and is oriented consistently.
Hypothetical ORFs (originally labeled with NA) and missing
labels are replaced with placeholders, and factor levels are set
as requested.
gc_scale(GC_meta = GC_meta, levels_gene_group = levels_gene_group)gc_scale(GC_meta = GC_meta, levels_gene_group = levels_gene_group)
GC_meta |
A data frame containing gene-cluster information. Must include
the columns |
levels_gene_group |
Character vector specifying the desired factor levels
for |
Absolute start/end values are not modified; scaled
values are stored in new columns (Pstart, Pend).
Pgenome can be swapped for any unique identifier (e.g., Genome)
downstream if each genome contains only one cluster.
The input data frame with the following new or overwritten columns:
gene_labelEmpty string ("") if originally NA.
gene_groupSet to "hypothetical ORF" if originally
NA, then coerced to a factor using
levels_gene_group.
PgenomeFactor version of gene_cluster; levels
follow the order of appearance in the data.
Pstart, Pend
Relative start and end coordinates (numeric) within each cluster, scaled so that the left-most gene starts at 0.
PdirectionLogical vector: TRUE for forward,
FALSE for reverse.
gclink() performs all steps from raw blastp output to a publication-ready
gene-cluster plot in one call. It dynamically adapts its workflow:
If in_seq_data is provided: Extracts coordinates/sequences, merges data, and generates plots.
If in_seq_data is NULL: Skips sequence-dependent steps, returning only the cluster table.
If in_GC_group is NULL: Omits functional-group merging and plotting.
Supports custom gene lists (e.g., photosynthesis, viral genes) via in_gene_list for universal
cluster detection in bacteria, archaea, or phages.
Gene clusters are identified via a density threshold (AllGeneNum and MinConSeq) due to
incomplete gene annotation coverage.
gclink( in_blastp_df = blastp_df, in_seq_data = seq_data, in_gene_list = photosynthesis_gene_list, in_GC_group = PGC_group, AllGeneNum = 50, MinConSeq = 25, apply_length_filter = FALSE, down_IQR = 10, up_IQR = 10, apply_evalue_filter = FALSE, min_evalue = 1, apply_score_filter = FALSE, min_score = 10, orf_before_first = 0, orf_after_last = 0, orf_range = "All", levels_gene_group = c("bch", "puh", "puf", "crt", "acsF", "assembly", "regulator", "hypothetical ORF"), color_theme = c("#3BAA51", "#6495ED", "#DD2421", "#EF9320", "#F8EB00", "#FF0683", "#956548", "grey"), genome_subset = NULL )gclink( in_blastp_df = blastp_df, in_seq_data = seq_data, in_gene_list = photosynthesis_gene_list, in_GC_group = PGC_group, AllGeneNum = 50, MinConSeq = 25, apply_length_filter = FALSE, down_IQR = 10, up_IQR = 10, apply_evalue_filter = FALSE, min_evalue = 1, apply_score_filter = FALSE, min_score = 10, orf_before_first = 0, orf_after_last = 0, orf_range = "All", levels_gene_group = c("bch", "puh", "puf", "crt", "acsF", "assembly", "regulator", "hypothetical ORF"), color_theme = c("#3BAA51", "#6495ED", "#DD2421", "#EF9320", "#F8EB00", "#FF0683", "#956548", "grey"), genome_subset = NULL )
in_blastp_df |
A
eggNOG (evolutionary gene genealogy Nonsupervised Orthologous Groups) format results
are supported by renaming annotation columns (e.g., |
in_seq_data |
Example:
seq_data <- Biostrings::readBStringSet("Prodigal.fasta",format="fasta", nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE) %>%
data.frame(Sequence = .) %>%
tibble::rownames_to_column("SeqName")
|
in_gene_list |
Character vector of reference genes for cluster detection.
Default: |
in_GC_group |
|
AllGeneNum |
Integer; max ORFs per cluster. Default: |
MinConSeq |
Integer; min consecutive reference genes per cluster. Default: |
apply_length_filter |
Logical; whether to apply length filtering based on IQR.
If |
down_IQR |
Numeric; lower-bound scale factor for IQR length
filtering (see |
up_IQR |
Numeric; upper-bound scale factor for IQR length
filtering (see |
apply_evalue_filter |
Logical; whether to filter BLAST hits by e-value cutoff.
Requires column |
min_evalue |
Numeric; maximum e-value threshold for retaining BLAST hits.
Hits with e-value > cutoff will be removed.
Typical values: 1e-3 (loose) to 1e-10 (strict).
Default: |
apply_score_filter |
Logical; whether to filter BLAST hits by bitscore cutoff.
Requires column |
min_score |
Numeric; minimum bitscore threshold for retaining BLAST hits.
Hits with bitscore < cutoff will be removed.
For short proteins (~100 aa), 30-40 is typical;
for long proteins (>300 aa), 50-100 is recommended.
Default: |
orf_before_first |
Integer; number of hypothetical ORFs to insert
before the first gene of each detected cluster.
Useful when the upstream annotation is incomplete or
low-confidence; insertion is bounded by the first ORF
of the contig. Default: |
orf_after_last |
Integer; number of hypothetical ORFs to append
after the last gene of each detected cluster.
Helpful when the downstream annotation is incomplete
or low-confidence; insertion is bounded by the last
ORF of the contig. Default: |
orf_range |
Character. ORF inclusion mode:
|
levels_gene_group |
Character vector; factor levels for gene-group
legends (must include "hypothetical ORF" in case
some genes remain unclassified).
Ignored if |
color_theme |
Character vector; colours for |
genome_subset |
Character vector or |
A named list with:
GC_metaAnnotated cluster table (data.frame).
GC_seqFASTA sequences (if in_seq_data provided).
GC_plotggplot object (if in_seq_data and in_GC_group provided).
# Case 1: Using blastp result with Full pipeline (Find Cluster + Extract FASTA + Plot Cluster) data(blastp_df) data(seq_data) data(photosynthesis_gene_list) data(PGC_group) gc_list <- gclink(in_blastp_df = blastp_df, in_seq_data = seq_data, in_gene_list = photosynthesis_gene_list, in_GC_group = PGC_group, AllGeneNum = 50, MinConSeq = 25, apply_length_filter = TRUE, down_IQR = 10, up_IQR = 10, orf_before_first = 0, orf_after_last = 0, levels_gene_group = c('bch','puh','puf','crt','acsF','assembly','regulator', 'hypothetical ORF'), color_theme = c('#3BAA51','#6495ED','#DD2421','#EF9320','#F8EB00', '#FF0683','#956548','grey'), genome_subset = NULL) gc_meta = gc_list[["GC_meta"]] gc_seq = gc_list[["GC_seq"]] gc_plot = gc_list[["GC_plot"]] head(gc_meta) head(gc_seq) print(gc_plot) # Case 2: Using eggNOG result with Full pipeline (Find Cluster + Extract FASTA + Plot Cluster) data(eggnog_df) data(seq_data) data(KO_group) KOs = c("K02291","K09844","K20611","K13789", "K09846","K08926","K08927","K08928", "K08929","K13991","K04035","K04039", "K11337","K03404","K11336","K04040", "K03403","K03405","K04037","K03428", "K04038","K06049","K10960","K11333", "K11334","K11335","K08226","K08226", "K09773") rename_KOs = paste0("ko:", KOs) eggnog_df$qaccver = eggnog_df$`#query` eggnog_df$saccver = eggnog_df$KEGG_ko eggnog_df$evalue = eggnog_df$evalue eggnog_df$bitscore = eggnog_df$score eggnog_df$gene = eggnog_df$KEGG_ko gc_list_2 = gclink(in_blastp_df = eggnog_df, in_seq_data = seq_data, in_gene_list = rename_KOs, in_GC_group = KO_group, AllGeneNum = 50, MinConSeq = 25, apply_evalue_filter = FALSE, min_evalue = 1, apply_score_filter = TRUE, min_score = 10, orf_before_first = 1, orf_after_last = 1, levels_gene_group = c('bch','puh','puf','crt', 'acsF','assembly','hypothetical ORF'), color_theme = c('#3BAA51','#6495ED','#DD2421','#EF9320', '#F8EB00','#FF0683','grey')) gc_meta_2 = gc_list_2[["GC_meta"]] gc_seq_2 = gc_list_2[["GC_seq"]] gc_plot_2 = gc_list_2[["GC_plot"]] head(gc_meta_2) head(gc_seq_2) print(gc_plot_2) # SOLUTION FOR "Viewport has zero dimension(s)" ERROR in print(gc_plot) # -------------------------------------------------- # When you see this error, try these 3 solutions: # 1. RESIZE RSTUDIO PLOT PANEL (Quickest fix) # Simply click and drag the right border of the plot panel wider # 2. LAUNCH IN NEW WINDOW # dev.new() # print(gc_plot) # 3. SAVE DIRECTLY TO FILE # ggplot2::ggsave('gc_plot.pdf', gc_plot, width=20,height=3)# Case 1: Using blastp result with Full pipeline (Find Cluster + Extract FASTA + Plot Cluster) data(blastp_df) data(seq_data) data(photosynthesis_gene_list) data(PGC_group) gc_list <- gclink(in_blastp_df = blastp_df, in_seq_data = seq_data, in_gene_list = photosynthesis_gene_list, in_GC_group = PGC_group, AllGeneNum = 50, MinConSeq = 25, apply_length_filter = TRUE, down_IQR = 10, up_IQR = 10, orf_before_first = 0, orf_after_last = 0, levels_gene_group = c('bch','puh','puf','crt','acsF','assembly','regulator', 'hypothetical ORF'), color_theme = c('#3BAA51','#6495ED','#DD2421','#EF9320','#F8EB00', '#FF0683','#956548','grey'), genome_subset = NULL) gc_meta = gc_list[["GC_meta"]] gc_seq = gc_list[["GC_seq"]] gc_plot = gc_list[["GC_plot"]] head(gc_meta) head(gc_seq) print(gc_plot) # Case 2: Using eggNOG result with Full pipeline (Find Cluster + Extract FASTA + Plot Cluster) data(eggnog_df) data(seq_data) data(KO_group) KOs = c("K02291","K09844","K20611","K13789", "K09846","K08926","K08927","K08928", "K08929","K13991","K04035","K04039", "K11337","K03404","K11336","K04040", "K03403","K03405","K04037","K03428", "K04038","K06049","K10960","K11333", "K11334","K11335","K08226","K08226", "K09773") rename_KOs = paste0("ko:", KOs) eggnog_df$qaccver = eggnog_df$`#query` eggnog_df$saccver = eggnog_df$KEGG_ko eggnog_df$evalue = eggnog_df$evalue eggnog_df$bitscore = eggnog_df$score eggnog_df$gene = eggnog_df$KEGG_ko gc_list_2 = gclink(in_blastp_df = eggnog_df, in_seq_data = seq_data, in_gene_list = rename_KOs, in_GC_group = KO_group, AllGeneNum = 50, MinConSeq = 25, apply_evalue_filter = FALSE, min_evalue = 1, apply_score_filter = TRUE, min_score = 10, orf_before_first = 1, orf_after_last = 1, levels_gene_group = c('bch','puh','puf','crt', 'acsF','assembly','hypothetical ORF'), color_theme = c('#3BAA51','#6495ED','#DD2421','#EF9320', '#F8EB00','#FF0683','grey')) gc_meta_2 = gc_list_2[["GC_meta"]] gc_seq_2 = gc_list_2[["GC_seq"]] gc_plot_2 = gc_list_2[["GC_plot"]] head(gc_meta_2) head(gc_seq_2) print(gc_plot_2) # SOLUTION FOR "Viewport has zero dimension(s)" ERROR in print(gc_plot) # -------------------------------------------------- # When you see this error, try these 3 solutions: # 1. RESIZE RSTUDIO PLOT PANEL (Quickest fix) # Simply click and drag the right border of the plot panel wider # 2. LAUNCH IN NEW WINDOW # dev.new() # print(gc_plot) # 3. SAVE DIRECTLY TO FILE # ggplot2::ggsave('gc_plot.pdf', gc_plot, width=20,height=3)
A dataset mapping KEGG Orthology (KO) identifiers to their functional groups and gene symbols.
KO_groupKO_group
A data frame with multiple rows and 3 variables:
Character. KEGG Orthology identifier (e.g., "K04035").
Character. Functional group (e.g., "acsF").
Character. Gene symbol (e.g., "acsF").
Filters BLAST hits by removing ORFs whose gene (protein) length
is an outlier within the corresponding gene group, as defined
by the inter-quartile range (IQR). Hits whose length falls
outside the interval
[Q1 - down_IQR * IQR, Q3 + up_IQR * IQR] are discarded.
length_filter(Data = bin_genes, down_IQR = 1.5, up_IQR = 1.5)length_filter(Data = bin_genes, down_IQR = 1.5, up_IQR = 1.5)
Data |
A data frame containing BLAST results. Must include the columns
|
down_IQR |
Numeric multiplier applied to the IQR for the lower bound (default: 1.5). |
up_IQR |
Numeric multiplier applied to the IQR for the upper bound (default: 1.5). |
Filtering is performed within each gene group; outliers are determined independently for every gene symbol.
Progress messages report the number of rows before and after filtering.
Missing values in length are ignored when computing quantiles.
The input data frame with outlier rows removed. The returned object is ungrouped regardless of the input grouping.
This function parses BLASTP result tables to extract structured genome, contig, ORF, and gene information from the query and subject identifiers. It is designed for downstream analyses requiring explicit separation of genome, contig, and ORF identifiers from concatenated BLAST headers.
orf_extract(bin_genes = blastp_df)orf_extract(bin_genes = blastp_df)
bin_genes |
A data frame containing BLASTP results with at least 2 standard columns:
|
The original data frame with six additional columns:
genomeGenome identifier extracted from qaccver.
contigContig identifier extracted from qaccver.
orfFull ORF identifier extracted from qaccver.
genome_contigConcatenated genome and contig IDs (genome—contig).
geneGene symbol extracted from saccver.
orf_positionNumeric ORF position extracted from the ORF identifier.
Extracts ORF identifiers, start/end positions and strand orientation directly from the FASTA headers produced by Prodigal. The resulting table is ready for downstream gene-cluster analyses.
orf_locate(in_seq_data = seq_data)orf_locate(in_seq_data = seq_data)
in_seq_data |
A data frame with two columns:
Example:
seq_data <- Biostrings::readBStringSet("Prodigal.fasta",format="fasta", nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE) %>%
data.frame(Sequence = .) %>%
tibble::rownames_to_column("SeqName")
|
A data frame
A dataset mapping photosynthesis-related genes to their functional groups and subunits.
PGC_groupPGC_group
A data frame with multiple rows and 3 variables:
Character. Gene identifier (e.g., "bchB").
Character. Functional group (e.g., "bch").
Character. Subunit designation (e.g., "B").
A comprehensive list of genes involved in photosynthetic processes.
photosynthesis_gene_listphotosynthesis_gene_list
A character vector containing gene identifiers:
Each element represents a photosynthesis-related gene symbol (e.g., "acsF", "bchB").
A dataset containing DNA sequences from test bacteria with detailed annotation metadata. The first column combines multiple annotation elements separated by semicolons.
seq_dataseq_data
A data frame with multiple rows and 2 variables:
Character. Combined annotation fields separated by semicolons, containing:
ID: Sequence identifier (e.g., "1_7")
partial: Completion status ("00" for complete, "01" for partial)
start_type: Translation initiation codon (e.g., "GTG", "ATG")
rbs_motif: Ribosome binding site motif (e.g., "GGAG/GAGG")
rbs_spacer: RBS spacer length (e.g., "5-10bp")
gc_cont: GC content (e.g., "0.673")
Character. DNA sequence (when available) in FASTA format