| Title: | Multi-Objective Clustering Algorithm Guided by a-Priori Biological Knowledge |
|---|---|
| Description: | Implements the Multi-Objective Clustering Algorithm Guided by a-Priori Biological Knowledge ('MOC-GaPBK') proposed by Parraga-Alava and others (2018) <doi:10.1186/s13040-018-0178-4>. The algorithm performs gene clustering using 'NSGA-II' as the underlying multi-objective evolutionary engine, together with Path-Relinking and Pareto Local Search as intensification and diversification strategies. Two versions of the Xie-Beni validity index are used as objective functions, one per distance matrix, so that prior biological knowledge can be incorporated through the second matrix. |
| Authors: | Jorge Parraga-Alava [aut, cre, cph] (ORCID: <https://orcid.org/0000-0001-8558-9122>), Marcio Dorn [aut], Mario Inostroza-Ponta [aut] |
| Maintainer: | Jorge Parraga-Alava <[email protected]> |
| License: | GPL-2 |
| Version: | 0.3.0 |
| Built: | 2026-05-15 19:27:46 UTC |
| Source: | https://github.com/jorgeklz/package-moc.gapbk |
A simulated gene expression dataset designed to illustrate the
multi-objective clustering capabilities of moc.gapbk.
The dataset reproduces the typical structure found in real gene
expression studies coupled with a-priori biological knowledge from
the Gene Ontology (GO).
geneexpr
A named list with three elements:
- expression: A numeric matrix of 100 rows (genes) and 20 columns (experimental conditions). Each gene belongs to one of five biological processes and follows a noisy version of that process's prototype expression pattern.
- go_sim: A symmetric 100 by 100 matrix of simulated Gene Ontology semantic similarity values in [0, 1]. Genes within the same biological process exhibit high similarity (~0.85); genes across processes exhibit low similarity (~0.10).
- process: A factor of length 100 with five levels (p1 through p5) encoding the true biological process membership of each gene. This is the ground truth and should be used only for evaluation, not for clustering.
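Since the process labels are reserved for evaluation, one way to use them is to score a partition against them after clustering. The following is a minimal sketch in base R of the plain Rand index; `rand_index` is a hypothetical helper (not part of the package), and the toy vectors stand in for the process factor and one clustering solution.

```r
# Hypothetical helper (not part of moc.gapbk): plain Rand index between
# two labelings of the same objects, using base R only.
rand_index <- function(truth, pred) {
  stopifnot(length(truth) == length(pred))
  truth <- as.character(truth)
  pred <- as.character(pred)
  # For every pair of objects, record whether each labeling puts the
  # pair in the same group.
  same_truth <- outer(truth, truth, "==")
  same_pred <- outer(pred, pred, "==")
  pairs <- upper.tri(same_truth)
  # Fraction of object pairs on which the two labelings agree.
  mean(same_truth[pairs] == same_pred[pairs])
}

# Toy stand-ins for the ground-truth factor and one clustering solution.
truth <- factor(rep(c("p1", "p2", "p3"), each = 4))
perfect <- rep(c(3L, 1L, 2L), each = 4)  # same groups, different label names
noisy <- perfect
noisy[1] <- 1L                           # move one object to another cluster

rand_index(truth, perfect)
rand_index(truth, noisy)
```

The index only compares co-membership of pairs, so it is insensitive to how the clusters happen to be numbered.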
The dataset was generated by data-raw/generate_geneexpr.R
using set.seed(2026). Each of the five biological processes
has a distinct temporal expression prototype (sinusoids, cosinusoids,
and a linear trend across 20 conditions), and the 20 genes belonging
to each process are sampled around the prototype with Gaussian noise
(sigma = 0.5).
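The design described above can be sketched as follows. This is an illustrative re-creation, not the bundled generation script: the exact prototype shapes (frequencies, slope) and the gene names are assumptions, so the values will not match the shipped dataset.

```r
# Illustrative re-creation of the described design; NOT the package's
# data-raw script. Prototype shapes below are assumptions.
set.seed(2026)
n_cond <- 20
genes_per_process <- 20
time_grid <- seq(0, 2 * pi, length.out = n_cond)

# Five assumed prototypes: two sinusoids, two cosinusoids, one linear trend.
prototypes <- rbind(
  p1 = sin(time_grid),
  p2 = sin(2 * time_grid),
  p3 = cos(time_grid),
  p4 = cos(2 * time_grid),
  p5 = seq(-1, 1, length.out = n_cond)
)

# Each gene is its process prototype plus Gaussian noise (sigma = 0.5).
process <- factor(rep(rownames(prototypes), each = genes_per_process))
noise <- matrix(rnorm(length(process) * n_cond, sd = 0.5), ncol = n_cond)
expression <- prototypes[as.character(process), ] + noise
rownames(expression) <- paste0("gene", seq_along(process))

dim(expression)
table(process)
```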
The go_sim matrix simulates the kind of semantic similarity
produced by tools such as GOSemSim, where genes annotated to the
same GO terms share high pairwise similarity. To use it as a
distance, convert it via 1 - go_sim.
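A small sketch of that conversion on a toy matrix may help. The block values (0.85 within a process, 0.10 across) mimic the description of go_sim but are assumptions, and `go_sim_toy` and `d_go_toy` are illustrative names only.

```r
# Toy block-structured similarity matrix (assumed values that mimic the
# description: about 0.85 within a process, about 0.10 across).
process <- rep(c("p1", "p2", "p3"), each = 4)
same <- outer(process, process, "==")
go_sim_toy <- ifelse(same, 0.85, 0.10)
diag(go_sim_toy) <- 1  # a gene is fully similar to itself

# Similarity in [0, 1] becomes a dissimilarity in [0, 1].
d_go_toy <- 1 - go_sim_toy

# Basic sanity checks before using the matrix as a distance.
isSymmetric(d_go_toy)
all(diag(d_go_toy) == 0)
range(d_go_toy)
```

Checking symmetry and a zero diagonal up front is cheap and catches similarity matrices that were passed on without conversion.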
Typical workflow with moc.gapbk:
    data(geneexpr)
    d_expr <- as.matrix(stats::dist(geneexpr$expression, method = "euclidean"))
    d_go <- 1 - geneexpr$go_sim
    res <- moc.gapbk(d_expr, d_go, num_k = 5)
Synthetic data generated by the script
data-raw/generate_geneexpr.R bundled with the package.
    data(geneexpr)
    str(geneexpr)

    # The five biological processes
    table(geneexpr$process)

    # Build the two distance matrices required by moc.gapbk
    d_expr <- as.matrix(stats::dist(geneexpr$expression, method = "euclidean"))
    d_go <- 1 - geneexpr$go_sim

    # Quick clustering (low parameters for the example)
    set.seed(42)
    res <- moc.gapbk(d_expr, d_go, num_k = 5, generation = 5, pop_size = 6)
    head(res$matrix.solutions)
Performs the MOC-GaPBK algorithm proposed by Parraga-Alava and others (2018). It receives two distance matrices and returns a set of non-dominated clustering solutions.
    moc.gapbk(
      dmatrix1,
      dmatrix2,
      num_k,
      generation = 50,
      pop_size = 10,
      rat_cross = 0.8,
      rat_muta = 0.01,
      tour_size = 2,
      neighborhood = 0.1,
      local_search = FALSE,
      cores = 2
    )

    moc.gabk(...)
| Argument | Description |
|---|---|
| dmatrix1 | A square distance matrix. Must have the same dimensions as dmatrix2. |
| dmatrix2 | A square distance matrix. Must have the same dimensions as dmatrix1. |
| num_k | The number of clusters. |
| generation | Number of generations to be performed. Default 50. |
| pop_size | Size of the population. Default 10. |
| rat_cross | Probability of crossover. Default 0.80. |
| rat_muta | Probability of mutation. Default 0.01. |
| tour_size | Size of the tournament for parent selection. Default 2. |
| neighborhood | Percentage of neighborhood used by Pareto Local Search. A real value between 0 and 1. The neighborhood size is computed as |
| local_search | Logical. If |
| cores | Number of cores used by Path-Relinking. Default 2. |
| ... | Arguments passed to moc.gapbk. |
MOC-GaPBK couples NSGA-II with Path-Relinking and Pareto Local Search. Two versions of the Xie-Beni validity index are used as objectives, one per distance matrix.
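To make the two-objective idea concrete, here is a minimal sketch of a crisp, medoid-based Xie-Beni index evaluated once per distance matrix. The helper `xie_beni` and its exact formula (sum of squared distances to the nearest medoid, divided by the number of objects times the smallest squared medoid-to-medoid distance) are assumptions for illustration; the package's internal formulation may differ.

```r
# Hypothetical helper (not the package's internal code): a crisp,
# medoid-based Xie-Beni index computed from a distance matrix.
xie_beni <- function(dmat, medoids) {
  stopifnot(is.matrix(dmat), nrow(dmat) == ncol(dmat), length(medoids) >= 2)
  # Distance from every object to every medoid (n x k).
  to_medoids <- dmat[, medoids, drop = FALSE]
  # Compactness: squared distance of each object to its nearest medoid.
  compactness <- sum(apply(to_medoids, 1, min)^2)
  # Separation: smallest squared distance between two distinct medoids.
  between <- dmat[medoids, medoids]
  separation <- min(between[upper.tri(between)])^2
  compactness / (nrow(dmat) * separation)
}

# Two well-separated toy groups of 10 points each.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 6), ncol = 2))
d1 <- as.matrix(dist(x, method = "euclidean"))
d2 <- as.matrix(dist(x, method = "manhattan"))

# One candidate solution (a set of medoid indices) yields two objective
# values, one per distance matrix; lower is better for both.
good <- c(1, 11)  # one medoid in each group
poor <- c(1, 2)   # both medoids in the same group
c(f1 = xie_beni(d1, good), f2 = xie_beni(d2, good))
c(f1 = xie_beni(d1, poor), f2 = xie_beni(d2, poor))
```

Because the same medoids are scored under both matrices, a solution that looks compact in expression space but scatters biologically related genes is penalized by the second objective.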
moc.gabk (note the missing p) is a deprecated alias kept
for backward compatibility with versions 0.1.x. New code should call
moc.gapbk directly.
A named list with three elements:
- population: A data frame containing the final population of medoids together with the values of the two objective functions, the Pareto ranking and the crowding distance, ordered accordingly.
- matrix.solutions: A data frame whose columns are clustering solutions on the Pareto front. Each row corresponds to an object and each cell to its assigned cluster.
- clustering: A list of named integer vectors. Element i is the partition produced by the i-th solution on the Pareto front.
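For readers unfamiliar with the term, the sketch below shows what membership of the Pareto front means for two minimized objectives. `non_dominated` is a hypothetical helper, and the column names `f1` and `f2` are placeholders; the actual column names in the returned population are not assumed here.

```r
# Hypothetical helper (not part of moc.gapbk): flag the non-dominated
# entries of a two-objective table, assuming both are minimized.
non_dominated <- function(f1, f2) {
  stopifnot(length(f1) == length(f2))
  vapply(seq_along(f1), function(i) {
    # A solution dominates i if it is no worse in both objectives and
    # strictly better in at least one.
    dominates_i <- (f1 <= f1[i]) & (f2 <= f2[i]) &
      ((f1 < f1[i]) | (f2 < f2[i]))
    !any(dominates_i)
  }, logical(1))
}

# Toy objective values standing in for the two Xie-Beni objectives.
pop <- data.frame(f1 = c(0.10, 0.20, 0.30, 0.25, 0.40),
                  f2 = c(0.50, 0.30, 0.20, 0.35, 0.45))
pop$on_front <- non_dominated(pop$f1, pop$f2)
pop[pop$on_front, ]
```

No single front member is best on both objectives, which is why the function returns a set of solutions rather than one partition.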
Jorge Parraga-Alava, Marcio Dorn, Mario Inostroza-Ponta
J. Parraga-Alava, M. Dorn, M. Inostroza-Ponta (2018). A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. BioData Mining, 11(1), 1-16. doi:10.1186/s13040-018-0178-4.
K. Deb, A. Pratap, S. Agarwal, T. Meyarivan (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182-197.
F. Glover (1997). Tabu Search and Adaptive Memory Programming: Advances, Applications and Challenges. Interfaces in Computer Science and Operations Research, 1-75.
J. Dubois-Lacoste, M. Lopez-Ibanez, T. Stutzle (2015). Anytime Pareto local search. European Journal of Operational Research, 243(2), 369-385.
    set.seed(1)
    x <- matrix(stats::runif(50 * 20, min = -5, max = 10), nrow = 50, ncol = 20)
    dmatrix1 <- as.matrix(stats::dist(x, method = "euclidean"))
    dmatrix2 <- as.matrix(stats::dist(x, method = "manhattan"))
    res <- moc.gapbk(dmatrix1, dmatrix2, num_k = 3, generation = 5, pop_size = 6)
    head(res$matrix.solutions)