moc.gapbk implements the Multi-Objective
Clustering Algorithm Guided by a-Priori Biological Knowledge
(MOC-GaPBK) introduced by Parraga-Alava and colleagues (2018). The
algorithm couples the well-known NSGA-II evolutionary engine with two
memetic strategies, Path-Relinking and Pareto
Local Search, and minimises two Xie-Beni cluster validity
indices simultaneously, one per distance matrix.
The two-objective formulation is what makes this algorithm useful for gene expression clustering: one matrix typically captures the expression patterns (e.g. Euclidean distance on expression values) and the other encodes a-priori biological knowledge (e.g. Gene Ontology semantic similarity). The algorithm returns the full Pareto front of clustering solutions, so the analyst can choose a trade-off explicitly rather than committing to a single weighted aggregate.
This vignette walks through:
geneexpr dataset.moc.gapbk without local search.geneexpr datasetThe package ships with a small synthetic dataset designed to mimic the two information sources that a real gene expression study would provide.
library(moc.gapbk)
data(geneexpr)
str(geneexpr, max.level = 1)
#> List of 3
#> $ expression: num [1:100, 1:20] 0.2603 -0.5398 0.0696 -0.0424 -0.3333 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ go_sim : num [1:100, 1:100] 1 0.895 0.792 0.851 0.786 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ process : Factor w/ 5 levels "p1","p2","p3",..: 1 1 1 1 1 1 1 1 1 1 ...The dataset contains three elements:
geneexpr$expression: a 100 by 20 matrix of expression
values for 100 genes measured under 20 conditions.geneexpr$go_sim: a 100 by 100 matrix of simulated Gene
Ontology semantic similarity values in [0, 1].geneexpr$process: the ground-truth biological process
membership of each gene (5 processes, 20 genes per process).A quick look at the structure of the expression patterns by process:
op <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
image(t(geneexpr$expression[order(geneexpr$process), ]),
col = hcl.colors(50, "RdBu", rev = TRUE),
xlab = "Condition", ylab = "Gene (sorted by process)",
main = "Expression matrix")
image(geneexpr$go_sim[order(geneexpr$process),
order(geneexpr$process)],
col = hcl.colors(50, "YlOrRd", rev = TRUE),
xlab = "Gene", ylab = "Gene",
main = "GO semantic similarity")The block-diagonal structure of the GO similarity matrix reflects the fact that genes within the same biological process share more semantic content than genes from different processes.
moc.gapbk requires two square distance matrices of
identical shape. The first matrix is built from the expression values
using a standard Euclidean distance. The second matrix is built from the
GO semantic similarity by taking the complement.
d_expr <- as.matrix(stats::dist(geneexpr$expression, method = "euclidean"))
d_go <- 1 - geneexpr$go_sim
dim(d_expr)
#> [1] 100 100
dim(d_go)
#> [1] 100 100Both matrices have shape 100 by 100, one row and column per gene.
A first call without local search uses the NSGA-II engine alone. This is the fastest configuration and serves as a baseline.
set.seed(2026)
res_baseline <- moc.gapbk(
dmatrix1 = d_expr,
dmatrix2 = d_go,
num_k = 5,
generation = 30,
pop_size = 12,
local_search = FALSE
)The returned object has three elements:
population: the final NSGA-II population with
objectives, Pareto rank and crowding distance.matrix.solutions: one column per non-dominated
solution; rows are cluster assignments.clustering: a list version of the same partitions.head(res_baseline$population)
#> V1 V2 V3 V4 V5 obj1 obj2 paretoranking crowding
#> 1 66 36 93 47 5 7.105700 4.480562 1 Inf
#> 2 3 61 25 43 87 5.584935 4.812554 1 Inf
#> 13 3 45 25 77 87 5.806474 4.724716 1 0.039417336
#> 4 3 61 25 50 87 5.702521 4.793325 1 0.005283226
#> 6 3 45 25 61 87 5.638220 4.793627 1 0.002990073
#> 18 3 61 25 58 87 5.603635 4.795165 1 0.001095566Enabling local_search = TRUE activates Path-Relinking
(PR) and Pareto Local Search (PLS). The version 0.3.0 of the package
vectorises and batches both routines, so the cost is much lower than it
was in earlier releases.
When ggplot2 and patchwork are available,
the two fronts can be plotted side by side for direct visual
comparison.
library(ggplot2)
library(patchwork)
prep <- function(res, label) {
pop <- res$population
data.frame(
obj1 = pop$obj1,
obj2 = pop$obj2,
pareto = pop$paretoranking == 1,
modo = label
)
}
df_base <- prep(res_baseline, "NSGA-II only")
df_enh <- prep(res_enhanced, "NSGA-II + PR + PLS")
tema <- theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"),
legend.position = "top",
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "white", color = NA))
plot_front <- function(df, titulo, col_pareto) {
pareto <- df[df$pareto, ]
pareto <- pareto[order(pareto$obj1), ]
ggplot(df, aes(x = obj1, y = obj2)) +
geom_point(data = df[!df$pareto, ],
color = "grey75", size = 2.2, alpha = 0.6) +
geom_step(data = pareto, direction = "vh",
color = col_pareto, linewidth = 0.7, alpha = 0.6) +
geom_point(data = pareto, color = col_pareto, size = 3.3) +
geom_point(data = pareto, shape = 21, fill = NA, color = "white",
size = 4, stroke = 0.7) +
labs(title = titulo,
subtitle = paste0("Non-dominated: ", nrow(pareto),
" | Total: ", nrow(df)),
x = expression(paste("Objective 1: Xie-Beni on ", bold(D)[expr])),
y = expression(paste("Objective 2: Xie-Beni on ", bold(D)[GO]))) +
tema
}
p1 <- plot_front(df_base, "NSGA-II only", "#2E86AB")
p2 <- plot_front(df_enh, "NSGA-II + PR + PLS", "#C73E1D")
p1 + p2 + plot_annotation(
title = "Pareto fronts on the geneexpr dataset",
subtitle = "Effect of activating Path-Relinking and Pareto Local Search",
theme = theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "grey30"))
)A second view overlays both fronts so they can be compared directly:
df_all <- rbind(df_base, df_enh)
df_front <- df_all[df_all$pareto, ]
df_front <- df_front[order(df_front$modo, df_front$obj1), ]
ggplot(df_front, aes(x = obj1, y = obj2,
color = modo, fill = modo, shape = modo)) +
geom_step(direction = "vh", linewidth = 0.8, alpha = 0.7) +
geom_point(size = 3.5, alpha = 0.95) +
geom_point(shape = 21, fill = NA, color = "white", size = 4.2,
stroke = 0.7) +
scale_color_manual(values = c("NSGA-II only" = "#2E86AB",
"NSGA-II + PR + PLS" = "#C73E1D")) +
scale_fill_manual(values = c("NSGA-II only" = "#2E86AB",
"NSGA-II + PR + PLS" = "#C73E1D")) +
scale_shape_manual(values = c("NSGA-II only" = 16,
"NSGA-II + PR + PLS" = 17)) +
labs(title = "Overlay of Pareto fronts",
subtitle = "A front closer to the origin and with more points is better",
x = expression(paste("Objective 1: Xie-Beni on ", bold(D)[expr])),
y = expression(paste("Objective 2: Xie-Beni on ", bold(D)[GO])),
color = "Strategy", fill = "Strategy", shape = "Strategy") +
temaWe can also tabulate the size and reach of each front:
resumen <- data.frame(
Strategy = c("NSGA-II only", "NSGA-II + PR + PLS"),
Non_dominated = c(sum(df_base$pareto), sum(df_enh$pareto)),
Min_obj1 = round(c(min(df_base$obj1[df_base$pareto]),
min(df_enh$obj1[df_enh$pareto])), 6),
Min_obj2 = round(c(min(df_base$obj2[df_base$pareto]),
min(df_enh$obj2[df_enh$pareto])), 6),
Range_obj1 = round(c(diff(range(df_base$obj1[df_base$pareto])),
diff(range(df_enh$obj1[df_enh$pareto]))), 6),
Range_obj2 = round(c(diff(range(df_base$obj2[df_base$pareto])),
diff(range(df_enh$obj2[df_enh$pareto]))), 6)
)
resumen
#> Strategy Non_dominated Min_obj1 Min_obj2 Range_obj1 Range_obj2
#> 1 NSGA-II only 6 5.584935 4.480562 1.520765 0.331992
#> 2 NSGA-II + PR + PLS 12 4.964403 3.927521 0.186094 0.732287Because the synthetic dataset has known biological process labels, we
can measure how well each strategy recovers the true structure. We pick
the best solution on the front by the sum of the two objectives, then
compare its partition against geneexpr$process using the
Adjusted Rand Index.
best_solution <- function(res) {
pop <- res$population
pareto <- pop[pop$paretoranking == 1, , drop = FALSE]
idx <- which.min(pareto$obj1 + pareto$obj2)
res$clustering[[idx]]
}
cl_base <- best_solution(res_baseline)
cl_enh <- best_solution(res_enhanced)
# Simple Adjusted Rand Index implementation to avoid extra dependencies
ari <- function(a, b) {
tab <- table(a, b)
n <- sum(tab)
sum_comb <- function(x) sum(choose(x, 2))
a_sum <- sum_comb(rowSums(tab))
b_sum <- sum_comb(colSums(tab))
index <- sum_comb(as.vector(tab))
expected <- a_sum * b_sum / choose(n, 2)
max_index <- (a_sum + b_sum) / 2
(index - expected) / (max_index - expected)
}
cat("ARI baseline:", round(ari(geneexpr$process, cl_base), 3), "\n")
#> ARI baseline: 1
cat("ARI enhanced:", round(ari(geneexpr$process, cl_enh), 3), "\n")
#> ARI enhanced: 1Version 0.3.0 rewrote both local search routines:
fastNonDominatedSorting() call to decide which solutions
survive..pr_pick_best()
instead of calling the full ranking-crowding pipeline at every step of
the path.The acceptance rules now follow the batched textbook formulation. The
calibration of set.seed() between versions 0.2.x and 0.3.0
therefore yields different but statistically equivalent fronts. See
NEWS.md for the full list of changes.
Parraga-Alava, J., Dorn, M., Inostroza-Ponta, M. (2018). A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. BioData Mining, 11(1), 1-16. doi:10.1186/s13040-018-0178-4
Deb, K., Pratap, A., Agarwal, S., Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182-197.
Glover, F. (1997). Tabu Search and Adaptive Memory Programming: Advances, Applications and Challenges. Interfaces in Computer Science and Operations Research, 1-75.
Dubois-Lacoste, J., Lopez-Ibanez, M., Stutzle, T. (2015). Anytime Pareto local search. European Journal of Operational Research, 243(2), 369-385.
geneexpr dataset