Introduction to moc.gapbk: multi-objective clustering with a-priori biological knowledge

1. Overview

moc.gapbk implements the Multi-Objective Clustering Algorithm Guided by a-Priori Biological Knowledge (MOC-GaPBK) introduced by Parraga-Alava and colleagues (2018). The algorithm couples the well-known NSGA-II evolutionary engine with two memetic strategies, Path-Relinking and Pareto Local Search, and minimises two Xie-Beni cluster validity indices simultaneously, one per distance matrix.

The two-objective formulation is what makes this algorithm useful for gene expression clustering: one matrix typically captures the expression patterns (e.g. Euclidean distance on expression values) and the other encodes a-priori biological knowledge (e.g. Gene Ontology semantic similarity). The algorithm returns the full Pareto front of clustering solutions, so the analyst can choose a trade-off explicitly rather than committing to a single weighted aggregate.

This vignette walks through:

  1. The structure of the bundled geneexpr dataset.
  2. A baseline run of moc.gapbk without local search.
  3. A second run with Path-Relinking and Pareto Local Search enabled.
  4. Visual comparison of the two Pareto fronts.
  5. Evaluation of clustering quality against the ground truth.

2. The geneexpr dataset

The package ships with a small synthetic dataset designed to mimic the two information sources that a real gene expression study would provide.

library(moc.gapbk)
data(geneexpr)

str(geneexpr, max.level = 1)
#> List of 3
#>  $ expression: num [1:100, 1:20] 0.2603 -0.5398 0.0696 -0.0424 -0.3333 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ go_sim    : num [1:100, 1:100] 1 0.895 0.792 0.851 0.786 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ process   : Factor w/ 5 levels "p1","p2","p3",..: 1 1 1 1 1 1 1 1 1 1 ...

The dataset contains three elements:

  • geneexpr$expression: a 100 by 20 matrix of expression values for 100 genes measured under 20 conditions.
  • geneexpr$go_sim: a 100 by 100 matrix of simulated Gene Ontology semantic similarity values in [0, 1].
  • geneexpr$process: the ground-truth biological process membership of each gene (5 processes, 20 genes per process).
table(geneexpr$process)
#> 
#> p1 p2 p3 p4 p5 
#> 20 20 20 20 20

A quick look at the structure of the expression patterns by process:

op <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

image(t(geneexpr$expression[order(geneexpr$process), ]),
      col = hcl.colors(50, "RdBu", rev = TRUE),
      xlab = "Condition", ylab = "Gene (sorted by process)",
      main = "Expression matrix")

image(geneexpr$go_sim[order(geneexpr$process),
                      order(geneexpr$process)],
      col = hcl.colors(50, "YlOrRd", rev = TRUE),
      xlab = "Gene", ylab = "Gene",
      main = "GO semantic similarity")


par(op)

The block-diagonal structure of the GO similarity matrix reflects the fact that genes within the same biological process share more semantic content than genes from different processes.

3. Preparing the two distance matrices

moc.gapbk requires two square distance matrices of identical shape. The first matrix is built from the expression values using a standard Euclidean distance. The second matrix is built from the GO semantic similarity by taking the complement.

d_expr <- as.matrix(stats::dist(geneexpr$expression, method = "euclidean"))
d_go   <- 1 - geneexpr$go_sim

dim(d_expr)
#> [1] 100 100
dim(d_go)
#> [1] 100 100

Both matrices have shape 100 by 100, one row and column per gene.

4. Baseline run: NSGA-II only

A first call without local search uses the NSGA-II engine alone. This is the fastest configuration and serves as a baseline.

set.seed(2026)
res_baseline <- moc.gapbk(
  dmatrix1     = d_expr,
  dmatrix2     = d_go,
  num_k        = 5,
  generation   = 30,
  pop_size     = 12,
  local_search = FALSE
)

The returned object has three elements:

names(res_baseline)
#> [1] "population"       "matrix.solutions" "clustering"
  • population: the final NSGA-II population with objectives, Pareto rank and crowding distance.
  • matrix.solutions: one column per non-dominated solution; rows are cluster assignments.
  • clustering: a list version of the same partitions.
head(res_baseline$population)
#>    V1 V2 V3 V4 V5     obj1     obj2 paretoranking    crowding
#> 1  66 36 93 47  5 7.105700 4.480562             1         Inf
#> 2   3 61 25 43 87 5.584935 4.812554             1         Inf
#> 13  3 45 25 77 87 5.806474 4.724716             1 0.039417336
#> 4   3 61 25 50 87 5.702521 4.793325             1 0.005283226
#> 6   3 45 25 61 87 5.638220 4.793627             1 0.002990073
#> 18  3 61 25 58 87 5.603635 4.795165             1 0.001095566

6. Visual comparison of the Pareto fronts

When ggplot2 and patchwork are available, the two fronts can be plotted side by side for direct visual comparison.

library(ggplot2)
library(patchwork)

prep <- function(res, label) {
  pop <- res$population
  data.frame(
    obj1   = pop$obj1,
    obj2   = pop$obj2,
    pareto = pop$paretoranking == 1,
    modo   = label
  )
}

df_base <- prep(res_baseline, "NSGA-II only")
df_enh  <- prep(res_enhanced, "NSGA-II + PR + PLS")

tema <- theme_minimal(base_size = 12) +
  theme(plot.title       = element_text(face = "bold"),
        legend.position  = "top",
        panel.grid.minor = element_blank(),
        plot.background  = element_rect(fill = "white", color = NA))

plot_front <- function(df, titulo, col_pareto) {
  pareto <- df[df$pareto, ]
  pareto <- pareto[order(pareto$obj1), ]
  ggplot(df, aes(x = obj1, y = obj2)) +
    geom_point(data = df[!df$pareto, ],
               color = "grey75", size = 2.2, alpha = 0.6) +
    geom_step(data = pareto, direction = "vh",
              color = col_pareto, linewidth = 0.7, alpha = 0.6) +
    geom_point(data = pareto, color = col_pareto, size = 3.3) +
    geom_point(data = pareto, shape = 21, fill = NA, color = "white",
               size = 4, stroke = 0.7) +
    labs(title = titulo,
         subtitle = paste0("Non-dominated: ", nrow(pareto),
                          " | Total: ", nrow(df)),
         x = expression(paste("Objective 1: Xie-Beni on ", bold(D)[expr])),
         y = expression(paste("Objective 2: Xie-Beni on ", bold(D)[GO]))) +
    tema
}

p1 <- plot_front(df_base, "NSGA-II only",          "#2E86AB")
p2 <- plot_front(df_enh,  "NSGA-II + PR + PLS",    "#C73E1D")

p1 + p2 + plot_annotation(
  title    = "Pareto fronts on the geneexpr dataset",
  subtitle = "Effect of activating Path-Relinking and Pareto Local Search",
  theme    = theme(plot.title    = element_text(face = "bold", size = 14),
                   plot.subtitle = element_text(color = "grey30"))
)

A second view overlays both fronts so they can be compared directly:

df_all   <- rbind(df_base, df_enh)
df_front <- df_all[df_all$pareto, ]
df_front <- df_front[order(df_front$modo, df_front$obj1), ]

ggplot(df_front, aes(x = obj1, y = obj2,
                     color = modo, fill = modo, shape = modo)) +
  geom_step(direction = "vh", linewidth = 0.8, alpha = 0.7) +
  geom_point(size = 3.5, alpha = 0.95) +
  geom_point(shape = 21, fill = NA, color = "white", size = 4.2,
             stroke = 0.7) +
  scale_color_manual(values = c("NSGA-II only"        = "#2E86AB",
                                 "NSGA-II + PR + PLS" = "#C73E1D")) +
  scale_fill_manual(values  = c("NSGA-II only"        = "#2E86AB",
                                 "NSGA-II + PR + PLS" = "#C73E1D")) +
  scale_shape_manual(values = c("NSGA-II only"        = 16,
                                 "NSGA-II + PR + PLS" = 17)) +
  labs(title    = "Overlay of Pareto fronts",
       subtitle = "A front closer to the origin and with more points is better",
       x        = expression(paste("Objective 1: Xie-Beni on ", bold(D)[expr])),
       y        = expression(paste("Objective 2: Xie-Beni on ", bold(D)[GO])),
       color    = "Strategy", fill = "Strategy", shape = "Strategy") +
  tema

7. Quantitative comparison

We can also tabulate the size and reach of each front:

resumen <- data.frame(
  Strategy           = c("NSGA-II only", "NSGA-II + PR + PLS"),
  Non_dominated      = c(sum(df_base$pareto),   sum(df_enh$pareto)),
  Min_obj1           = round(c(min(df_base$obj1[df_base$pareto]),
                               min(df_enh$obj1[df_enh$pareto])),  6),
  Min_obj2           = round(c(min(df_base$obj2[df_base$pareto]),
                               min(df_enh$obj2[df_enh$pareto])),  6),
  Range_obj1         = round(c(diff(range(df_base$obj1[df_base$pareto])),
                               diff(range(df_enh$obj1[df_enh$pareto]))), 6),
  Range_obj2         = round(c(diff(range(df_base$obj2[df_base$pareto])),
                               diff(range(df_enh$obj2[df_enh$pareto]))), 6)
)
resumen
#>             Strategy Non_dominated Min_obj1 Min_obj2 Range_obj1 Range_obj2
#> 1       NSGA-II only             6 5.584935 4.480562   1.520765   0.331992
#> 2 NSGA-II + PR + PLS            12 4.964403 3.927521   0.186094   0.732287

8. Recovering ground truth

Because the synthetic dataset has known biological process labels, we can measure how well each strategy recovers the true structure. We pick the best solution on the front by the sum of the two objectives, then compare its partition against geneexpr$process using the Adjusted Rand Index.

best_solution <- function(res) {
  pop <- res$population
  pareto <- pop[pop$paretoranking == 1, , drop = FALSE]
  idx <- which.min(pareto$obj1 + pareto$obj2)
  res$clustering[[idx]]
}

cl_base <- best_solution(res_baseline)
cl_enh  <- best_solution(res_enhanced)

# Simple Adjusted Rand Index implementation to avoid extra dependencies
ari <- function(a, b) {
  tab <- table(a, b)
  n   <- sum(tab)
  sum_comb <- function(x) sum(choose(x, 2))
  a_sum <- sum_comb(rowSums(tab))
  b_sum <- sum_comb(colSums(tab))
  index <- sum_comb(as.vector(tab))
  expected <- a_sum * b_sum / choose(n, 2)
  max_index <- (a_sum + b_sum) / 2
  (index - expected) / (max_index - expected)
}

cat("ARI baseline:", round(ari(geneexpr$process, cl_base), 3), "\n")
#> ARI baseline: 1
cat("ARI enhanced:", round(ari(geneexpr$process, cl_enh),  3), "\n")
#> ARI enhanced: 1

9. Performance notes for version 0.3.0

Version 0.3.0 rewrote both local search routines:

  • Pareto Local Search now batches the entire neighborhood of the current incumbent, evaluates all candidates in a single vectorised pass and uses one fastNonDominatedSorting() call to decide which solutions survive.
  • Path-Relinking evaluates the batch of intermediate solutions in a single vectorised pass via .pr_pick_best() instead of calling the full ranking-crowding pipeline at every step of the path.

The acceptance rules now follow the batched textbook formulation. The calibration of set.seed() between versions 0.2.x and 0.3.0 therefore yields different but statistically equivalent fronts. See NEWS.md for the full list of changes.

10. References

  • Parraga-Alava, J., Dorn, M., Inostroza-Ponta, M. (2018). A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. BioData Mining, 11(1), 1-16. doi:10.1186/s13040-018-0178-4

  • Deb, K., Pratap, A., Agarwal, S., Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182-197.

  • Glover, F. (1997). Tabu Search and Adaptive Memory Programming: Advances, Applications and Challenges. Interfaces in Computer Science and Operations Research, 1-75.

  • Dubois-Lacoste, J., Lopez-Ibanez, M., Stutzle, T. (2015). Anytime Pareto local search. European Journal of Operational Research, 243(2), 369-385.