--- title: "Introduction to moc.gapbk: multi-objective clustering with a-priori biological knowledge" author: "Jorge Parraga-Alava, Marcio Dorn, Mario Inostroza-Ponta" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 fig_width: 8 fig_height: 5 vignette: > %\VignetteIndexEntry{Introduction to moc.gapbk} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", dpi = 96 ) set.seed(2026) ``` ## 1. Overview `moc.gapbk` implements the **Multi-Objective Clustering Algorithm Guided by a-Priori Biological Knowledge** (MOC-GaPBK) introduced by Parraga-Alava and colleagues (2018). The algorithm couples the well-known NSGA-II evolutionary engine with two memetic strategies, **Path-Relinking** and **Pareto Local Search**, and minimises two Xie-Beni cluster validity indices simultaneously, one per distance matrix. The two-objective formulation is what makes this algorithm useful for gene expression clustering: one matrix typically captures the expression patterns (e.g. Euclidean distance on expression values) and the other encodes a-priori biological knowledge (e.g. Gene Ontology semantic similarity). The algorithm returns the full Pareto front of clustering solutions, so the analyst can choose a trade-off explicitly rather than committing to a single weighted aggregate. This vignette walks through: 1. The structure of the bundled `geneexpr` dataset. 2. A baseline run of `moc.gapbk` without local search. 3. A second run with Path-Relinking and Pareto Local Search enabled. 4. Visual comparison of the two Pareto fronts. 5. Evaluation of clustering quality against the ground truth. ## 2. The `geneexpr` dataset The package ships with a small synthetic dataset designed to mimic the two information sources that a real gene expression study would provide. ```{r load-data} library(moc.gapbk) data(geneexpr) str(geneexpr, max.level = 1) ``` The dataset contains three elements: - `geneexpr$expression`: a 100 by 20 matrix of expression values for 100 genes measured under 20 conditions. - `geneexpr$go_sim`: a 100 by 100 matrix of simulated Gene Ontology semantic similarity values in `[0, 1]`. - `geneexpr$process`: the ground-truth biological process membership of each gene (5 processes, 20 genes per process). ```{r ground-truth} table(geneexpr$process) ``` A quick look at the structure of the expression patterns by process: ```{r expression-heatmap, fig.height = 4.5, fig.width = 8} op <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1)) image(t(geneexpr$expression[order(geneexpr$process), ]), col = hcl.colors(50, "RdBu", rev = TRUE), xlab = "Condition", ylab = "Gene (sorted by process)", main = "Expression matrix") image(geneexpr$go_sim[order(geneexpr$process), order(geneexpr$process)], col = hcl.colors(50, "YlOrRd", rev = TRUE), xlab = "Gene", ylab = "Gene", main = "GO semantic similarity") par(op) ``` The block-diagonal structure of the GO similarity matrix reflects the fact that genes within the same biological process share more semantic content than genes from different processes. ## 3. Preparing the two distance matrices `moc.gapbk` requires two square distance matrices of identical shape. The first matrix is built from the expression values using a standard Euclidean distance. The second matrix is built from the GO semantic similarity by taking the complement. ```{r build-distances} d_expr <- as.matrix(stats::dist(geneexpr$expression, method = "euclidean")) d_go <- 1 - geneexpr$go_sim dim(d_expr) dim(d_go) ``` Both matrices have shape 100 by 100, one row and column per gene. ## 4. Baseline run: NSGA-II only A first call without local search uses the NSGA-II engine alone. This is the fastest configuration and serves as a baseline. ```{r run-baseline, results = "hide"} set.seed(2026) res_baseline <- moc.gapbk( dmatrix1 = d_expr, dmatrix2 = d_go, num_k = 5, generation = 30, pop_size = 12, local_search = FALSE ) ``` The returned object has three elements: ```{r baseline-structure} names(res_baseline) ``` - `population`: the final NSGA-II population with objectives, Pareto rank and crowding distance. - `matrix.solutions`: one column per non-dominated solution; rows are cluster assignments. - `clustering`: a list version of the same partitions. ```{r baseline-quick} head(res_baseline$population) ``` ## 5. Enhanced run: NSGA-II + Path-Relinking + Pareto Local Search Enabling `local_search = TRUE` activates Path-Relinking (PR) and Pareto Local Search (PLS). The version 0.3.0 of the package vectorises and batches both routines, so the cost is much lower than it was in earlier releases. ```{r run-enhanced, results = "hide"} set.seed(2026) res_enhanced <- moc.gapbk( dmatrix1 = d_expr, dmatrix2 = d_go, num_k = 5, generation = 30, pop_size = 12, neighborhood = 0.10, local_search = TRUE, cores = 2 ) ``` ## 6. Visual comparison of the Pareto fronts When `ggplot2` and `patchwork` are available, the two fronts can be plotted side by side for direct visual comparison. ```{r plot-fronts, fig.height = 5, fig.width = 11, eval = requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("patchwork", quietly = TRUE)} library(ggplot2) library(patchwork) prep <- function(res, label) { pop <- res$population data.frame( obj1 = pop$obj1, obj2 = pop$obj2, pareto = pop$paretoranking == 1, modo = label ) } df_base <- prep(res_baseline, "NSGA-II only") df_enh <- prep(res_enhanced, "NSGA-II + PR + PLS") tema <- theme_minimal(base_size = 12) + theme(plot.title = element_text(face = "bold"), legend.position = "top", panel.grid.minor = element_blank(), plot.background = element_rect(fill = "white", color = NA)) plot_front <- function(df, titulo, col_pareto) { pareto <- df[df$pareto, ] pareto <- pareto[order(pareto$obj1), ] ggplot(df, aes(x = obj1, y = obj2)) + geom_point(data = df[!df$pareto, ], color = "grey75", size = 2.2, alpha = 0.6) + geom_step(data = pareto, direction = "vh", color = col_pareto, linewidth = 0.7, alpha = 0.6) + geom_point(data = pareto, color = col_pareto, size = 3.3) + geom_point(data = pareto, shape = 21, fill = NA, color = "white", size = 4, stroke = 0.7) + labs(title = titulo, subtitle = paste0("Non-dominated: ", nrow(pareto), " | Total: ", nrow(df)), x = expression(paste("Objective 1: Xie-Beni on ", bold(D)[expr])), y = expression(paste("Objective 2: Xie-Beni on ", bold(D)[GO]))) + tema } p1 <- plot_front(df_base, "NSGA-II only", "#2E86AB") p2 <- plot_front(df_enh, "NSGA-II + PR + PLS", "#C73E1D") p1 + p2 + plot_annotation( title = "Pareto fronts on the geneexpr dataset", subtitle = "Effect of activating Path-Relinking and Pareto Local Search", theme = theme(plot.title = element_text(face = "bold", size = 14), plot.subtitle = element_text(color = "grey30")) ) ``` A second view overlays both fronts so they can be compared directly: ```{r plot-overlay, fig.height = 5, fig.width = 8, eval = requireNamespace("ggplot2", quietly = TRUE)} df_all <- rbind(df_base, df_enh) df_front <- df_all[df_all$pareto, ] df_front <- df_front[order(df_front$modo, df_front$obj1), ] ggplot(df_front, aes(x = obj1, y = obj2, color = modo, fill = modo, shape = modo)) + geom_step(direction = "vh", linewidth = 0.8, alpha = 0.7) + geom_point(size = 3.5, alpha = 0.95) + geom_point(shape = 21, fill = NA, color = "white", size = 4.2, stroke = 0.7) + scale_color_manual(values = c("NSGA-II only" = "#2E86AB", "NSGA-II + PR + PLS" = "#C73E1D")) + scale_fill_manual(values = c("NSGA-II only" = "#2E86AB", "NSGA-II + PR + PLS" = "#C73E1D")) + scale_shape_manual(values = c("NSGA-II only" = 16, "NSGA-II + PR + PLS" = 17)) + labs(title = "Overlay of Pareto fronts", subtitle = "A front closer to the origin and with more points is better", x = expression(paste("Objective 1: Xie-Beni on ", bold(D)[expr])), y = expression(paste("Objective 2: Xie-Beni on ", bold(D)[GO])), color = "Strategy", fill = "Strategy", shape = "Strategy") + tema ``` ## 7. Quantitative comparison We can also tabulate the size and reach of each front: ```{r front-summary} resumen <- data.frame( Strategy = c("NSGA-II only", "NSGA-II + PR + PLS"), Non_dominated = c(sum(df_base$pareto), sum(df_enh$pareto)), Min_obj1 = round(c(min(df_base$obj1[df_base$pareto]), min(df_enh$obj1[df_enh$pareto])), 6), Min_obj2 = round(c(min(df_base$obj2[df_base$pareto]), min(df_enh$obj2[df_enh$pareto])), 6), Range_obj1 = round(c(diff(range(df_base$obj1[df_base$pareto])), diff(range(df_enh$obj1[df_enh$pareto]))), 6), Range_obj2 = round(c(diff(range(df_base$obj2[df_base$pareto])), diff(range(df_enh$obj2[df_enh$pareto]))), 6) ) resumen ``` ## 8. Recovering ground truth Because the synthetic dataset has known biological process labels, we can measure how well each strategy recovers the true structure. We pick the best solution on the front by the sum of the two objectives, then compare its partition against `geneexpr$process` using the Adjusted Rand Index. ```{r ground-truth-eval} best_solution <- function(res) { pop <- res$population pareto <- pop[pop$paretoranking == 1, , drop = FALSE] idx <- which.min(pareto$obj1 + pareto$obj2) res$clustering[[idx]] } cl_base <- best_solution(res_baseline) cl_enh <- best_solution(res_enhanced) # Simple Adjusted Rand Index implementation to avoid extra dependencies ari <- function(a, b) { tab <- table(a, b) n <- sum(tab) sum_comb <- function(x) sum(choose(x, 2)) a_sum <- sum_comb(rowSums(tab)) b_sum <- sum_comb(colSums(tab)) index <- sum_comb(as.vector(tab)) expected <- a_sum * b_sum / choose(n, 2) max_index <- (a_sum + b_sum) / 2 (index - expected) / (max_index - expected) } cat("ARI baseline:", round(ari(geneexpr$process, cl_base), 3), "\n") cat("ARI enhanced:", round(ari(geneexpr$process, cl_enh), 3), "\n") ``` ## 9. Performance notes for version 0.3.0 Version 0.3.0 rewrote both local search routines: - **Pareto Local Search** now batches the entire neighborhood of the current incumbent, evaluates all candidates in a single vectorised pass and uses one `fastNonDominatedSorting()` call to decide which solutions survive. - **Path-Relinking** evaluates the batch of intermediate solutions in a single vectorised pass via `.pr_pick_best()` instead of calling the full ranking-crowding pipeline at every step of the path. The acceptance rules now follow the batched textbook formulation. The calibration of `set.seed()` between versions 0.2.x and 0.3.0 therefore yields different but statistically equivalent fronts. See `NEWS.md` for the full list of changes. ## 10. References - Parraga-Alava, J., Dorn, M., Inostroza-Ponta, M. (2018). A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. *BioData Mining*, 11(1), 1-16. [doi:10.1186/s13040-018-0178-4](https://doi.org/10.1186/s13040-018-0178-4) - Deb, K., Pratap, A., Agarwal, S., Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. *IEEE Transactions on Evolutionary Computation*, 6(2), 182-197. - Glover, F. (1997). Tabu Search and Adaptive Memory Programming: Advances, Applications and Challenges. *Interfaces in Computer Science and Operations Research*, 1-75. - Dubois-Lacoste, J., Lopez-Ibanez, M., Stutzle, T. (2015). Anytime Pareto local search. *European Journal of Operational Research*, 243(2), 369-385.