---
title: "Introduction to moc.gapbk: multi-objective clustering with a-priori biological knowledge"
author: "Jorge Parraga-Alava, Marcio Dorn, Mario Inostroza-Ponta"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_width: 8
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{Introduction to moc.gapbk}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center",
  dpi = 96
)
set.seed(2026)
```

## 1. Overview

`moc.gapbk` implements the **Multi-Objective Clustering Algorithm
Guided by a-Priori Biological Knowledge** (MOC-GaPBK) introduced by
Parraga-Alava and colleagues (2018). The algorithm couples the
well-known NSGA-II evolutionary engine with two memetic strategies,
**Path-Relinking** and **Pareto Local Search**, and minimises two
Xie-Beni cluster validity indices simultaneously, one per distance
matrix.

The two-objective formulation is what makes this algorithm useful for
gene expression clustering: one matrix typically captures the
expression patterns (e.g. Euclidean distance on expression values) and
the other encodes a-priori biological knowledge (e.g. Gene Ontology
semantic similarity). The algorithm returns the full Pareto front of
clustering solutions, so the analyst can choose a trade-off explicitly
rather than committing to a single weighted aggregate.

This vignette walks through:

1. The structure of the bundled `geneexpr` dataset.
2. A baseline run of `moc.gapbk` without local search.
3. A second run with Path-Relinking and Pareto Local Search enabled.
4. Visual comparison of the two Pareto fronts.
5. Evaluation of clustering quality against the ground truth.

## 2. The `geneexpr` dataset

The package ships with a small synthetic dataset designed to mimic the
two information sources that a real gene expression study would
provide.

```{r load-data}
library(moc.gapbk)
data(geneexpr)

str(geneexpr, max.level = 1)
```

The dataset contains three elements:

- `geneexpr$expression`: a 100 by 20 matrix of expression values for
  100 genes measured under 20 conditions.
- `geneexpr$go_sim`: a 100 by 100 matrix of simulated Gene Ontology
  semantic similarity values in `[0, 1]`.
- `geneexpr$process`: the ground-truth biological process membership
  of each gene (5 processes, 20 genes per process).

```{r ground-truth}
table(geneexpr$process)
```

A quick look at the structure of the expression patterns by process:

```{r expression-heatmap, fig.height = 4.5, fig.width = 8}
op <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

image(t(geneexpr$expression[order(geneexpr$process), ]),
      col = hcl.colors(50, "RdBu", rev = TRUE),
      xlab = "Condition", ylab = "Gene (sorted by process)",
      main = "Expression matrix")

image(geneexpr$go_sim[order(geneexpr$process),
                      order(geneexpr$process)],
      col = hcl.colors(50, "YlOrRd", rev = TRUE),
      xlab = "Gene", ylab = "Gene",
      main = "GO semantic similarity")

par(op)
```

The block-diagonal structure of the GO similarity matrix reflects the
fact that genes within the same biological process share more semantic
content than genes from different processes.

## 3. Preparing the two distance matrices

`moc.gapbk` requires two square distance matrices of identical shape.
The first matrix is built from the expression values using a standard
Euclidean distance. The second matrix is built from the GO semantic
similarity by taking the complement.

```{r build-distances}
d_expr <- as.matrix(stats::dist(geneexpr$expression, method = "euclidean"))
d_go   <- 1 - geneexpr$go_sim

dim(d_expr)
dim(d_go)
```

Both matrices have shape 100 by 100, one row and column per gene.

## 4. Baseline run: NSGA-II only

A first call without local search uses the NSGA-II engine alone. This
is the fastest configuration and serves as a baseline.

```{r run-baseline, results = "hide"}
set.seed(2026)
res_baseline <- moc.gapbk(
  dmatrix1     = d_expr,
  dmatrix2     = d_go,
  num_k        = 5,
  generation   = 30,
  pop_size     = 12,
  local_search = FALSE
)
```

The returned object has three elements:

```{r baseline-structure}
names(res_baseline)
```

- `population`: the final NSGA-II population with objectives, Pareto
  rank and crowding distance.
- `matrix.solutions`: one column per non-dominated solution; rows are
  cluster assignments.
- `clustering`: a list version of the same partitions.

```{r baseline-quick}
head(res_baseline$population)
```

## 5. Enhanced run: NSGA-II + Path-Relinking + Pareto Local Search

Enabling `local_search = TRUE` activates Path-Relinking (PR) and
Pareto Local Search (PLS). The version 0.3.0 of the package vectorises
and batches both routines, so the cost is much lower than it was in
earlier releases.

```{r run-enhanced, results = "hide"}
set.seed(2026)
res_enhanced <- moc.gapbk(
  dmatrix1     = d_expr,
  dmatrix2     = d_go,
  num_k        = 5,
  generation   = 30,
  pop_size     = 12,
  neighborhood = 0.10,
  local_search = TRUE,
  cores        = 2
)
```

## 6. Visual comparison of the Pareto fronts

When `ggplot2` and `patchwork` are available, the two fronts can be
plotted side by side for direct visual comparison.

```{r plot-fronts, fig.height = 5, fig.width = 11, eval = requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("patchwork", quietly = TRUE)}
library(ggplot2)
library(patchwork)

prep <- function(res, label) {
  pop <- res$population
  data.frame(
    obj1   = pop$obj1,
    obj2   = pop$obj2,
    pareto = pop$paretoranking == 1,
    modo   = label
  )
}

df_base <- prep(res_baseline, "NSGA-II only")
df_enh  <- prep(res_enhanced, "NSGA-II + PR + PLS")

tema <- theme_minimal(base_size = 12) +
  theme(plot.title       = element_text(face = "bold"),
        legend.position  = "top",
        panel.grid.minor = element_blank(),
        plot.background  = element_rect(fill = "white", color = NA))

plot_front <- function(df, titulo, col_pareto) {
  pareto <- df[df$pareto, ]
  pareto <- pareto[order(pareto$obj1), ]
  ggplot(df, aes(x = obj1, y = obj2)) +
    geom_point(data = df[!df$pareto, ],
               color = "grey75", size = 2.2, alpha = 0.6) +
    geom_step(data = pareto, direction = "vh",
              color = col_pareto, linewidth = 0.7, alpha = 0.6) +
    geom_point(data = pareto, color = col_pareto, size = 3.3) +
    geom_point(data = pareto, shape = 21, fill = NA, color = "white",
               size = 4, stroke = 0.7) +
    labs(title = titulo,
         subtitle = paste0("Non-dominated: ", nrow(pareto),
                          " | Total: ", nrow(df)),
         x = expression(paste("Objective 1: Xie-Beni on ", bold(D)[expr])),
         y = expression(paste("Objective 2: Xie-Beni on ", bold(D)[GO]))) +
    tema
}

p1 <- plot_front(df_base, "NSGA-II only",          "#2E86AB")
p2 <- plot_front(df_enh,  "NSGA-II + PR + PLS",    "#C73E1D")

p1 + p2 + plot_annotation(
  title    = "Pareto fronts on the geneexpr dataset",
  subtitle = "Effect of activating Path-Relinking and Pareto Local Search",
  theme    = theme(plot.title    = element_text(face = "bold", size = 14),
                   plot.subtitle = element_text(color = "grey30"))
)
```

A second view overlays both fronts so they can be compared directly:

```{r plot-overlay, fig.height = 5, fig.width = 8, eval = requireNamespace("ggplot2", quietly = TRUE)}
df_all   <- rbind(df_base, df_enh)
df_front <- df_all[df_all$pareto, ]
df_front <- df_front[order(df_front$modo, df_front$obj1), ]

ggplot(df_front, aes(x = obj1, y = obj2,
                     color = modo, fill = modo, shape = modo)) +
  geom_step(direction = "vh", linewidth = 0.8, alpha = 0.7) +
  geom_point(size = 3.5, alpha = 0.95) +
  geom_point(shape = 21, fill = NA, color = "white", size = 4.2,
             stroke = 0.7) +
  scale_color_manual(values = c("NSGA-II only"        = "#2E86AB",
                                 "NSGA-II + PR + PLS" = "#C73E1D")) +
  scale_fill_manual(values  = c("NSGA-II only"        = "#2E86AB",
                                 "NSGA-II + PR + PLS" = "#C73E1D")) +
  scale_shape_manual(values = c("NSGA-II only"        = 16,
                                 "NSGA-II + PR + PLS" = 17)) +
  labs(title    = "Overlay of Pareto fronts",
       subtitle = "A front closer to the origin and with more points is better",
       x        = expression(paste("Objective 1: Xie-Beni on ", bold(D)[expr])),
       y        = expression(paste("Objective 2: Xie-Beni on ", bold(D)[GO])),
       color    = "Strategy", fill = "Strategy", shape = "Strategy") +
  tema
```

## 7. Quantitative comparison

We can also tabulate the size and reach of each front:

```{r front-summary}
resumen <- data.frame(
  Strategy           = c("NSGA-II only", "NSGA-II + PR + PLS"),
  Non_dominated      = c(sum(df_base$pareto),   sum(df_enh$pareto)),
  Min_obj1           = round(c(min(df_base$obj1[df_base$pareto]),
                               min(df_enh$obj1[df_enh$pareto])),  6),
  Min_obj2           = round(c(min(df_base$obj2[df_base$pareto]),
                               min(df_enh$obj2[df_enh$pareto])),  6),
  Range_obj1         = round(c(diff(range(df_base$obj1[df_base$pareto])),
                               diff(range(df_enh$obj1[df_enh$pareto]))), 6),
  Range_obj2         = round(c(diff(range(df_base$obj2[df_base$pareto])),
                               diff(range(df_enh$obj2[df_enh$pareto]))), 6)
)
resumen
```

## 8. Recovering ground truth

Because the synthetic dataset has known biological process labels, we
can measure how well each strategy recovers the true structure. We pick
the best solution on the front by the sum of the two objectives, then
compare its partition against `geneexpr$process` using the Adjusted
Rand Index.

```{r ground-truth-eval}
best_solution <- function(res) {
  pop <- res$population
  pareto <- pop[pop$paretoranking == 1, , drop = FALSE]
  idx <- which.min(pareto$obj1 + pareto$obj2)
  res$clustering[[idx]]
}

cl_base <- best_solution(res_baseline)
cl_enh  <- best_solution(res_enhanced)

# Simple Adjusted Rand Index implementation to avoid extra dependencies
ari <- function(a, b) {
  tab <- table(a, b)
  n   <- sum(tab)
  sum_comb <- function(x) sum(choose(x, 2))
  a_sum <- sum_comb(rowSums(tab))
  b_sum <- sum_comb(colSums(tab))
  index <- sum_comb(as.vector(tab))
  expected <- a_sum * b_sum / choose(n, 2)
  max_index <- (a_sum + b_sum) / 2
  (index - expected) / (max_index - expected)
}

cat("ARI baseline:", round(ari(geneexpr$process, cl_base), 3), "\n")
cat("ARI enhanced:", round(ari(geneexpr$process, cl_enh),  3), "\n")
```

## 9. Performance notes for version 0.3.0

Version 0.3.0 rewrote both local search routines:

- **Pareto Local Search** now batches the entire neighborhood of the
  current incumbent, evaluates all candidates in a single vectorised
  pass and uses one `fastNonDominatedSorting()` call to decide which
  solutions survive.
- **Path-Relinking** evaluates the batch of intermediate solutions in
  a single vectorised pass via `.pr_pick_best()` instead of calling
  the full ranking-crowding pipeline at every step of the path.

The acceptance rules now follow the batched textbook formulation. The
calibration of `set.seed()` between versions 0.2.x and 0.3.0 therefore
yields different but statistically equivalent fronts. See `NEWS.md`
for the full list of changes.

## 10. References

- Parraga-Alava, J., Dorn, M., Inostroza-Ponta, M. (2018). A
  multi-objective gene clustering algorithm guided by apriori
  biological knowledge with intensification and diversification
  strategies. *BioData Mining*, 11(1), 1-16.
  [doi:10.1186/s13040-018-0178-4](https://doi.org/10.1186/s13040-018-0178-4)

- Deb, K., Pratap, A., Agarwal, S., Meyarivan, T. (2002). A fast and
  elitist multiobjective genetic algorithm: NSGA-II. *IEEE Transactions
  on Evolutionary Computation*, 6(2), 182-197.

- Glover, F. (1997). Tabu Search and Adaptive Memory Programming:
  Advances, Applications and Challenges. *Interfaces in Computer
  Science and Operations Research*, 1-75.

- Dubois-Lacoste, J., Lopez-Ibanez, M., Stutzle, T. (2015). Anytime
  Pareto local search. *European Journal of Operational Research*,
  243(2), 369-385.