Package 'moc.gapbk'

Title: Multi-Objective Clustering Algorithm Guided by a-Priori Biological Knowledge
Description: Implements the Multi-Objective Clustering Algorithm Guided by a-Priori Biological Knowledge ('MOC-GaPBK') proposed by Parraga-Alava and others (2018) <doi:10.1186/s13040-018-0178-4>. The algorithm performs gene clustering using 'NSGA-II' as the underlying multi-objective evolutionary engine, together with Path-Relinking and Pareto Local Search as intensification and diversification strategies. Two versions of the Xie-Beni validity index are used as objective functions, one per distance matrix, so that prior biological knowledge can be incorporated through the second matrix.
Authors: Jorge Parraga-Alava [aut, cre, cph] (ORCID: <https://orcid.org/0000-0001-8558-9122>), Marcio Dorn [aut], Mario Inostroza-Ponta [aut]
Maintainer: Jorge Parraga-Alava <[email protected]>
License: GPL-2
Version: 0.3.0
Built: 2026-05-15 19:27:46 UTC
Source: https://github.com/jorgeklz/package-moc.gapbk

Help Index


Synthetic gene expression dataset with Gene Ontology similarity

Description

A simulated gene expression dataset designed to illustrate the multi-objective clustering capabilities of moc.gapbk. The dataset reproduces the typical structure found in real gene expression studies coupled with a-priori biological knowledge from the Gene Ontology (GO).

Usage

geneexpr

Format

A named list with three elements:

expression

A numeric matrix of 100 rows (genes) and 20 columns (experimental conditions). Each gene belongs to one of five biological processes and follows a noisy version of that process's prototype expression pattern.

go_sim

A symmetric 100 by 100 matrix of simulated Gene Ontology semantic similarity values in [0, 1]. Genes within the same biological process have high similarity (~0.85), whereas genes from different processes have low similarity (~0.10).

process

A factor of length 100 with five levels (p1 through p5) encoding the true biological process membership of each gene. This is the ground truth and should be used only for evaluation, not for clustering.
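
One way to use the ground-truth labels for evaluation is to compare them with a partition after clustering. The sketch below computes a plain Rand index in base R. The helper rand_index and the vector found are hypothetical illustrations, not part of the package; in practice found would be one of the partitions returned by moc.gapbk.

```r
# Rand index: fraction of object pairs on which two partitions agree
# (both partitions put the pair together, or both keep it apart).
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs  <- upper.tri(same_a)   # count each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

# Toy ground truth with three processes of four genes each
truth <- factor(rep(c("p1", "p2", "p3"), each = 4))

# Hypothetical clustering result with one misplaced object
found <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)

rand_index(as.integer(truth), found)
table(truth, found)   # contingency table of truth versus clusters
```

The labels only need to agree up to renaming, so cluster numbers do not have to match the process names.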

Details

The dataset was generated by data-raw/generate_geneexpr.R using set.seed(2026). Each of the five biological processes has a distinct temporal expression prototype (sinusoids, cosinusoids, and a linear trend across 20 conditions), and the 20 genes belonging to each process are sampled around the prototype with Gaussian noise (sigma = 0.5).
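
The following is a minimal sketch of how a dataset with this structure could be simulated. It is not the bundled data-raw/generate_geneexpr.R script: the prototype shapes, the time grid, the seed and the gene names are all assumptions chosen for illustration, so the values will not match geneexpr.

```r
# Hypothetical simulation with the structure described above.
# NOT the package's own generator script.
set.seed(1)

n_cond     <- 20   # experimental conditions
n_per_proc <- 20   # genes per biological process
t_grid     <- seq(0, 2 * pi, length.out = n_cond)

# One prototype per process (assumed shapes)
prototypes <- rbind(
  p1 = sin(t_grid),
  p2 = cos(t_grid),
  p3 = sin(2 * t_grid),
  p4 = cos(2 * t_grid),
  p5 = seq(-1, 1, length.out = n_cond)   # linear trend
)

# Ground-truth labels: 20 genes for each of the five processes
process <- factor(rep(rownames(prototypes), each = n_per_proc))

# Each gene is its process prototype plus Gaussian noise (sigma = 0.5)
noise <- matrix(stats::rnorm(length(process) * n_cond, sd = 0.5),
                nrow = length(process), ncol = n_cond)
expression <- prototypes[as.character(process), ] + noise
rownames(expression) <- paste0("gene", seq_along(process))

dim(expression)
table(process)
```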

The go_sim matrix simulates the kind of semantic similarity produced by tools such as GOSemSim, where genes annotated to the same GO terms share high pairwise similarity. To use it as a distance, convert it via 1 - go_sim.
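
As a small illustration of that conversion, the sketch below builds a toy similarity matrix for six genes and checks that 1 - similarity behaves like a distance matrix. The toy values and the self-similarity of 1 on the diagonal are assumptions made for this example.

```r
# Toy similarity matrix for 6 genes in 2 processes (illustrative values)
process <- rep(c("p1", "p2"), each = 3)
same    <- outer(process, process, "==")
go_sim  <- ifelse(same, 0.85, 0.10)
diag(go_sim) <- 1          # assumed: a gene is fully similar to itself

# Similarity in [0, 1] becomes a dissimilarity in [0, 1]
d_go <- 1 - go_sim

isSymmetric(d_go)          # a distance matrix should be symmetric
all(diag(d_go) == 0)       # and have a zero diagonal
range(d_go)
```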

Typical workflow with moc.gapbk:

data(geneexpr)
d_expr <- as.matrix(stats::dist(geneexpr$expression, method = "euclidean"))
d_go   <- 1 - geneexpr$go_sim
res    <- moc.gapbk(d_expr, d_go, num_k = 5)

Source

Synthetic data generated by the script data-raw/generate_geneexpr.R bundled with the package.

Examples

data(geneexpr)
str(geneexpr)

# The five biological processes
table(geneexpr$process)

# Build the two distance matrices required by moc.gapbk
d_expr <- as.matrix(stats::dist(geneexpr$expression, method = "euclidean"))
d_go   <- 1 - geneexpr$go_sim

# Quick clustering (low parameters for the example)
set.seed(42)
res <- moc.gapbk(d_expr, d_go, num_k = 5,
                 generation = 5, pop_size = 6)
head(res$matrix.solutions)

Multi-Objective Clustering Guided by a-Priori Biological Knowledge (MOC-GaPBK)

Description

Runs the MOC-GaPBK algorithm proposed by Parraga-Alava and others (2018). It takes two distance matrices of the same dimensions and returns a set of non-dominated clustering solutions.

Usage

moc.gapbk(
  dmatrix1,
  dmatrix2,
  num_k,
  generation = 50,
  pop_size = 10,
  rat_cross = 0.8,
  rat_muta = 0.01,
  tour_size = 2,
  neighborhood = 0.1,
  local_search = FALSE,
  cores = 2
)

moc.gabk(...)

Arguments

dmatrix1

A square distance matrix. Must have the same dimensions as dmatrix2.

dmatrix2

A square distance matrix. Must have the same dimensions as dmatrix1. Typically encodes a-priori biological knowledge.

num_k

The number k of clusters represented by medoids in each individual. Must be greater than 1.

generation

Number of generations to be performed. Default 50.

pop_size

Size of the population. Default 10.

rat_cross

Probability of crossover. Default 0.80.

rat_muta

Probability of mutation. Default 0.01.

tour_size

Size of the tournament for parent selection. Default 2.

neighborhood

Fraction of the objects that forms the neighborhood explored by Pareto Local Search. A real value between 0 and 1. The neighborhood size is computed as neighborhood * num_objects. Default 0.10.

local_search

Logical. If TRUE, Path-Relinking (PR) and Pareto Local Search (PLS) are applied as intensification and diversification strategies. Default FALSE.

cores

Number of cores used by Path-Relinking. Default 2.

...

Arguments passed to moc.gapbk.
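
The constraints listed above can be checked before a long run. The helper check_inputs below is a hypothetical sketch, not a function exported by the package. Rounding the neighborhood size to the nearest integer is an assumption; the documentation only states the product neighborhood * num_objects.

```r
# Hypothetical pre-flight check mirroring the documented constraints
check_inputs <- function(dmatrix1, dmatrix2, num_k, neighborhood = 0.1) {
  stopifnot(
    is.matrix(dmatrix1), is.matrix(dmatrix2),
    nrow(dmatrix1) == ncol(dmatrix1),          # square
    identical(dim(dmatrix1), dim(dmatrix2)),   # same dimensions
    num_k > 1,
    neighborhood >= 0, neighborhood <= 1
  )
  num_objects <- nrow(dmatrix1)
  list(
    num_objects = num_objects,
    # Documented product; round() is an assumption of this sketch
    neighborhood_size = round(neighborhood * num_objects)
  )
}

set.seed(1)
x  <- matrix(stats::rnorm(100 * 20), nrow = 100)
d1 <- as.matrix(stats::dist(x, method = "euclidean"))
d2 <- as.matrix(stats::dist(x, method = "manhattan"))

check_inputs(d1, d2, num_k = 5, neighborhood = 0.1)
```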

Details

MOC-GaPBK couples NSGA-II with Path-Relinking and Pareto Local Search. Two versions of the Xie-Beni validity index are used as objectives, one per distance matrix.
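
To give a feel for the objectives, the sketch below computes a crisp, medoid-based Xie-Beni index from a distance matrix: the sum of squared distances from each object to its nearest medoid, divided by the number of objects times the smallest squared distance between two medoids. Lower values indicate compact, well-separated clusters. The function xie_beni is a hypothetical illustration, and the package's own objective functions may differ in detail.

```r
# Crisp, medoid-based Xie-Beni index (illustrative sketch only)
xie_beni <- function(dmat, medoids) {
  stopifnot(length(medoids) > 1)
  # Distance from every object to its nearest medoid
  nearest     <- apply(dmat[, medoids, drop = FALSE], 1, min)
  compactness <- sum(nearest^2)
  # Smallest squared distance between two distinct medoids
  between     <- dmat[medoids, medoids]
  separation  <- min(between[upper.tri(between)]^2)
  compactness / (nrow(dmat) * separation)
}

# Two well-separated toy groups of 20 points each
set.seed(1)
x <- rbind(matrix(stats::rnorm(40, mean = 0), ncol = 2),
           matrix(stats::rnorm(40, mean = 6), ncol = 2))
d_a <- as.matrix(stats::dist(x, method = "euclidean"))
d_b <- as.matrix(stats::dist(x, method = "manhattan"))

# One objective value per distance matrix for the same pair of medoids
medoids <- c(1, 21)
c(obj1 = xie_beni(d_a, medoids), obj2 = xie_beni(d_b, medoids))
```

Evaluating the same medoids against both matrices yields the pair of objective values that a multi-objective search would trade off.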

moc.gabk (note the missing p) is a deprecated alias kept for backward compatibility with versions 0.1.x. New code should call moc.gapbk directly.
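
A deprecated alias of this kind is commonly written as a thin wrapper that forwards its arguments. The sketch below shows the general pattern with the hypothetical functions new_name and old_name. Whether the package's alias emits a deprecation warning is not stated here, so the call to .Deprecated is an assumption.

```r
# Generic deprecated-alias pattern (hypothetical function names)
new_name <- function(x, times = 2) {
  x * times
}

old_name <- function(...) {
  .Deprecated("new_name")   # base R helper that issues a deprecation warning
  new_name(...)             # forward all arguments unchanged
}

# The alias returns the same result as the new function
suppressWarnings(old_name(3))
new_name(3)
```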

Value

A named list with three elements:

population

A data frame containing the final population of medoids together with the values of the two objective functions, the Pareto ranking and the crowding distance, ordered accordingly.

matrix.solutions

A data frame whose columns are clustering solutions on the Pareto front. Each row corresponds to an object and each cell to its assigned cluster.

clustering

A list of named integer vectors. Element i is the partition produced by the i-th solution on the Pareto front.
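
For readers unfamiliar with the Pareto front, the sketch below marks which candidate solutions are non-dominated when both objectives are minimized, as is the case for the Xie-Beni index. The helper non_dominated and the objective values are hypothetical and do not come from a real run.

```r
# TRUE for rows of `obj` that no other row dominates.
# Row j dominates row i if it is no worse on every objective
# and strictly better on at least one (minimization).
non_dominated <- function(obj) {
  n    <- nrow(obj)
  keep <- rep(TRUE, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i != j &&
          all(obj[j, ] <= obj[i, ]) &&
          any(obj[j, ] < obj[i, ])) {
        keep[i] <- FALSE
        break
      }
    }
  }
  keep
}

# Hypothetical objective values for five candidate solutions
obj <- cbind(f1 = c(0.20, 0.30, 0.25, 0.50, 0.40),
             f2 = c(0.60, 0.40, 0.70, 0.30, 0.45))

obj[non_dominated(obj), ]   # the solutions on the Pareto front
```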

Author(s)

Jorge Parraga-Alava, Marcio Dorn, Mario Inostroza-Ponta

References

J. Parraga-Alava, M. Dorn, M. Inostroza-Ponta (2018). A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. BioData Mining. 11(1) 1-16. doi:10.1186/s13040-018-0178-4.

K. Deb, A. Pratap, S. Agarwal, T. Meyarivan (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2) 182-197.

F. Glover (1997). Tabu Search and Adaptive Memory Programming: Advances, Applications and Challenges. Interfaces in Computer Science and Operations Research. 1-75.

J. Dubois-Lacoste, M. Lopez-Ibanez, T. Stutzle (2015). Anytime Pareto local search. European Journal of Operational Research, 243(2) 369-385.

Examples

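# Simulate 50 objects described by 20 numeric features
set.seed(1)
x <- matrix(stats::runif(50 * 20, min = -5, max = 10),
            nrow = 50, ncol = 20)

# Two distance matrices computed over the same objects
dmatrix1 <- as.matrix(stats::dist(x, method = "euclidean"))
dmatrix2 <- as.matrix(stats::dist(x, method = "manhattan"))

# Short run (few generations, small population) to keep the example fast
res <- moc.gapbk(dmatrix1, dmatrix2, num_k = 3,
                 generation = 5, pop_size = 6)

# Each column is one Pareto-front solution; each row is an object
head(res$matrix.solutions)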