
GSoC 2026: Bringing Numba-Optimized Performance to scikit-bio

I was accepted into Google Summer of Code 2026 with NumFOCUS/scikit-bio for a 350-hour project replacing performance-critical Cython paths with Numba CPU and GPU implementations.

350h · Large project
2.4x · Prototype speedup
<1e-15 · Numerical agreement
May 25 · Coding starts

Accepted into GSoC 2026

On May 1, 2026, Google notified me that my proposal with NumFOCUS had been accepted for Google Summer of Code 2026. The accepted project title is Numba Optimized Implementations for scikit-bio's Performance-Critical Functions.

The project is under scikit-bio, a BSD-licensed Python library for biological omic data analysis. It sits inside the NumFOCUS ecosystem and is used in workflows around distance matrices, ordination, compositional data analysis, diversity metrics, and biological sequence analysis.

This is not a generic performance project. The work targets the computational core behind PCoA, Mantel tests, PERMANOVA, and related bioinformatics workflows where large biological datasets make performance and memory behavior matter.

The Problem: Fast CPU Code, GPU-Blind Internals

scikit-bio currently has two important execution paths for numerical code:

Cython
  Strength: Fast on CPU with single-pass memory patterns
  Limitation: Cannot directly operate on GPU-backed arrays, because typed memoryviews require host memory

Array API / xp
  Strength: Works across NumPy, CuPy, JAX, PyTorch, and Dask-style backends
  Limitation: Often slower on CPU, because vectorized operations make multiple passes through the matrix

That creates a divergence problem: the fastest CPU code paths are locked behind Cython, while the GPU-capable paths live in separate pure Array API implementations. Over time, two parallel implementations become harder to maintain and more likely to drift apart in subtle ways.

The Proposed Solution: One Source, Three Execution Paths

The accepted proposal replaces selected Cython functions with Numba implementations. Numba lets us write Python code that compiles to optimized native CPU code via LLVM, while also giving an explicit CUDA kernel path for GPU execution.

1. Numba CPU

@njit(parallel=True, cache=True) for NumPy arrays on CPU. This is the default fast path and should match or exceed the existing Cython implementation.

2. Numba CUDA

@cuda.jit for memory-bandwidth-bound operations where a fused kernel beats multiple Array API kernels on GPU-backed arrays.

3. Array API fallback

The xp path remains the safety net for backends like JAX, PyTorch, Dask, or environments where Numba is not installed.
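To make the CUDA path concrete, the centering-application step could be fused into a single kernel. This is a sketch under assumptions, not project code: the kernel name and launch configuration are hypothetical, and the import is guarded because Numba is an optional dependency.

```python
import math

try:
    from numba import cuda
    HAVE_NUMBA = True
except ImportError:  # Numba is optional; the Array API fallback still works
    HAVE_NUMBA = False

if HAVE_NUMBA:
    @cuda.jit
    def center_apply_kernel(E, row_means, global_mean):
        # One thread per matrix element: subtract the row, column, and
        # global means in a single pass over GPU memory, instead of
        # launching several separate vectorized kernels.
        i, j = cuda.grid(2)
        n = E.shape[0]
        if i < n and j < n:
            E[i, j] = E[i, j] - row_means[i] - row_means[j] + global_mean

    def center_apply_gpu(E, row_means, global_mean):
        # Hypothetical launcher: 16x16 thread blocks tiling the matrix.
        tpb = (16, 16)
        blocks = (math.ceil(E.shape[0] / 16), math.ceil(E.shape[1] / 16))
        center_apply_kernel[blocks, tpb](E, row_means, global_mean)
```

The point of the fusion is bandwidth: the matrix is read and written once, which is exactly the case where a hand-written kernel can beat a chain of Array API operations.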

Numba will be an optional dependency. If it is unavailable, users still get correct results through the existing Array API-compatible fallback path.
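A minimal sketch of how that optional-dependency dispatch might look (hypothetical names; the real scikit-bio wiring will differ, and the production path would use parallel prange rather than this serial loop):

```python
import numpy as np

def _center_impl(D):
    # Gower centering written as explicit loops: valid pure Python and
    # also valid Numba nopython code, so one source serves both paths.
    n = D.shape[0]
    E = np.empty_like(D)
    row_means = np.empty(n, dtype=np.float64)
    total = 0.0
    for i in range(n):
        row_sum = 0.0
        for j in range(n):
            val = -0.5 * D[i, j] * D[i, j]
            E[i, j] = val
            row_sum += val
        row_means[i] = row_sum / n
        total += row_sum
    global_mean = total / (n * n)
    for i in range(n):
        for j in range(n):
            E[i, j] = E[i, j] - row_means[i] - row_means[j] + global_mean
    return E

try:
    from numba import njit
    _center_fast = njit(_center_impl)   # compiled path when Numba is present
except ImportError:
    _center_fast = _center_impl        # correct pure-Python fallback

def center_distance_matrix(D):
    return _center_fast(np.ascontiguousarray(D, dtype=np.float64))
```

Either branch returns the same values; only the execution speed changes.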

The Prototype That Made the Proposal Concrete

The core prototype was for center_distance_matrix, which is used in Principal Coordinates Analysis (PCoA). The algorithm performs Gower centering: scale the squared distance matrix by -0.5, accumulate row means, compute a global mean, and apply the centering.

center_distance_matrix, Numba prototype:

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def center_distance_matrix_numba(D):
    n = D.shape[0]
    E = np.empty((n, n), dtype=D.dtype)
    row_means = np.empty(n, dtype=D.dtype)
    global_sum = 0.0

    # Pass 1: scale the squared distances and accumulate row sums as we go.
    # The scalar += on global_sum is recognized by Numba as a parallel reduction.
    for i in prange(n):
        row_sum = 0.0
        for j in range(n):
            val = -0.5 * D[i, j] * D[i, j]
            E[i, j] = val
            row_sum += val
        row_means[i] = row_sum / n
        global_sum += row_sum

    global_mean = global_sum / (n * n)

    # Pass 2: apply the row, column, and global mean corrections in place.
    for i in prange(n):
        for j in range(n):
            E[i, j] = E[i, j] - row_means[i] - row_means[j] + global_mean
    return E

The important part is not the decorator alone. It is the memory-access pattern: one pass for squaring and row accumulation, then one pass for centering. That mirrors the Cython advantage while keeping the implementation in Python.
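For contrast, here is what the multi-pass Array API style looks like in plain NumPy (a sketch, not scikit-bio's actual xp implementation); each commented pass is a separate traversal of the n-by-n matrix:

```python
import numpy as np

def center_distance_matrix_xp(D):
    E = -0.5 * D * D                    # pass 1: square and scale
    row_means = E.mean(axis=1)          # pass 2: row reductions
    global_mean = E.mean()              # pass 3: global reduction
    # pass 4: broadcasted centering materializes the result
    return E - row_means[:, None] - row_means[None, :] + global_mean
```

A Gower-centered matrix has zero row sums, which makes a convenient correctness check when comparing this against the fused two-pass version.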

Matrix size   Cython     Numba @njit(parallel=True)
n = 500       0.39 ms    0.34 ms
n = 1000      1.75 ms    1.01 ms
n = 3000      19.3 ms    7.91 ms

At n=3000, the Numba prototype was 2.4x faster than Cython in the same benchmark session, with numerical agreement below 1e-15.

What I Will Build

The accepted scope focuses on the distance and ordination modules: the places where performance-critical matrix operations appear repeatedly.

Tier 1: Ordination Cython replacements

Replace e_matrix_means_cy, f_matrix_inplace_cy, and center_distance_matrix_cy. These underpin PCoA and are the direct continuation of the working prototype.

Tier 2: Distance module Cython replacements

Implement Numba versions of Mantel permutation correlation, PERMANOVA pseudo-F statistics, and geometric median routines where the loop and reduction patterns fit Numba.
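For orientation, the Mantel statistic and its permutation test can be sketched in plain NumPy (illustrative only; the Numba version would move the permutation loop into compiled code, and scikit-bio's actual API and defaults differ):

```python
import numpy as np

def mantel_r(A, B):
    # Pearson correlation of the upper triangles of two distance matrices.
    iu = np.triu_indices_from(A, k=1)
    return np.corrcoef(A[iu], B[iu])[0, 1]

def mantel_permutation_pvalue(A, B, n_perm=999, seed=0):
    # Permute rows and columns of B jointly and count correlations at
    # least as extreme as the observed one (one-sided, "greater").
    rng = np.random.default_rng(seed)
    observed = mantel_r(A, B)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(A.shape[0])
        if mantel_r(A, B[np.ix_(p, p)]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

The permutation loop is exactly the shape Numba handles well: a hot loop of reductions over contiguous arrays.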

Tier 3: Distance utilities

Port validation and matrix reordering utilities such as is_symmetric_and_hollow and distmat_reorder where straightforward Numba implementations can remove more Cython dependency surface.
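As a sketch of why these utilities suit Numba: an explicit-loop check can return on the first violation, while a vectorized check always touches the whole matrix (illustrative code, not the scikit-bio implementation):

```python
import numpy as np

def is_symmetric_and_hollow_vectorized(D):
    # Always reads all n*n elements, even if D[0, 1] != D[1, 0].
    return bool(np.array_equal(D, D.T) and not np.any(np.diag(D)))

def is_symmetric_and_hollow_loops(D):
    # Early exit on the first asymmetric or nonzero-diagonal element;
    # the same loop compiles cleanly under @njit.
    n = D.shape[0]
    for i in range(n):
        if D[i, i] != 0.0:
            return False
        for j in range(i + 1, n):
            if D[i, j] != D[j, i]:
                return False
    return True
```

On valid inputs both cost a full scan, but on malformed inputs the loop version can bail out almost immediately, which matters for large matrices in validation hot paths.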

Tier 4: Ordination utilities

Add Array API and Numba-aware paths for utilities like mean_and_std, scale, corr, e_matrix, and f_matrix.

Not every Cython file is a good Numba target. Alignment traceback, tree traversal, phylogenetic dictionary lookups, and treap-like metadata structures are explicitly out of scope because they do not match Numba's strengths.

Mentorship

Igor Sfiligoi

Primary mentor from UC San Diego / San Diego Supercomputer Center. His work focuses on HPC, GPU acceleration, cache-aware restructuring, and bioinformatics performance.

Qiyun Zhu

Co-mentor from Arizona State University and a key architect of scikit-bio's Array API migration, including the ingest_array / xp pattern.

Matthew Aton

Co-mentor from Arizona State University working on Array API infrastructure such as decorators for backend-compatible functions and tests.

The technical bar is clear: benchmarks must be honest, numerical claims must be verified, and performance work must respect memory access patterns rather than chasing decorator-based speedups blindly.

Timeline

1. Community Bonding: May 1 - May 24

Set up the development environment, confirm scope, audit Cython files, reproduce prototype benchmarks, study mentor papers, and prepare a warm-up PR.

2. Weeks 1-6: May 25 - July 5

Build Numba integration infrastructure, replace ordination Cython functions, implement Mantel and PERMANOVA Numba paths, and prepare for midterm evaluation.

3. Midterm: July 6 - July 10

Target state: Tier 1 complete and merged, with Mantel and PERMANOVA implementations ready for review.

4. Weeks 7-12: July 11 - August 17

Finish utilities, GPU kernels, benchmark tables, documentation, review cleanup, final report, and final blog post.

Why This Matters

scikit-bio is not an isolated library. It supports downstream scientific workflows in ecosystems like QIIME 2, Emperor, and Qiita. When performance-critical matrix operations become faster and more GPU-capable, the benefit can propagate into real biological data analysis pipelines.

The engineering lesson is bigger than one library: modern scientific Python needs graceful heterogeneity. CPU should be fast. GPU should be possible. Fallback paths should be correct. Maintainers should not have to carry multiple drifting implementations forever.

My goal for the summer is simple: make scikit-bio's fastest paths easier to maintain, easier to benchmark, and ready for CPU and GPU execution without compromising numerical correctness.