
The digital revolution has arrived with force in the field of biomedicine, and one of its most promising applications is the discovery of potential biomarkers for personalised cancer diagnosis and treatment. In an era where genomic data is not only vast but also increasingly complex, traditional analytical methods often prove insufficient. The demand for faster, more accurate, and scalable solutions has brought two transformative technologies into the spotlight: High Performance Computing (HPC) and Artificial Intelligence (AI).
Their convergence is not merely technological; it represents a paradigm shift. The question is no longer whether these technologies can change the way we approach cancer research, but how deeply they are already transforming it. And in this transformation, biomarker discovery is one of the frontiers where their synergy is most evident.
What are biomarkers and why are they so important?
Biomarkers are measurable indicators of a biological condition, and in oncology, they are often genes whose expression changes in the presence of a tumor, proteins involved in immune regulation, or molecular elements linked to cancer development and progression. Their identification is fundamental to what we now call precision medicine. Biomarkers can signal the presence of a tumor before clinical symptoms appear, differentiate between subtypes of cancer that respond differently to treatments, help anticipate the efficacy of therapies, or inform the design of drugs tailored to specific molecular profiles, among other applications.
However, the search for these biomarkers is complex. It involves examining the expression of tens of thousands of genes across several patient samples, comparing tumor and control tissues, and extracting patterns that are not only statistically significant but also biologically meaningful. The high dimensionality, noise, and heterogeneity of omics data make this a computationally intensive task. That is precisely where HPC and AI become essential allies.
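To make the core of such a screen concrete, the sketch below runs a per-gene differential expression test in Python. The data is entirely synthetic (one gene is planted as up-regulated in "tumor" samples so the test has something to find), and the Benjamini-Hochberg step is one common way, not necessarily the one used in any specific study, to control the false discovery rate across thousands of simultaneous tests:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_genes, n_tumor, n_control = 1000, 20, 20

# Synthetic expression matrices (genes x samples); gene 0 is planted
# as up-regulated in tumor tissue.
tumor = rng.normal(size=(n_genes, n_tumor))
control = rng.normal(size=(n_genes, n_control))
tumor[0] += 3.0

# Per-gene two-sample t-test between tumor and control groups.
_, pvals = ttest_ind(tumor, control, axis=1)

# Benjamini-Hochberg correction: scale sorted p-values, then take the
# running minimum from the largest rank downward.
order = np.argsort(pvals)
scaled = pvals[order] * n_genes / (np.arange(n_genes) + 1)
qvals = np.empty(n_genes)
qvals[order] = np.minimum.accumulate(scaled[::-1])[::-1]

# Genes surviving the FDR threshold become biomarker candidates.
candidates = np.flatnonzero(qvals < 0.05)
```

Because every gene is tested independently, this is an embarrassingly parallel workload, exactly the kind of computation that HPC resources accelerate almost for free.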
HPC + AI: The computational alliance driving precision medicine
While HPC provides the raw power to handle massive datasets in reduced timeframes, AI contributes intelligence, adaptability, and the capacity to uncover patterns hidden in noise. Together, they represent a new way of doing science—one that does not rely solely on hypothesis-driven exploration but also leverages data-driven discovery.
By combining scalable computing environments with machine learning algorithms, researchers can now construct predictive models that learn from thousands of features, detect nonlinear relationships among genes, and even anticipate which genetic alterations may be driving a particular type of cancer. From traditional supervised methods like support vector machines or random forests to more complex deep learning approaches such as autoencoders or artificial neural networks, these tools are being adapted to transcriptomic data at a scale previously unimaginable. Unsupervised methods such as gene biclustering, evolutionary algorithms, and gene co-expression networks also benefit from this computational symbiosis.
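As a minimal illustration of the supervised side, the sketch below trains a random forest on a fabricated expression matrix and ranks genes by impurity-based feature importance, a common first pass at biomarker prioritization. The cohort, labels, and the three "informative" genes are all synthetic assumptions made for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_samples, n_genes = 80, 200

# Synthetic cohort: expression matrix (samples x genes) with binary
# tumor/control labels; genes 0-2 carry the class signal.
X = rng.normal(size=(n_samples, n_genes))
y = rng.integers(0, 2, size=n_samples)
X[y == 1, :3] += 2.0

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Impurity-based importances give a cheap first ranking of candidate
# biomarker genes for downstream validation.
top10 = np.argsort(clf.feature_importances_)[::-1][:10]
```

In practice such rankings are only a starting point: importance scores are then cross-checked against network topology and biological databases before a gene is proposed as a candidate.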
One example from our own research is pyEnGNet, a gene co-expression network algorithm designed specifically for parallel execution in multi-CPU and multi-GPU environments. Its ability to extract biologically coherent gene modules from transcriptomic datasets—by correlating expression patterns and evaluating network topology—has been instrumental in biomarker discovery across several cancer types.
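pyEnGNet's actual API is not reproduced here; the sketch below only illustrates the general idea behind a co-expression network, correlating expression profiles and keeping strongly correlated pairs as edges, on synthetic data where five genes share a latent driver signal and should therefore surface as a module:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(2)
n_genes, n_samples = 30, 50

# Synthetic expression matrix; genes 0-4 share a latent driver signal
# and should appear as a coherent co-expression module.
expr = rng.normal(size=(n_genes, n_samples))
driver = rng.normal(size=n_samples)
expr[:5] += 3.0 * driver

# Pairwise Pearson correlation between gene expression profiles.
corr = np.corrcoef(expr)

# Keep only strongly correlated pairs as network edges.
threshold = 0.7
G = nx.Graph()
G.add_nodes_from(range(n_genes))
for i in range(n_genes):
    for j in range(i + 1, n_genes):
        if abs(corr[i, j]) >= threshold:
            G.add_edge(i, j, weight=float(corr[i, j]))
```

The all-pairs correlation step is what makes parallel execution so attractive: every gene pair can be scored independently, so the work distributes naturally across CPU cores and GPUs.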
Real-world cases: Sarcomas and the discovery of biomarkers with HPC and AI
This fusion of HPC and AI is not just theoretical—it is already being applied to real-world cases, such as the study of sarcomas. These rare and aggressive cancers present a unique challenge due to their genetic heterogeneity and limited available data. Nevertheless, researchers from the SIALAB team have successfully deployed AI-HPC pipelines to investigate various sarcoma subtypes, including leiomyosarcoma (LMS), Malignant Peripheral Nerve Sheath Tumor (MPNST), osteosarcoma, and Ewing’s sarcoma.
By working with transcriptomic data from RNA-Seq and microarrays, the team implemented a multi-stage methodology that included differential expression analysis, network-based modeling, and intelligent prioritization of candidate genes. These pipelines were executed on multi-GPU architectures to drastically reduce computational time. The combination of graph-theoretic metrics and machine learning led to the identification of several promising biomarkers. For example, in LMS, the genes CSF1R and SOX9 stood out. In MPNST, IKZF3, RXRA, E2F3, and TBX19 were prioritized. In Ewing’s sarcoma, the team found COL11A1, VCAN, BUB1B, CDC20, UBE2C, and AURKA as relevant candidates. For osteosarcoma, transcription factors like NKX2-1, TAL1, GFI1, and IKZF1 emerged as key elements. Importantly, many of these genes had never been linked to these specific subtypes before, opening the door to further clinical validation and potentially life-changing treatments.
A methodology driven by HPC
The success of these findings lies in a well-structured methodology, where each phase has been designed to take full advantage of HPC and AI resources. The pipeline begins with preprocessing and normalization of the raw transcriptomic data to ensure high quality. Then, differential expression analysis is performed using statistical and machine learning techniques to identify significant changes between tumor and control samples.
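The preprocessing and normalization phase can be illustrated with a minimal quantile-normalization sketch. The data below is synthetic, and a real pipeline would add filtering, batch correction, and platform-specific steps; this only shows the basic idea of forcing all samples onto a comparable distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes, n_samples = 500, 6

# Raw intensities with per-sample scale differences, mimicking
# technical variation between arrays or sequencing runs.
raw = rng.lognormal(mean=5.0, sigma=1.0, size=(n_genes, n_samples))
raw *= rng.uniform(0.5, 2.0, size=n_samples)

# Log transform compresses the dynamic range before normalization.
logged = np.log2(raw + 1.0)

def quantile_normalize(X):
    """Map every sample (column) onto the same empirical distribution."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    row_means = np.sort(X, axis=0).mean(axis=1)
    return row_means[ranks]

norm = quantile_normalize(logged)
```

After this step, every sample shares the same value distribution, so downstream differential expression tests compare biology rather than technical artifacts.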
Next comes the construction of gene co-expression networks, which are accelerated through multiprocessing in GPU environments. These networks are then subjected to topological analysis and compared with known biological interaction databases using metrics such as Gene Network Coherence (GNC). Finally, machine learning models analyze the networks, identify hub genes, detect modules of interest, and prioritize potential biomarkers for further study. Most of this process has been implemented using bioScience, a custom Python-based HPC library tailored for large-scale omics analysis.
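Hub identification on such a network can be as simple as a centrality ranking. The toy graph below uses made-up gene names and degree centrality only; the GNC metric and the internals of bioScience are not modeled here:

```python
import networkx as nx

# Toy co-expression network; gene names are illustrative, not taken
# from any of the studies discussed in the text.
edges = [("HUB", f"G{i}") for i in range(1, 7)] + [("G7", "G8"), ("G8", "G9")]
G = nx.Graph(edges)

# Degree centrality as a simple hub score; production pipelines also
# weigh betweenness, clustering coefficients, and module membership.
centrality = nx.degree_centrality(G)
hub_gene = max(centrality, key=centrality.get)
```

Highly connected genes like this are prioritized because perturbations in hubs tend to propagate through the network, making them plausible drivers and drug targets.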
Current challenges and future opportunities
Despite these advances, the field still faces critical challenges. One of the most urgent is the development of more interpretable AI models. In clinical settings, it is not enough for an algorithm to provide accurate predictions; it must also explain the rationale behind its decisions in a way that is understandable and trustworthy to physicians and researchers.
Another challenge lies in the integration of multi-omic data. Cancer is not only a genomic disease—it is also shaped by epigenetic modifications, transcriptomic shifts, proteomic dynamics, and metabolic context. Incorporating all of these layers into a unified analytical framework will require even more powerful HPC infrastructures and more sophisticated AI models.
The validation of computational biomarkers in clinical environments is also a crucial step. Algorithms may suggest compelling candidates, but only experimental evidence in large, diverse cohorts can confirm their true relevance. Looking ahead, the most exciting opportunity may come from adaptive systems—models that automatically adjust their thresholds, algorithms, and feature selection strategies depending on the cancer subtype or the characteristics of the input data.
The fusion of HPC and AI is not just accelerating the pace of research; it is changing its nature. We are moving from static models to dynamic, self-improving systems capable of learning directly from the biology of disease. This marks a profound shift: from treating cancer as a generic disease to understanding it as a molecularly diverse set of conditions, each with its own vulnerabilities and therapeutic windows.



