Applying Data-Centric AI to Improve a Single-cell RNA-seq Pipeline

Recent advances in genomics have produced technology that can measure gene expression in individual cells. While the end result is a simple matrix of per-gene, per-cell counts, the general process that produces this data is complex, noisy, and unsettled. Current approaches use pipelines that involve data mining elements, in particular clustering. Single-Cell Consensus Clustering (SC3) is a state-of-the-art method that uses traditional k-means in its pipeline. Inspired by recent data-centric AI initiatives, this work examines whether a new data-centric k-means can significantly improve this pipeline in a real-world scenario. The results show that the data-centric approach significantly improves the SC3 pipeline by reducing the computational load spent on k-means clustering. We build a data-centric k-means, k-means-d, that keeps track of so-called data expressiveness, potentially skipping a significant number of irrelevant computations while otherwise remaining identical to k-means. We also prove that k-means-d matches k-means in accuracy while potentially using fewer pairwise distance computations.
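To make the idea of "the same clustering with fewer distance computations" concrete, here is a minimal Python sketch. It is not the k-means-d algorithm from the paper, whose expressiveness bookkeeping is not described in this abstract. As a stand-in it uses a well-known triangle-inequality bounding trick (in the style of Hamerly's accelerated k-means) to skip points whose assignment cannot change. The function names `lloyd` and `pruned`, the synthetic data, and the choice of initial centers are all assumptions made for illustration.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def update_centers(points, labels, centers):
    """Move each center to the mean of its members (keep it if empty)."""
    new = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new.append((sum(p[0] for p in members) / len(members),
                        sum(p[1] for p in members) / len(members)))
        else:
            new.append(old)
    return new


def lloyd(points, centers, iters):
    """Plain k-means: every point is compared with every center each pass."""
    count = 0
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            d = [dist(p, c) for c in centers]
            count += len(centers)
            labels[i] = d.index(min(d))
        centers = update_centers(points, labels, centers)
    return labels, count


def pruned(points, centers, iters):
    """Same updates as lloyd, but skips points whose label cannot change.

    Assumes k >= 2. upper[i] bounds the distance to the assigned center
    from above; lower[i] bounds the distance to every other center from below.
    """
    count = 0
    n, k = len(points), len(centers)
    labels = [0] * n
    upper = [math.inf] * n
    lower = [0.0] * n
    for _ in range(iters):
        for i, p in enumerate(points):
            if upper[i] < lower[i]:
                continue  # assigned center is provably still the closest
            d = [dist(p, c) for c in centers]
            count += k
            labels[i] = d.index(min(d))
            upper[i] = d[labels[i]]
            lower[i] = min(x for j, x in enumerate(d) if j != labels[i])
        new_centers = update_centers(points, labels, centers)
        shift = [dist(a, b) for a, b in zip(centers, new_centers)]
        count += k  # center-movement distances are counted as well
        for i in range(n):
            upper[i] += shift[labels[i]]  # assigned center may have moved away
            lower[i] -= max(shift)        # other centers may have moved closer
        centers = new_centers
    return labels, count


# Synthetic data (an assumption): three well-separated 2-D blobs of 50 points.
random.seed(0)
blobs = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(random.gauss(cx, 1.0), random.gauss(cy, 1.0))
          for cx, cy in blobs for _ in range(50)]
init = [points[0], points[50], points[100]]  # one starting center per blob

labels_a, count_a = lloyd(points, init, 10)
labels_b, count_b = pruned(points, init, 10)
print("same labels:", labels_a == labels_b)
print("distance computations:", count_a, "vs", count_b)
```

Both functions perform the same center updates, so the comparison isolates how many distance computations each one spends. On data like this the bounded variant does most of its work in the first pass, which mirrors the kind of saving the abstract describes. The mechanism k-means-d uses to decide which computations to skip may differ.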

Hasan Kurban
Computer & Data Scientist

I’m a computer scientist & machine learning researcher who loves building intelligent systems to find data-driven solutions to real-world problems.