Applying Data-Centric AI to Improve a Single-cell RNA-seq Pipeline
Recent advances in genomics have produced technology that can measure gene expression in individual cells. While the end result is a simple matrix of counts indexed by gene and cell, the process that produces this data is complex, noisy, and unsettled. Current approaches use pipelines that involve data mining elements, in particular clustering. Single-Cell Consensus Clustering (SC3) is the state of the art and uses traditional k-means in its pipeline. Inspired by recent data-centric AI initiatives, this work examines whether a new data-centric k-means, k-means-d, can significantly improve this pipeline in a real-world scenario. The results show that the data-centric approach significantly improves the SC3 pipeline by reducing the computational load spent on k-means clustering. We build k-means-d to keep track of so-called data expressiveness, which lets it potentially skip a significant number of irrelevant computations while otherwise remaining identical to k-means. We also prove that k-means-d matches k-means in accuracy while potentially using fewer pairwise distance computations.