FastImpute: Development and Validation of a Workflow for Open-source, Reference-Free Genotype Imputation Methods - An Example in Breast Cancer (PRS313_BC)
Abstract
Introduction
Genotype imputation improves the resolution of genetic data, but traditional methods are computationally intensive or compromise privacy, and deep learning alternatives are often too large for client-side deployment. We developed FastImpute, a workflow that creates lightweight, reference-free imputation models and enables real-time, accessible genetic risk assessment on edge devices.
Methods
Using whole-genome sequencing data from 2,504 individuals in the 1000 Genomes Project, linear and logistic regression models were trained to impute the single-nucleotide polymorphisms (SNPs) used in the breast cancer polygenic risk score PRS313_BC. The models took SNPs present on commercial genotyping arrays as input. Performance was evaluated against the sequencing data and benchmarked against Beagle.
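To make the training step concrete, the following is a minimal sketch of fitting one linear imputation model for a single target SNP. It is not the FastImpute implementation: the function names (`fit_linear_imputer`, `impute_dosage`), the small ridge term, the clipping to the 0 to 2 dosage range, and the synthetic genotypes are all assumptions made for illustration.

```python
import random

def fit_linear_imputer(X, y, ridge=1e-6):
    """Least-squares fit of y ~ b0 + X.b via the normal equations.

    X: per-individual lists of array-SNP dosages (0, 1 or 2).
    y: dosages of the target SNP taken from sequencing data.
    The tiny ridge term (an assumption) only keeps the system well
    conditioned; it is applied to every coefficient, intercept included.
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept
    d = len(rows[0])
    # Augmented matrix [A^T A + ridge*I | A^T y]
    M = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(d)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(d):
        piv = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(d):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][d] / M[i][i] for i in range(d)]

def impute_dosage(coef, x):
    """Predict the target-SNP dosage and clip it to the valid range [0, 2]."""
    raw = coef[0] + sum(c * v for c, v in zip(coef[1:], x))
    return min(2.0, max(0.0, raw))

# Synthetic stand-in data (hypothetical, not from the 1000 Genomes Project):
# three array SNPs, and a target SNP fully determined by the first two.
random.seed(0)
freqs = [0.3, 0.5, 0.2]
X = [[sum(random.random() < f for _ in range(2)) for f in freqs]
     for _ in range(200)]
y = [0.5 * r[0] + 0.5 * r[1] for r in X]

coef = fit_linear_imputer(X, y)
predicted = impute_dosage(coef, [2, 0, 1])
```

In a real workflow one such model would be fit per target SNP, and the resulting coefficient vectors are small enough to ship to a browser or phone, which is the property the abstract relies on for edge deployment.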
Results
The polygenic risk score (PRS) calculated with our linear model correlated strongly with the PRS from true sequencing data (R² = 0.86), significantly outperforming no imputation and minor allele frequency imputation (R² = 0.38). Our logistic model correctly identified 4 of 6 individuals in the top 1% of breast cancer risk, matching Beagle’s performance.
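The evaluation reported above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: it assumes the PRS is a weighted sum of per-SNP dosages, that R² is the squared Pearson correlation between the two sets of scores, and that the top 1% is taken by simple ranking. The helper names and toy values are hypothetical.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of SNP dosages (weights are per-SNP effect sizes)."""
    return sum(w * d for w, d in zip(weights, dosages))

def r_squared(a, b):
    """Squared Pearson correlation between two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov * cov / (var_a * var_b)

def top_fraction(scores, fraction=0.01):
    """Indices of the individuals with the highest scores (at least one)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

# Toy example with made-up weights and dosages for four individuals.
weights = [0.2, -0.1, 0.05]
true_dosages = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
imputed_dosages = [[0.1, 1.0, 1.8], [0.9, 1.2, 0.0],
                   [1.7, 0.1, 1.0], [0.2, 1.9, 2.0]]

prs_true = [polygenic_risk_score(d, weights) for d in true_dosages]
prs_imputed = [polygenic_risk_score(d, weights) for d in imputed_dosages]

agreement = r_squared(prs_true, prs_imputed)
# Individuals flagged as high risk by both the true and the imputed PRS
shared_high_risk = top_fraction(prs_true, 0.25) & top_fraction(prs_imputed, 0.25)
```

Comparing the two high-risk sets in this way mirrors the "4 of 6 individuals" style of result: it counts how many truly high-risk individuals remain in the top stratum after imputation.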
Discussion
Our approach balances performance and efficiency, enabling deployment on personal devices and preserving user privacy through local data processing. It thereby democratizes access to genetic risk assessment based on direct-to-consumer data. However, this proof of concept requires validation across other genomic contexts before clinical use.
Conclusion
The FastImpute pipeline demonstrates that lightweight models can enable real-time genetic risk assessment on edge devices.
