PREPRINT: VCF2Prot: An Efficient and Parallel Tool for Generating Personalized Proteomes from VCF Files

Authors: 

Hesham ElAbd, Frauke Degenhardt, Tobias L. Lenz, Andre Franke, Mareike Wendorff

Year of Publication: 

PREPRINT: 2022, Jan 21

medium or publishing house / place:

bioRxiv

related to project: 

Motivation: Generating sample-specific protein sequences is a crucial step in neo-antigen discovery, cancer vaccine development, and proteogenomics. The rapid increase in sequencer throughput has fueled large-scale genomic and transcriptomic studies, which hold great promise for the emerging field of personalized medicine. However, most sequencing projects store their data in the abbreviated Variant Call Format (VCF), which is not immediately amenable to subsequent proteomic and peptidomic analyses. Furthermore, processing such increasingly massive genome-scale datasets calls for parallel and concurrent programming, and consequently for refactoring existing algorithms and/or developing new parallel algorithms.

Results: Here, we introduce the sequence intermediate representation (SIR), a novel and generic algorithm for generating personalized, or sample-specific, protein sequences from a consequence-called VCF file and the corresponding reference proteome. An implementation of SIR, named VCF2Prot, was developed to aid personalized medicine and proteogenomics by generating personalized proteomes in FASTA format from a collection of consequence-called genomic alterations stored in a VCF file. Benchmarking VCF2Prot against the recently published PrecisionProDB showed a roughly 1,000-fold improvement in runtime, depending on the input size. Furthermore, in a scale-up study, VCF2Prot processed a VCF file containing 99,254 variants observed across 8,192 patients in ~11 minutes, demonstrating the massive improvement in execution speed and the utility of SIR and VCF2Prot in bridging large-scale genomic and proteomic studies.
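To make the core idea concrete, the following is a minimal Rust sketch of turning per-sample protein-level consequences plus a reference protein into FASTA records. It is not the SIR algorithm and does not use the ppgg crate's API: the `Substitution` type and the functions `apply_substitutions`, `to_fasta`, and `personalized_fasta` are hypothetical names introduced only for illustration, and the sketch assumes that consequences have already been parsed from the VCF file and that only single-residue substitutions occur (no insertions, deletions, frameshifts, or stop changes).

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified consequence: one amino-acid substitution
/// at a 1-based protein position (e.g. K at position 2 replaced by R).
#[derive(Debug, Clone, PartialEq)]
struct Substitution {
    position: usize, // 1-based position in the reference protein
    reference: char, // residue expected in the reference
    alternative: char, // residue observed in the sample
}

/// Apply substitutions to a reference protein, checking that each
/// substitution agrees with the residue found in the reference.
fn apply_substitutions(reference: &str, subs: &[Substitution]) -> Result<String, String> {
    let mut residues: Vec<char> = reference.chars().collect();
    for s in subs {
        let idx = s
            .position
            .checked_sub(1)
            .ok_or_else(|| "positions are 1-based; got 0".to_string())?;
        let slot = residues
            .get_mut(idx)
            .ok_or_else(|| format!("position {} is outside the protein", s.position))?;
        if *slot != s.reference {
            return Err(format!(
                "expected {} at position {}, found {}",
                s.reference, s.position, *slot
            ));
        }
        *slot = s.alternative;
    }
    Ok(residues.into_iter().collect())
}

/// Format one FASTA record, wrapping the sequence at `width` residues.
fn to_fasta(id: &str, sequence: &str, width: usize) -> String {
    let mut out = format!(">{}\n", id);
    let residues: Vec<char> = sequence.chars().collect();
    for line in residues.chunks(width.max(1)) {
        out.extend(line.iter());
        out.push('\n');
    }
    out
}

/// Build one FASTA record per sample for a single transcript.
/// Record identifiers are "<sample>_<transcript>" (an assumed convention).
fn personalized_fasta(
    transcript_id: &str,
    reference: &str,
    samples: &BTreeMap<String, Vec<Substitution>>,
) -> Result<String, String> {
    let mut fasta = String::new();
    for (sample, subs) in samples {
        let sequence = apply_substitutions(reference, subs)?;
        fasta.push_str(&to_fasta(&format!("{}_{}", sample, transcript_id), &sequence, 60));
    }
    Ok(fasta)
}
```

Because each sample's sequence depends only on the shared reference and that sample's own substitutions, the loop in `personalized_fasta` has no cross-sample dependencies, which is the kind of structure that lends itself to parallel execution over samples.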

Availability and Implementation: VCF2Prot is released under the permissive MIT license, which allows both commercial and non-commercial use of the tool. The source code, along with precompiled versions for Linux and macOS, is available at https://github.com/ikmb/vcf2prot. The modular units used for building VCF2Prot are available as a Rust crate at https://crates.io/crates/ppgg, with documentation and examples at https://docs.rs/ppgg/0.1.4/ppgg/, under the same MIT license.