RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes
arXiv - Quantitative Biology - Genomics. Pub Date: 2023-01-22, DOI: arxiv-2301.09200
Can Firtina, Nika Mansouri Ghiasi, Joel Lindegger, Gagandeep Singh, Meryem Banu Cavlak, Haiyu Mao, Onur Mutlu

Nanopore sequencers generate electrical raw signals in real time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either 1) require powerful computational resources that may not be available for portable sequencers or 2) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: 1) read mapping, 2) relative abundance estimation, and 3) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better average throughput and 2) an average speedup of 32.1x and 2.1x in the mapping time, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.
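
The quantize-then-hash idea described in the abstract can be illustrated with a small toy sketch. This is not RawHash's actual implementation: the function names, the uniform 16-level quantizer, the clipping range, and the window length of 4 are all assumptions chosen for illustration, and real signals would first need event detection and normalization, which are omitted here. The sketch only shows why slightly noisy values that fall into the same quantization bucket produce the same hash key and can therefore be found with an exact-match lookup.

```python
from collections import defaultdict

# Illustrative assumptions, not RawHash parameters.
LO, HI = -3.0, 3.0      # clipping range for normalized signal values
LEVELS = 16             # number of quantization buckets
BITS = 4                # bits per bucket id (2**4 == LEVELS)


def quantize(value):
    """Map a normalized signal value to an integer bucket in [0, LEVELS)."""
    clipped = min(max(value, LO), HI)
    width = (HI - LO) / LEVELS
    return min(int((clipped - LO) / width), LEVELS - 1)


def hash_window(values):
    """Pack the bucket ids of consecutive values into one integer key."""
    key = 0
    for v in values:
        key = (key << BITS) | quantize(v)
    return key


def build_index(reference, window=4):
    """Index every window of the reference signal by its hash key."""
    index = defaultdict(list)
    for i in range(len(reference) - window + 1):
        index[hash_window(reference[i:i + window])].append(i)
    return index


def query(index, signal, window=4):
    """Return (read_offset, reference_offset) pairs with matching keys."""
    hits = []
    for i in range(len(signal) - window + 1):
        key = hash_window(signal[i:i + window])
        for ref_pos in index.get(key, []):
            hits.append((i, ref_pos))
    return hits


# Hypothetical normalized values; the read is a noisy copy of reference[1:5].
reference = [0.10, -1.20, 0.85, 2.00, -0.40, 1.30]
noisy_read = [-1.25, 0.90, 1.95, -0.45]
hits = query(build_index(reference), noisy_read)  # [(0, 1)]
```

A uniform quantizer like this one is sensitive to values near a bucket boundary, where small noise can still change the key, so the bucket width trades tolerance to noise against the ability to tell different signals apart.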

Updated: 2023-01-24