TY - JOUR
T1 - Highly accurate assembly polishing with DeepPolisher
AU - Human Pangenome Reference Consortium
AU - Mastoras, Mira
AU - Asri, Mobin
AU - Brambrink, Lucas
AU - Hebbar, Prajna
AU - Kolesnikov, Alexey
AU - Cook, Daniel E.
AU - Nattestad, Maria
AU - Lucas, Julian
AU - Won, Taylor S.
AU - Chang, Pi Chuan
AU - Carroll, Andrew
AU - Paten, Benedict
AU - Shafin, Kishwar
AU - Tayoun, Ahmad Abou
AU - Albracht, Derek
AU - Allen, Jamie
AU - Alsheikh-Ali, Alawi A.
AU - Andrews, Casey
AU - Antipov, Dmitry
AU - Antonacci-Fulton, Lucinda
AU - Ayllon, Marcelo
AU - Balacco, Jennifer R.
AU - Belter, Edward A.
AU - Bender, Halle D.
AU - Blair, Andrew P.
AU - Buonaiuto, Silvia
AU - Bolognini, Davide
AU - Bonini, Katherine E.
AU - Boucher, Christina
AU - Bourque, Guillaume
AU - Cao, Shuo
AU - Mc Cartney, Ann M.
AU - Cechova, Monika
AU - Chang, Xian
AU - Cheema, Jitender
AU - Cheng, Haoyu
AU - Ciofi, Claudio
AU - Cody, Sarah
AU - Colonna, Vincenza
AU - Conwell, Holland C.
AU - Cook-Deegan, Robert
AU - Diekhans, Mark
AU - Diroma, Maria Angela
AU - Doerr, Daniel
AU - Dong, Zheng
AU - Durbin, Richard
AU - Ebler, Jana
AU - Eichler, Evan E.
AU - Eizenga, Jordan M.
AU - Eskandar, Parsa
AU - Ferro, Eddie
AU - Fiston-Lavier, Anna Sophie
AU - Ford, Sarah M.
AU - Ford, Willard W.
AU - Formenti, Giulio
AU - Frankish, Adam
AU - Freeberg, Mallory A.
AU - Fu, Qichen
AU - Fullerton, Stephanie M.
AU - Fulton, Robert S.
AU - Gao, Yan
AU - Garcia, Gage H.
AU - Garcia, Obed A.
AU - Gardner, Joshua M.V.
AU - Garg, Shilpa
AU - Garrison, Erik
AU - Garrison, Nanibaa A.
AU - Garza, John
AU - Ghorbani, Mohammadmersad
AU - Graves-Lindsay, Tina
AU - Green, Richard E.
AU - Groza, Cristian
AU - Guarracino, Andrea
AU - Gymrek, Melissa
AU - Haggerty, Leanne
AU - Hall, Ira M.
AU - Hansen, Nancy F.
AU - Hashmi, Mohammad Amiruddin
AU - Haeussler, Maximilian
AU - Haussler, David
AU - Heringer, Peter
AU - Hickey, Glenn
AU - Hillaker, Todd L.
AU - Hossain, S. Nakib
AU - Huang, Neng
AU - Hunt, Sarah E.
AU - Hunt, Toby
AU - Jafarzadeh, Nafiseh
AU - Jain, Nivesh
AU - Jarvis, Erich D.
AU - Jiang, Juan
AU - LoTempio, Jonathan
AU - Kenny, Eimear E.
AU - Kim, Juhyun
AU - Koo, Bonhwang
AU - Koren, Sergey
AU - Kremitzki, Milinn
AU - Langmead, Ben
AU - Zhuo, Xiaoyu
AU - Lawson, Heather A.
AU - Li, Daofeng
AU - Li, Heng
AU - Liao, Wen Wei
AU - Lin, Jiadong
AU - Liu, Tianjie
AU - Logsdon, Glennis A.
AU - Lorig-Roach, Ryan
AU - Loucks, Hailey
AU - Loveland, Jane E.
AU - Lu, Jianguo
AU - Lu, Shuangjia
AU - Lucas, Julian K.
AU - Macias-Velasco, Juan F.
AU - Marin, Maximillian G.
AU - Marsico, Franco L.
AU - Makova, Kateryna D.
AU - Markovic, Christopher
AU - Marschall, Tobias
AU - Martin, Fergal J.
AU - Mayoud, Capucine
AU - McNulty, Brandy
AU - Medico, Jack A.
AU - Menendez, Julian M.
AU - Miga, Karen H.
AU - Minkina, Anna
AU - Mitchell, Matthew W.
AU - Mohanty, Saswat K.
AU - Mokrab, Younes
AU - Monlong, Jean
AU - Moosa, Shabir
AU - Moreno-Ochando, Avelina
AU - Morishita, Shinichi
AU - Mudge, Jonathan M.
AU - Munson, Katherine M.
AU - Mwaniki, Njagi
AU - Nassir, Nasna
AU - Natali, Chiara
AU - Negi, Shloka
AU - Ni, Lingbin
AU - Novak, Adam M.
AU - Owa, Chie
AU - Paez, Sadye
AU - Clawson, Hiram
AU - Peano, Clelia
AU - Phillippy, Adam M.
AU - Pickett, Brandon D.
AU - Pignata, Laura
AU - Pisanti, Nadia
AU - Porubsky, David
AU - Prins, Pjotr
AU - Radhakrishnan, Anandi
AU - Raney, Brian J.
AU - Rautiainen, Mikko
AU - Raveane, Alessandro
AU - Ren, Luyao
AU - Rhie, Arang
AU - Salehi, Farnaz
AU - Sacco, Samuel
AU - Schatz, Michael C.
AU - Scheinfeldt, Laura B.
AU - Sehgal, Aarushi
AU - Seligmann, William E.
AU - Shabani, Mahsa
AU - Shahatit, Shadi
AU - Shemirani, Ruhollah
AU - Shivakumar, Vikram S.
AU - Sinha, Swati
AU - Sirén, Jouni
AU - Smeds, Linnéa
AU - Solar, Steven J.
AU - Sollitto, Marco
AU - Soranzo, Nicole
AU - Stergachis, Andrew B.
AU - Suner, Marie Marthe
AU - Suzuki, Yoshihiko
AU - Söylev, Arda
AU - Tierney, Jack A.S.
AU - Tomlinson, Chad
AU - Tricomi, Francesca Floriana
AU - Uddin, Mohammed
AU - Ungaro, Matteo Tommaso
AU - Varki, Rahul
AU - Villani, Flavia
AU - Vollger, Mitchell R.
AU - Walenz, Brian P.
AU - Wang, Charles
AU - Wang, Lisa E.
AU - Wang, Ting
AU - Wenger, Aaron M.
AU - Whelan, Conor V.
AU - Xin, Zilan
AU - Xu, Zheng
AU - Ye, Kai
N1 - Publisher Copyright: © 2025 Mastoras et al.
PY - 2025/7
Y1 - 2025/7
N2 - Accurate genome assemblies are essential for biological research, but even the highest-quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and underpolishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using Pacific Biosciences (PacBio) HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHAsing Reads in Areas Of Homozygosity (PHARAOH), which uses ultralong Oxford Nanopore Technologies (ONT) data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by approximately half, mostly driven by reductions in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted quality value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
UR - https://www.scopus.com/pages/publications/105010352128
U2 - 10.1101/gr.280149.124
DO - 10.1101/gr.280149.124
M3 - Article
C2 - 40389286
AN - SCOPUS:105010352128
SN - 1088-9051
VL - 35
SP - 1595
EP - 1608
JO - Genome Research
JF - Genome Research
IS - 7
ER -