e-ISSN 2231-8526
ISSN 0128-7680
Kristine Sandra Pey Adum and Hasni Arsad
Pertanika Journal of Science & Technology, Volume 30, Issue 4, October 2022
DOI: https://doi.org/10.47836/pjst.30.4.24
Keywords: Alignment, HISAT2, Novoalign, RNA-seq, Subread, TopHat
Published on: 28 September 2022
The introduction of RNA-sequencing (RNA-seq) technology into biological research has encouraged bioinformatics developers to build a variety of analysis pipelines. The choice of pipeline depends mostly on the research goals and the organism of interest, because no single pipeline is optimal for all cases. As the first step in most pipelines, alignment is crucial because it affects all downstream analysis. Each alignment tool has its own default and tunable parameter settings to maximise the output. This poses a considerable challenge for researchers, who must determine which alignment tool, with which settings, will analyse their samples accurately and efficiently. Therefore, in this study, duplicates of real HeLa RNA-seq data were used to evaluate the effects of data quality on four commonly used RNA-seq alignment tools: HISAT2, Novoalign, TopHat and Subread. The same data were also used to identify the optimal settings of each aligner for our sample. Each tool's performance was measured in terms of precision, recall, F-measure, false discovery rate, error tolerance, parameter stability, runtime and memory requirements. Our results showed significant differences between the settings of each alignment tool tested. Subread and TopHat performed best with optimised parameter settings. In contrast, HISAT2 and Novoalign performed most reliably with their default settings. Although HISAT2 was the fastest alignment tool, the highest accuracy was achieved using Novoalign with the default setting.
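The four accuracy metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions based on counts of correctly aligned, incorrectly aligned and missed reads; the abstract does not state the exact formulas used in the study, so the class name, field names and example counts below are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentCounts:
    """Hypothetical read counts for one aligner run (not data from the study)."""
    true_positives: int   # reads aligned to the correct location
    false_positives: int  # reads aligned to a wrong location
    false_negatives: int  # reads that should have aligned but did not


def precision(c: AlignmentCounts) -> float:
    """Fraction of aligned reads that are correctly aligned."""
    aligned = c.true_positives + c.false_positives
    return c.true_positives / aligned if aligned else 0.0


def recall(c: AlignmentCounts) -> float:
    """Fraction of alignable reads that were correctly aligned."""
    alignable = c.true_positives + c.false_negatives
    return c.true_positives / alignable if alignable else 0.0


def f_measure(c: AlignmentCounts) -> float:
    """Harmonic mean of precision and recall (the F1 form is assumed here)."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def false_discovery_rate(c: AlignmentCounts) -> float:
    """Fraction of aligned reads that are wrongly aligned (1 - precision)."""
    aligned = c.true_positives + c.false_positives
    return c.false_positives / aligned if aligned else 0.0


if __name__ == "__main__":
    # Made-up counts, chosen only to exercise the functions.
    example = AlignmentCounts(true_positives=90, false_positives=10, false_negatives=30)
    for metric in (precision, recall, f_measure, false_discovery_rate):
        print(f"{metric.__name__}: {metric(example):.3f}")
```

Under these definitions, precision and false discovery rate always sum to one for any run with at least one aligned read, so an aligner that reports fewer but more reliable alignments gains precision at the possible cost of recall; the F-measure summarises that trade-off in a single number.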