| Title | EPSAPG: a pipeline combining MMseqs2 and PSI-BLAST to quickly generate extensive protein sequence alignment profiles |
| Author | |
| Abstract | Many machine learning (ML) models used in protein function and structure prediction depend on evolutionary information, captured through multiple-sequence alignments (MSA) or position-specific scoring matrices (PSSM) as generated by PSI-BLAST. Consequently, these predictive methods carry substantial computational demands and long computing times. The principal challenge is that PSI-BLAST must load large sequence databases sequentially in batches and then search them for sequences similar to a given query. For batch queries, the runtime scales linearly with the number of queries. The problem grows harder as bio-sequence data repositories expand exponentially over time, which puts a proportional strain on the runtime of PSI-BLAST. A promising way to address this issue is MMseqs2, which can speed up the search by a factor of 100. However, MMseqs2 cannot be used directly to generate the final output in the desired format of PSI-BLAST alignments and PSSM profiles. In this research work, I developed a pipeline that integrates both MMseqs2 and PSI-BLAST, resulting in a robust, optimized, and efficient hybrid alignment pipeline. Notably, the hybrid tool generates sequence alignment profiles two orders of magnitude faster than PSI-BLAST. It is implemented in C++ and is freely available under the MIT license at https://github.com/issararab/EPSAPG. |
| Language | English |
| Source (book) | Proceedings of the IEEE/ACM 10th International Conference on Big Data Computing, Applications and Technologies (BDCAT'23), December 4–7, 2023, Taormina (Messina), Italy |
| Publication | 2023 |
| ISBN | 979-84-00-70473-4 |
| DOI | 10.1145/3632366.3632384 |
| Volume/pages | (2023), p. 1-9 |
| Article Reference | 03 |
| ISI | 001211861000003 |