RESEARCH ARTICLE

Tool FindCrispr: An Accurate Identification of Crisprs

Chunmei Wang1,*
The Open Bioinformatics Journal, 21 July 2025. DOI: 10.2174/0118750362401268250715144626

Abstract

Introduction

The accurate identification of repeats and clustered regularly interspaced short palindromic repeats (Crisprs) is important for studying and understanding prokaryotic immune systems.

Methods

Based on the concept of Crispr, this study constructs a feature extraction method. A model with parameters and the objective function max(α1, α2, α3, α4) is trained on 302 archaea sequences and solved. The scoring-based machine learning algorithm is implemented in Python and packaged as a tool. To evaluate the model, the Crisprs reported by findCrispr and pilerCR on the 302 archaea sequences are reviewed by manual curation. Welch's t-test is applied to the repeater copy count, the repeater length, the spacer length, and the Crispr count computed by findCrispr and pilerCR on 400 complete archaea genome sequences, 169 randomly selected bacterial genome sequences, and 26 archaea chromosome gene sequences, in addition to the 302 archaea sequences.
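The comparison step can be illustrated with a minimal sketch of Welch's t-test, which compares two sample means without assuming equal variances. The function name `welch_t` and the sample values are hypothetical and are not taken from the study; the sketch returns only the t statistic and the Welch-Satterthwaite degrees of freedom, not a p-value.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean (sample variance / n).
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical per-genome Crispr counts from two tools (illustrative only).
counts_tool_a = [3, 4, 5, 6, 7]
counts_tool_b = [1, 2, 3, 4, 5]
t_stat, dof = welch_t(counts_tool_a, counts_tool_b)
print(t_stat, dof)
```

A positive t statistic here means the first sample has the larger mean; whether the difference is significant depends on the p-value obtained from the t distribution with the computed degrees of freedom.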

Results

Based on the concept of Crispr, the length l of the repeater, the copy number m of the repeater, the starting position sequence stpt of the repeater, and the repeater sequence are used as the features of the algorithm. The model is solved to find the scoring formula. Among the exact repeat sequences with overlapping starting points, the one with the highest score is selected as the Crispr; this procedure is implemented in Python and packaged as the tool findCrispr. The tool automatically outputs a report file and visual pictures showing the Crisprs. Among the 302 archaea, findCrispr obtained the same results as pilerCR for 199, more Crisprs than pilerCR for 86, and fewer Crisprs than pilerCR for 17. Welch's t-test shows that the count of Crisprs recognized by findCrispr differs significantly from that recognized by pilerCR, with tstat > 0; for the repeater copy count, the repeater length, and the spacer length, more than 85 percent of the comparisons in each type of data show no significant difference.
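To make the features concrete, the following is a minimal sketch of how l, m, and stpt could be collected for exact repeats of a fixed length. It is an illustration under stated assumptions, not the findCrispr implementation: the function name `repeat_features`, the `min_copies` threshold, and the toy sequence are hypothetical, and the scoring formula is not reproduced here.

```python
from collections import defaultdict


def repeat_features(seq, l, min_copies=2):
    """Collect exact repeats of length l with non-overlapping start positions."""
    starts = defaultdict(list)
    for i in range(len(seq) - l + 1):
        kmer = seq[i:i + l]
        # Record a copy only if it does not overlap the previous copy.
        if not starts[kmer] or i - starts[kmer][-1] >= l:
            starts[kmer].append(i)

    features = []
    for kmer, stpt in starts.items():
        if len(stpt) >= min_copies:
            # Spacer length = gap between the end of one copy and the next start.
            spacers = [b - a - l for a, b in zip(stpt, stpt[1:])]
            features.append({
                "repeater": kmer,
                "l": l,
                "m": len(stpt),
                "stpt": stpt,
                "spacer_lengths": spacers,
            })
    return features


# Toy sequence (hypothetical): three copies of ACGTG separated by 3-base spacers.
toy = "ACGTG" + "TTT" + "ACGTG" + "CCC" + "ACGTG"
for feat in repeat_features(toy, l=5):
    print(feat)
```

In this sketch each candidate carries the repeater sequence, its length, its copy number, and its start positions, which is the information a scoring step would then rank.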

Discussion

The feature extraction method based on the concept of Crispr is determined after an in-depth analysis of Crispr features, and the number of features is greatly reduced while remaining sufficient to identify Crispr accurately. The model performs well on the 302 archaea data. The tool findCrispr identifies Crispr successfully and is easy to use, with a report file and visual pictures that present Crispr information accurately; the results also show that findCrispr can identify more Crisprs. The tool maintains robust correctness on each type of data. The algorithm is inclined to find more repeaters: it is sensitive to Crisprs with a small copy number and has a low tolerance for long scattered repeats.

Conclusion

The length l of the repeater, the copy number m of the repeater, the starting position sequence stpt of the repeater, and the repeater sequence are extracted as features. A scoring system is established and an accurate identification tool, findCrispr, is implemented; it outperforms the commonly used Crispr analysis software pilerCR in the identification of Crisprs with multiple calibration repeaters. The tool findCrispr is of great significance for studying the biological function and mechanism of Crispr. Accurate identification of Crispr and its repeat and spacer sequences is of great significance for exploring the biological mechanism of the Crispr adaptive immune system and for understanding the evolutionary significance of repeats and spacers. It also provides data support for accurate prediction in gene therapy, gene editing, gene expression regulation, and targeted clearance, and plays an important role in discovering more Cas (Crispr-associated) proteins to complement and improve the Crispr/Cas system. The tool findCrispr is easy to use, powerful, and extensible into a statistical analysis tool that processes prokaryotic gene sequence data in batches on large-scale Crispr data to identify single or multiple Crisprs.

Keywords: Crispr, Repeat, Score, findCrispr.