Colorectal cancer (CRC) is one of the leading causes of cancer-related deaths worldwide.
Early detection can significantly improve survival rates. Recent studies show that the gut microbiome plays a key role in CRC development and that microbial DNA patterns can serve as biomarkers for early diagnosis.
This project provides an end-to-end automated pipeline for CRC detection from gut microbiome DNA sequences. It consists of two main phases: automated DNA preprocessing, followed by machine learning classification.
- Traditionally, microbiome DNA preprocessing is performed with R scripts (e.g., DADA2), requiring manual steps.

- We automated this workflow entirely in Python, so the user only needs to provide the raw DNA sequences (.fasta files).
- The pipeline automatically:
  - Processes and cleans the sequences
  - Sorts the data into the required folder structure
  - Prepares a filtered dataset ready for machine learning
👉 Result: A fully automated and scalable preprocessing step, removing manual intervention.
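The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function names (`read_fasta`, `clean_records`, `preprocess`), the `min_length` threshold, the `labels` mapping from file name to class, and the `out_dir/<label>/` folder layout are all assumptions made for the example.

```python
from pathlib import Path

VALID_BASES = set("ACGT")


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def clean_records(records, min_length=50):
    """Keep sequences that are long enough and contain only A, C, G, T.

    Both criteria are hypothetical; the real pipeline may filter differently.
    """
    for header, seq in records:
        if len(seq) >= min_length and set(seq) <= VALID_BASES:
            yield header, seq


def preprocess(raw_dir, out_dir, labels, min_length=50):
    """Filter every .fasta file in raw_dir and write it to out_dir/<label>/.

    `labels` maps a file stem (e.g. "sample1") to a class folder name.
    A file with no entry in `labels` raises KeyError.
    Returns the number of records kept per file.
    """
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    kept = {}
    for fasta in sorted(raw_dir.glob("*.fasta")):
        target = out_dir / labels[fasta.stem]
        target.mkdir(parents=True, exist_ok=True)
        records = list(clean_records(read_fasta(fasta), min_length))
        with open(target / fasta.name, "w") as handle:
            for header, seq in records:
                handle.write(f">{header}\n{seq}\n")
        kept[fasta.name] = len(records)
    return kept
```

For example, `preprocess("raw", "data", {"sample1": "crc"})` would write the filtered records of `raw/sample1.fasta` to `data/crc/sample1.fasta`.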
- Using the preprocessed data, we apply machine learning models to classify samples as CRC-positive or CRC-negative.
- Steps:
  - Extract k-mers from DNA sequences
  - Vectorize k-mers into numerical features with `CountVectorizer`
  - Split dataset into train/test (80/20)
  - Train multiple models: Logistic Regression, Random Forest, SVM, KNN
  - Evaluate models using Accuracy, Precision, Recall, F1-score, and Confusion Matrices
👉 Result: Logistic Regression achieved the best performance of the four models, making it the preferred candidate for deployment.
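The classification steps listed above might look roughly like this with scikit-learn. It is a sketch under stated assumptions rather than the project's actual code: the helper names `kmers` and `evaluate_models`, the k-mer size `k=6`, the integer labels (1 for CRC-positive, 0 for CRC-negative), the random seed, and all model hyperparameters are illustrative choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def kmers(sequence, k=6):
    """Return the overlapping k-mers of a sequence as a space-separated string."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def evaluate_models(sequences, labels, k=6, seed=42):
    """Train four classifiers on k-mer counts and return their test metrics."""
    # Each sequence becomes a "document" whose words are its k-mers.
    docs = [kmers(seq, k) for seq in sequences]

    # 80/20 split, stratified so both classes appear in the test set.
    docs_train, docs_test, y_train, y_test = train_test_split(
        docs, labels, test_size=0.2, stratify=labels, random_state=seed
    )

    # Count k-mer occurrences; the vocabulary is learned from training data only.
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(docs_train)
    X_test = vectorizer.transform(docs_test)

    # Hyperparameters here are assumptions, not the project's settings.
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(),
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, pred, average="binary", pos_label=1, zero_division=0
        )
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "confusion_matrix": confusion_matrix(y_test, pred, labels=[0, 1]),
        }
    return results
```

Note that this sketch splits the data before fitting `CountVectorizer`, so the test set does not influence the vocabulary; fitting the vectorizer on the full dataset first, as the step order above suggests, would also work but leaks a small amount of information from the test set.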
- Python: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`, `biopython`
- Automation: R-to-Python pipeline integration
- Models: Logistic Regression, Random Forest, SVM, KNN
🎥 Watch the demo here: YouTube Video
Supervised by:
Soumaya Jebara (UM6SS)
Asma Amdouni (SMU)
This project demonstrates that combining automated DNA preprocessing with machine learning can provide a scalable and reliable solution for early colorectal cancer detection from microbiome data.
By automating the preprocessing pipeline and comparing multiple models, we show that Logistic Regression was the most effective of the approaches tested, a step toward clinical integration.
This project is open-source and available under the MIT License. See the LICENSE file for details.