About Me

Welcome to my professional portfolio!

I am a Doctoral Researcher at École de technologie supérieure, where I apply AI to automate software maintenance, working at the intersection of MLOps, AI safety, and software engineering. My current work investigates technical debt in LLMs: I develop neuro-symbolic tools to detect hidden bugs and security risks in AI-generated code. I combine NLP with program analysis to build QA systems for the AI era.

My background combines a strong foundation in Data Science and Machine Learning Systems from Northeastern University with a B.S. in Mathematics and Statistics from the University of Toronto.

This repository showcases my expertise in developing innovative solutions for complex, real-world challenges across Natural Language Processing, MLOps, advanced predictive modeling, and strategic AI system deployment.

Education

Doctor of Philosophy (Ph.D.) in Engineering - École de technologie supérieure (Current, Year 1)
- Supervisors: Dr. Manel Abdellatif and Dr. Taher Ghaleb
M.S. in Information Systems, Data Science and ML Systems Engineering - Northeastern University (Expected: April 2025)
B.S. in Mathematics and Statistics - University of Toronto (April 2023)

Professional Experience

Machine Learning Engineer - Vector Institute for Artificial Intelligence
- Worked on an LLM, NLP, and generative AI project, the HAPI 24/7 SMS service at Duologue Systems, incorporating agentic AI-driven conversational systems built with LangGraph and the OpenAI Agents SDK, with human-in-the-loop (HITL) auditing.
Graduate Teaching Assistant - Northeastern University
- CSYE 7380 - Theory & Practice of AI Generative Models; CSYE 7230-02 - Software Engineering

Publications & Presentations

SANER 2026 – IEEE International Conference on Software Analysis, Evolution and Reengineering
- Title: MLmisFinder: A Specification and Detection Approach of Machine Learning Service Misuses
Paper Accepted to 2025 IEEE 13th International Conference on Healthcare Informatics (ICHI)
- Title: Multidimensional Analysis of Specific Language Impairment Using Unsupervised Learning Through PCA and Clustering
Poster Presentation at the 2nd European Congress on Renewable Energy and Sustainable Development
- Title: Predictive Modelling of Renewable Energy Generation and CO2 Emissions: Insights from U.S. Electricity Sector Data (2018-2023)

Highlighted Projects

Below are some of my key projects demonstrating my skills and experience. Each project is structured to highlight the challenge, my solution, and the impact/results achieved.

1. Named Entity Recognition for Restaurant Search Queries

View the model on Hugging Face (1000+ model downloads)

Challenge: Developed an accurate Named Entity Recognition (NER) system for restaurant search queries, a critical component for enhancing search and recommendation systems.

Solution: Fine-tuned a DistilBERT model leveraging transfer learning to accurately extract structured information (ratings, cuisines, locations, amenities) from free-form text.

Impact & Results:

Achieved robust performance with a Precision of 0.766, Recall of 0.803, F1-Score of 0.784, and Accuracy of 0.916 on the MIT Restaurant Search NER dataset.
Successfully deployed and hosted the model on Hugging Face Model Hub, resulting in 1000+ model downloads, demonstrating real-world applicability and community value.
Demonstrates expertise in NLP, deep learning, domain-specific problem-solving, and MLOps (model deployment).
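
As a quick consistency check on the metrics above, the F1-score is the harmonic mean of precision and recall. The snippet below is a minimal sketch, not the project's evaluation code; the function name `f1_score` is illustrative and only the reported precision and recall are taken from the results.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall from the results above.
reported_f1 = f1_score(0.766, 0.803)
print(round(reported_f1, 3))  # 0.784
```

The recomputed value matches the reported F1-Score of 0.784.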


2. Hybrid Graph Neural Network for Financial Fraud Detection

Challenge: Built a production-scale fraud detection system to identify fraudulent transactions in a massive financial dataset, the IEEE-CIS Fraud Detection dataset from Kaggle. Processed 590,540 transactions with extreme class imbalance (3.5% fraud rate), requiring both tabular feature learning and complex network relationship modeling to surpass industry-standard gradient boosting methods.

Solution: Developed an innovative Hybrid Graph Neural Network combining GraphSAGE layers with deep tabular networks and cross-attention fusion. Engineered 200+ advanced features including temporal patterns, network connectivity metrics, and multi-dimensional risk scoring. Implemented memory-optimized graph construction handling 1.5M+ edges with fraud-aware weighting.

Impact & Results:

Achieved 86.18% PR-AUC, beating the LightGBM baseline (84.03%) by 2.15 percentage points (about 2.6% relative), a significant improvement in fraud detection
Production-ready system processing 590K+ transactions with optimized memory usage (<15GB GPU)
Technologies: Python, PyTorch, PyTorch Geometric, GraphSAGE, LightGBM, CUDA Optimization, Advanced Feature Engineering
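
For readers unfamiliar with PR-AUC, the sketch below computes average precision, one common estimator of the area under the precision-recall curve, on a tiny made-up example. It is an illustration only: the source does not say which estimator the project used, the labels and scores are hypothetical, and tied scores are not handled specially. The last lines recompute the gap between the two reported PR-AUC values.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k taken at each rank k where a positive appears."""
    # Rank examples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` examples
    return total / hits if hits else 0.0


# Hypothetical toy data: 1 = fraud, 0 = legitimate.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
print(round(average_precision(toy_labels, toy_scores), 4))  # 0.8333

# Gap between the reported hybrid GNN and LightGBM PR-AUC values.
absolute_gap = 86.18 - 84.03
print(round(absolute_gap, 2))                # 2.15 (percentage points)
print(round(100 * absolute_gap / 84.03, 1))  # 2.6 (percent, relative)
```

Unlike ROC-AUC, this metric ignores true negatives, which is why it is often preferred when only about 3.5% of examples are positive.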

3. California Renewable Energy Forecasting & Emissions Optimization System

Challenge: Developed a comprehensive system to forecast renewable energy generation and optimize energy mix for CO2 emission minimization in California, addressing grid stability concerns.

Solution: Engineered a robust ETL pipeline for 43,800 hourly observations (2018-2023) from EIA’s Grid Monitor. Developed predictive models and a linear programming optimization framework (using PuLP) to balance renewable integration with emissions reduction.

Impact & Results:

Achieved 97% accuracy in renewable generation prediction.
Identified potential for a 30% reduction in CO2 emissions through optimized energy mix.
Demonstrated feasibility of renewable integration with a stability correlation of 0.07.
Provided actionable recommendations leading to a 13.29% increase in renewable energy share.
Processed 5 years of hourly data (43,800 observations), showcasing scalability and data engineering prowess.

Technical Stack: Python, SQL, Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, PuLP, Git. Data Source: U.S. Energy Information Administration (EIA) “Hourly Electric Grid Monitor” dataset.

4. Climate Change Chatbot with RAG

🏆 2nd Place Winner - Climate Resiliency Hackathon 2024 (400+ participants, 10 Northeastern University campuses across North America)

Challenge: Developed a sophisticated information retrieval and natural language processing system to enable accurate semantic search, context-aware document retrieval, and real-time information validation across vast climate science datasets, specifically focused on Canada.

Solution: Engineered a robust document processing pipeline for diverse sources (IPCC Reports, ECCC Climate Data, University Research Papers). Implemented advanced text preprocessing, custom tokenization, and domain-specific entity recognition, achieving 95% retrieval accuracy for relevant documents. The system uses a Retrieval-Augmented Generation (RAG) architecture with a vector database to provide precise, data-driven LLM responses.

Impact & Results:

Awarded 2nd place in a highly competitive hackathon, demonstrating innovation and effectiveness.
Enabled 95% retrieval accuracy for relevant documents across massive climate datasets.
Makes complex climate knowledge accessible and actionable by providing precise, data-augmented LLM responses.
Showcases expertise in NLP, information retrieval, RAG architectures, and handling large, diverse datasets.
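
The retrieve-then-generate flow of a RAG system can be sketched in a few lines. This is a toy illustration under stated assumptions, not the hackathon implementation: word-count vectors stand in for a real embedding model, a Python list stands in for the vector database, and the function names and placeholder passages are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Assemble the retrieved context and the question into an LLM prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical placeholder passages, not real corpus content.
docs = [
    "placeholder passage about permafrost thaw",
    "placeholder passage about wildfire seasons",
    "placeholder passage about coastal flooding",
]
print(build_prompt("permafrost thaw", docs, k=1))
```

In a real pipeline the prompt returned by `build_prompt` would be sent to the LLM, so the answer is grounded in the retrieved passages.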

5. MAHD: Conservative Multi-Agent System for Contextual Hateful Meme Detection Using GPT-4

Project Overview: Developed MAHD (Multi-Agent Hate Detection), a novel dual-agent system built on GPT-4 for robust hateful meme detection. MAHD employs a conservative classification approach, achieving high precision while effectively capturing subtle forms of harmful content.

Key Features & Impact:

Dual-agent architecture for comprehensive content analysis, enhancing detection capabilities.
Conservative classification protocol with strict calibration, leading to trustworthy moderation decisions.
Achieved 81.5% Overall Accuracy, with a 93.02% Recall Rate.
Demonstrated high accuracy in specific areas: 94% in Explicit Hate Speech Detection and 91% in Identifying Calls to Violence.
Provides detailed explanation generation for moderation decisions, fostering transparency and accountability.

6. Multivariate Analysis of Language Impairment Patterns Using PCA and Clustering

Project Overview: Applied advanced data science techniques (PCA and K-means clustering) to analyze patterns in language impairment using a dataset of 1,163 participants with 64 linguistic features.

Key Contributions & Impact:

Reduced 64 dimensions to 14 significant components via PCA, explaining 83.55% of total variance, revealing core structures in language development.
Identified two distinct natural groupings in language development patterns using K-means clustering.
Demonstrated robust cluster formation with strong silhouette scores (0.380-0.460) and 96.5% consistency across different PC space combinations.
Contributes to a better understanding of natural language development, potentially improving early diagnosis of language disorders.
Showcases expertise in dimensionality reduction, unsupervised learning, statistical validation, and data visualization for complex biomedical data.
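
Choosing how many principal components to keep usually comes down to a cumulative explained-variance threshold. The sketch below shows that selection rule on made-up eigenvalues; it is an assumption about the general procedure, not the study's code, and the values are not the study's data.

```python
from typing import Sequence, Tuple


def components_for_variance(
    eigenvalues: Sequence[float], target: float
) -> Tuple[int, float]:
    """Smallest number of components whose cumulative variance ratio >= target."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    # Components are taken in order of decreasing variance.
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= target:
            return count, cumulative / total
    return len(eigenvalues), 1.0


# Hypothetical eigenvalues of a covariance matrix (illustrative only).
toy_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
print(components_for_variance(toy_eigenvalues, target=0.8))  # (3, 0.875)
```

The retained component scores are what a clustering step such as K-means would then operate on.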

7. Maternal Health Risk Prediction (Course Project)

Project Context: Developed a machine learning system to identify high-risk pregnancies in rural Bangladesh during a graduate-level Data Science course (Prof. Junwei Huang), addressing critical healthcare challenges in resource-limited settings with incomplete and imbalanced data.

Challenges Addressed:

Highly imbalanced medical data: Addressed rare high-risk cases (15% of dataset).
Missing data points: Handled 30% incomplete records effectively.
Limited feature availability in rural settings.

Solution & Impact:

Developed a novel ensemble architecture that outperformed standard methods (Gradient Boosting, K-Nearest Neighbors) by 10.5% in precision, 9.8% in recall, and 11% in F1-score on average.
Achieved 92% accuracy in identifying high-risk cases, demonstrating significant potential for improving maternal health outcomes in vulnerable communities.
Showcases ability to adapt ML techniques to real-world data challenges and apply technology for social impact.

8. Direct Preference Optimization (DPO)

Project Overview: Focused on generating a preference dataset using PairRM and fine-tuning the Mistral-7B-Instruct model with Direct Preference Optimization (DPO), which trains the model directly on pairs of preferred and rejected completions.

Key Contributions & Learning:

Successfully fine-tuned Mistral-7B-Instruct using DPO, eliminating the need for a separate reward model.
Demonstrated the effectiveness of DPO by generating and comparing completions from the original and DPO-tuned models across 10 unseen instructions.
Showcases hands-on experience with advanced LLM fine-tuning techniques and understanding of preference-based optimization.
Gained practical knowledge in assessing LLM performance improvements from fine-tuning processes.
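
The reason DPO needs no separate reward model is visible in its loss: the implicit reward is the scaled log-probability ratio between the policy and a frozen reference model. Below is a minimal single-pair sketch of that loss. The log-probabilities are made-up numbers, `beta=0.1` is an illustrative choice rather than the project's setting, and a real training run would compute these values from model outputs over batches.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair: -log(sigmoid(margin))."""
    # Implicit rewards: how much more likely the policy makes each
    # completion than the frozen reference model does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    return softplus(-margin)  # equals -log(sigmoid(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
print(round(dpo_loss(-12.0, -12.0, -12.0, -12.0), 4))  # 0.6931

# Policy favors the chosen completion more than the reference does.
print(round(dpo_loss(-10.0, -14.0, -12.0, -12.0), 4))  # 0.513
```

The loss drops below log(2) as soon as the policy separates the chosen from the rejected completion more than the reference model does.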

9. Web Scraping Project: Financial Data Collection from Yahoo Finance

Technologies Used: Python, BeautifulSoup4, Pandas, Requests

Project Overview: Developed an automated web scraping system to collect comprehensive financial metrics from Yahoo Finance for major S&P500 companies (e.g., Apple, Google, Microsoft).

Key Features & Impact:

Automated data collection pipeline: Designed functions to parse HTML, construct dynamic URLs, and manage rate-limiting, ensuring efficient data acquisition.
Robust data processing: Organized raw data into structured CSV formats, implementing custom scripts for flattening and validation.
Comprehensive data output: Successfully extracted critical financial metrics including balance sheets, income statements, cash flow statements, and management effectiveness metrics (e.g., ROE, ROA, Profit Margins, YoY Revenue Growth, P/E Ratio, Market Capitalization).
Demonstrates strong skills in data acquisition, parsing, cleaning, and structuring, essential for data science and financial analysis roles.

10. Sales Analysis Dashboard in Power BI

Project Overview: Developed a comprehensive Power BI dashboard for FY21 sales analysis, tracking and visualizing key performance metrics.

Key Features & Impact:

Created interactive visualizations (bar charts, gauge charts, and summary tables) to monitor revenue, targets, and segment performance.
Implemented dynamic filters and slicers for detailed data analysis across various dimensions (segment, industry, product, etc.).
Provided actionable insights by comparing revenue against marketing spend, directly supporting strategic business decisions.
Utilized advanced DAX functions and Power Query for robust data transformation and modeling.
Showcases strong business intelligence, data visualization, and data storytelling skills.

Key Technologies: Power BI, DAX, Power Query, SQL, Excel