Bio. I am an enthusiastic, intellectually curious, data-driven, solution-oriented Data Scientist and continuous learner with problem-solving strengths and newly acquired skills in machine learning and data analysis. I am an active hands-on learner with a passion for growing within my expertise and creating meaningful, impactful work using new data science and machine learning techniques. I am a motivated person with excellent leadership & communication skills. I always have a positive mindset and I am looking to gain valuable experience in data science. My objectives are innovation and meaningful contribution to society. I am a big fan of Bayesian statistics and data visualization using ggplot. I enjoy designing complex statistical and algorithmic solutions to problems using machine learning.

During my graduate studies I conducted research in Prof. Olga Baysal’s Software Analytics lab. My Master’s thesis focused on how to define and quantify the expertise of software developers based on publicly available data from GitHub and Stack Overflow. The title of my thesis was "Cross-Platform Software Developer Expertise Learning", and I successfully defended it on April 21st, 2020. In my thesis, I worked with LDA topic models, which gave me in-depth knowledge of Bayesian approaches. The main application of my graduate work is recruiting the right candidate by defining someone's area of expertise based on their user activity on collaborative platforms. I graduated from Carleton University's Data Science program in June 2020 with a Master of Computer Science Specialization in Data Science degree.

Currently I am actively looking for a full-time data scientist position. My key technical skills include, but are not limited to, Python, R, SQL, Java, report writing, traditional machine learning, deep learning, data mining, data analysis, data collection, cleaning, and wrangling, data visualization, statistical modeling, EDA, handling unstructured data, and algorithm and experiment design.

My interests include, but are not limited to, machine learning, healthcare applications of machine learning (e.g. medical imaging), deep learning, time series forecasting, outlier detection, text analytics, data mining, sports analytics, real-time data analytics and simulations, motorsports data analytics & race strategy, conveying useful information through dashboard visualizations, and applications of computer vision and natural language processing.

June 19th, 2020: Graduated with the Master of Computer Science Specialization in Data Science degree from Carleton University in Ottawa, Canada. See my LinkedIn post about it.
April 21st, 2020: Successfully defended my Master's thesis titled Cross-Platform Software Developer Expertise Learning at Carleton University in Ottawa, Canada.
September 2018 - April 2020: Graduate Research and Teaching Assistant at Carleton University
Master's thesis on mining Stack Overflow and GitHub, creating a novel approach to cross-platform software developer expertise learning
September 2019: Featured in Carleton University's Eureka! magazine:
A LinkedIn post about my article can be found here. This article was featured on Carleton University's Instagram and LinkedIn page as well
May 2019 - August 2019: Data Scientist Intern at National Research Council Canada:
Worked in NRC's Data Analytics Center in Ottawa and completed a 4-month contract for a government client
September 2018 - June 2020: Carleton University - M.Sc. in Computer Science with Specialization in Data Science:
Data Mining, Machine Learning, NLP, Deep Learning and Empirical Software Engineering. Adviser: Prof. Olga Baysal
May 2017 - August 2017: Undergraduate Researcher at University of British Columbia Okanagan:
Received an Undergraduate Research Award and worked on a modern approach to feature-based opinion mining using word embeddings
September 2015 - April 2018: Undergraduate Teaching Assistant at University of British Columbia Okanagan:
Helped students apply concepts taught in lectures via hands-on programming
September 2014 - June 2018: University of British Columbia Okanagan - B.Sc. Honours in Computer Science, Minor in Data Science
Completed an Undergraduate Thesis Data Science project under the supervision of Prof. Abdallah Mohamed and Prof. Jeffrey Andrews

Data Science Professional Development (2020)

SMOTE and ADASYN for Imbalanced Data
May 2020. Learning about oversampling and undersampling techniques for combatting imbalanced data sets
Framework for Imbalanced Data Classification
May 2020. Learning about a systematic step-by-step framework for imbalanced classification projects
Data Science Interview - 2 hour coding challenge
July 2020. Practicing data cleansing, EDA, and data modeling using Pandas, NumPy, Scikit-Learn and Matplotlib
Reviewing t-tests
July 2020. Practising how to perform t-tests in Python using SciPy
Reviewing Statsmodels API
July 2020. Learning how to use the statsmodels library in Python
Reviewing Descriptive Statistics
July 2020. Practicing how to get the basic descriptive statistics of any data
Reviewing Matplotlib API
August 2020. Practising how to visualize data using the Matplotlib library in Python
Reviewing Scikit-Learn API
August 2020. Practising how to fit various models using the Scikit-Learn library in Python
Reviewing Pandas API
August 2020. Practicing how to do basic data transformations using the Pandas library in Python
Time Series Analysis in Python
August 2020. Practising visualization of time series; how to detrend; how to test for seasonality; how to deseasonalize a time series
Time Series Analysis & Forecasting
August 2020. Learning about AR, MA, ARMA, ARIMA, SARIMA models by applying them to the minimum daily temperatures data set
Time Series Forecasting with SARIMA models
August 2020. Learning how to do grid search for SARIMA model's hyper-parameters for time series forecasting
Practising K-Means Cluster Analysis
September 2020. Practising how to perform cluster analysis and determine the number of clusters in a data set using the Scikit-Learn library in Python
Practising Unsupervised Feature Selection
September 2020. Practising how to perform feature selection using the Scikit-Learn library in Python
Practising Unsupervised Outlier Detection
September 2020. Practising how to perform unsupervised outlier detection using the Scikit-Learn library in Python
Performing Cluster Analysis on Spotify songs
June 2020. Played around a bit with analyzing my Spotify profile and clustered some of the songs that I like with what two other users liked.
Bias-Variance Decomposition
September 2020. Practising the calculation of bias and variance of various machine learning models for classification and forecasting tasks
Data Science Interview - take home assignment
September 2020. Practicing algorithm design, EDA, and data analysis using Pandas, NumPy, Scikit-Learn and Matplotlib

Coursera Courses (2020)

Academic Research Projects (2017 - 2020)

Master's Thesis: Cross-Platform Software Developer Expertise Learning
In today's world, software development is a competitive field. Being an expert gives software engineers opportunities to find better, higher-paying jobs. Recruiters are always searching for the right talent, but it is difficult to determine the expertise of a developer only from reviewing their resume. To solve this problem, expertise detection algorithms are needed. A few problems arise when expertise is put into application: how can developer expertise be defined, measured, extracted or even learnt? Our work attempts to provide recruiters with a data-driven alternative to reading the candidate's CV or resume. In this thesis, we propose three novel, robust, data-driven techniques for expertise learning, all based on topic modeling. Our extensive analysis of cross-platform developer expertise suggests that using multiple collaborative platforms is the optimal path towards gaining more knowledge and becoming an expert, as cross-platform expertise tends to be more diverse, thus creating opportunities for more effective learning by collaboration.
Eke, Norbert
Defended April 21st, 2020
Exploring the Evolution of Stack Overflow Discussions Using Sentimental Analysis on Comments
Stack Overflow (SO) is a popular Q&A forum for software developers, providing a large amount of discussion in the form of posts and their comments. SO posts evolve over time, both in text and code snippets, and so does the discussion associated with them. In this paper, we investigate the evolution of SO posts with respect to SO discussions, a factor usually ignored in techniques that aim to determine the relevance of a post for a particular objective. To accomplish our goal, we mine the SOTorrent data set, which provides the version history of posts and comments along with a timeline. We then study the characteristics of discussions, in the form of comments, with respect to the evolution timeline of a post. Our results demonstrate that, on average, the sentiment trend favors positive sentiment as posts become more stable over time, indicating more approval from the SO community in the comment section.
Eke, Norbert and Manes, Saraj Singh
Linking Stack Overflow and Github Public Data for Mining Purposes
Developer expertise learning and recommendation is the task of defining and quantifying the expertise areas and levels of developers, then creating a top-n ranking of the developers who are most qualified to perform a task. A software repository mining approach to this task would allow the creation of a developer expertise profile consisting of topical expertise and interest distributions learned from Stack Overflow and GitHub public data. This project addresses building a database consisting of Stack Overflow and GitHub public data, then linking them together based on a common attribute.
Eke, Norbert
Anomaly detection with Generative Adversarial Networks and text patches
In this research work, the possibility of adapting image-based anomaly detection to text-based anomaly detection was explored. Two main approaches are proposed, namely anomaly detection as a classification task and unsupervised anomaly detection using text patches. Both approaches explore the use of generative adversarial networks to perform anomaly detection, and the results presented show that this can be fruitful.
Eke, Norbert and Drozdyuk, Andriy
Honours Thesis: Identification and Classification of Sexual Predatory Behavior in Online Chat-Room Environments
According to the Crimes Against Children Research Center, one in five U.S. teenagers who regularly use the Internet have received an unwanted sexual solicitation via the web. There is an increasing danger in online environments such as chat-rooms, where predatory behaviour is more and more frequent, creating an unsafe environment for minors. This project aims to design an approach for online communities to enhance their members' safety by detecting malicious conversations of a sexual nature. This project joins the powers of computational linguistics with statistical machine learning to decipher the insight lying in conversations, then make predictions on whether or not a specific conversation should be flagged for containing sexual predatory behaviour. The contribution of this novel approach is twofold: firstly, the approach is able to capture contextual details by putting an emphasis on the insight that lies within the conversation, and secondly, it contains a two-stage classification system, which is highly flexible and customizable for detecting and classifying other malicious textual data.
Eke, Norbert and Mohamed, Abdallah and Andrews, Jeffrey
Feature Based Opinion Mining: A Modern Approach
In a world where customers can buy products with a few clicks online, future customers must consider the opinions and satisfaction levels of previous customers. In order to allow one to understand what previous customers have said, the design of an automated technique that summarizes opinions of thousands of customers is desirable. A promising technique has been developed that combines continuous vector representation models, natural language processing techniques and statistical machine learning models. This technique has been tested on labelled datasets and it extracts over 80% of opinions correctly. Future research can focus on improving the technique's limitations on edge cases.
Eke, Norbert and Andrews, Jeffrey and Mohamed, Abdallah

Academic Course Projects (2015-2019)

Early Data Science Work
December 2017. A repository dedicated to showcasing my Data Science research from 2015 to 2017 summarized into one portfolio document
Exploratory Topic Modeling
June 2016. Exploratory project in Deep Learning and Topic Modelling combined with Natural Language Processing in order to find topics within textual data.
Formula 1 Fan Forum
March 2017. Forum-type client-side and server-side web development using a database.
Text Entailment and Semantic Relatedness
March 2019. In this project deep neural networks were used to solve the task of Text Entailment and Semantic Relatedness.
Consulting for Statistical Society of Canada
December 2017. Consulting project to create a better, data-driven conference schedule for the Statistical Society of Canada.
Scenic Route Generator for Touristic Attractions
October 2016. In this project we designed a local tourism route creator based on attractions within the Okanagan valley.
Data Collector
May-August 2015. Implemented my own scraping algorithm for
Chain-HashMap Implementation
August 2016. Implementation of a Chain HashMap with the help of a Data Structure textbook.
Skip List Implementation
August 2016. Implementation of a Skip List using a SortedMap
Software Eng. Capstone
September 2017-April 2018. Undergraduate Capstone Project: A torrent-based video sharing service backend and web application front end for use with Raspberry Pis
Road Line & Sign Detection
April 2017. In this project we applied famous object detection algorithms such as SIFT and the Hough transform to detect road lines and signs
Spam Email Detection
April 2016. Spam email detection in R using statistical machine learning
Academic Project Proposals and Reports
March 2019. A repository dedicated to showcasing my latest academic projects and thesis work during my Master's degree
Game Of the Amazons AI
April 2017. We built a state space based AI agent capable of playing the Game Of the Amazons using a heuristic function and the Minimax algorithm
Book Citation Generator
April 2017. We designed a digital library and book citation generator using a database

© Norbert Eke
Design adapted from Ekaba Bisong