Pitch-Level Data Mining: A Case Study on Gerrit Cole
I use pitch-level data from pybaseball to predict innings in which Cole may allow one or more runs. I first explored the data to uncover Cole's key tendencies, such as his pitch preferences against left-handed versus right-handed batters, and engineered these into features for a random forest classifier that was 83% accurate in predicting whether or not an inning resulted in a run scoring. This could be useful for coaching, mid-game strategy, and fantasy baseball applications. With that use case in mind, I tried to maximize recall so that as many 1+ run innings as possible are flagged correctly, erring on the side of caution. I also wanted to engineer features that capture psychological predictors to explain mid-game tilt or poor mechanical control, and I plan to pursue this research further in future projects.
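To illustrate what favoring recall can look like in practice, here is a minimal Python sketch. It is not the project's actual code: the labels and predicted probabilities are made up, the helper name recall_precision is hypothetical, and lowering the decision threshold is only one assumed way to trade precision for recall.

```python
def recall_precision(y_true, y_prob, threshold):
    """Return (recall, precision) for the positive class at a probability threshold."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and truth:
            tp += 1          # flagged inning that really had a run
        elif predicted and not truth:
            fp += 1          # flagged inning that stayed scoreless
        elif truth:
            fn += 1          # missed inning that had a run
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Made-up data: 1 = inning with 1+ runs, plus made-up model probabilities.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_prob = [0.62, 0.41, 0.15, 0.38, 0.55, 0.71, 0.08, 0.33]

default_recall, default_precision = recall_precision(y_true, y_prob, 0.50)
cautious_recall, cautious_precision = recall_precision(y_true, y_prob, 0.35)

print(f"threshold 0.50: recall={default_recall:.2f}, precision={default_precision:.2f}")
print(f"threshold 0.35: recall={cautious_recall:.2f}, precision={cautious_precision:.2f}")
```

On this toy data the lower threshold catches every run-scoring inning at the cost of more false alarms, which is the "err on the side of caution" trade-off described above.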

This graph shows Cole's probability of missing the strike zone by fatigue level and runners on base. Fatigue level is based on pitch count and defined as:
early = fewer than 30 pitches, mid = 30 to 60 pitches, and late = more than 60 pitches.
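The fatigue buckets above can be sketched in a few lines of Python. This is an illustration only: the function names and the (pitch_count, runners_on, in_zone) record layout are hypothetical, the sample pitches are made up, and a pitch count of exactly 30 or 60 is assumed to fall in the mid bucket.

```python
from collections import defaultdict


def fatigue_level(pitch_count):
    """Map a running pitch count to the early / mid / late buckets."""
    if pitch_count < 30:
        return "early"
    if pitch_count <= 60:
        return "mid"
    return "late"


def miss_rate_by_group(pitches):
    """Share of pitches outside the zone for each (fatigue level, runners on) group."""
    totals = defaultdict(lambda: [0, 0])  # key -> [misses, total pitches]
    for pitch_count, runners_on, in_zone in pitches:
        key = (fatigue_level(pitch_count), runners_on)
        totals[key][0] += 0 if in_zone else 1
        totals[key][1] += 1
    return {key: misses / n for key, (misses, n) in totals.items()}


# Made-up records: (pitch_count, runners_on, in_zone)
sample = [
    (5, False, True), (12, False, False), (29, True, True), (30, True, False),
    (45, False, True), (60, True, False), (61, False, False), (88, False, True),
]

rates = miss_rate_by_group(sample)
for key in sorted(rates):
    print(key, round(rates[key], 2))
```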
Regression Model in R to Determine the Effect of Various Life Stresses on Depression in Adults
Used stepwise regression techniques in R to analyze the relationship between stressful life experiences and depression, focusing on genetic predisposition related to the serotonin transporter gene (5-HTT). The analysis confirmed statistically significant associations between environmental stressors (E1, E3, E4), genetic markers (G3, G8), and the depression outcome, suggesting that individuals carrying the short allele of the 5-HTT polymorphism are more prone to depression. Project from Stony Brook University.
Single Predictor Linear Regression in R
Given a dataset in R, I merged files containing multiple variables, handled and imputed missing data, transformed the data to fit a linear regression, and applied an approximate lack-of-fit test. I compared the results of the exponential model (DV = ln(y)), the quadratic model (DV = sqrt(y)), the reciprocal model (DV = 1/y), the logarithmic model (IV = ln(x)), and the power model (DV = ln(y), IV = ln(x)) and found the power model to be the best fit, with the highest R-squared value. Project from Stony Brook University.
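The model comparison can be sketched as follows. The original analysis was done in R; this is a Python illustration on made-up data generated from a power law, and the names r_squared and MODELS are hypothetical. For a single predictor, R-squared equals the squared correlation between the transformed IV and DV, which is what the helper computes.

```python
import math


def r_squared(xs, ys):
    """R-squared of a one-predictor least-squares fit (squared correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# Each model is a pair of transforms: (applied to IV x, applied to DV y).
MODELS = {
    "exponential": (lambda x: x, math.log),
    "quadratic": (lambda x: x, math.sqrt),
    "reciprocal": (lambda x: x, lambda y: 1 / y),
    "logarithmic": (math.log, lambda y: y),
    "power": (math.log, math.log),
}

# Made-up data that follows y = 2 * x^1.5 exactly, so the power model should win.
xs = [1, 2, 3, 4, 5, 6]
ys = [2 * x ** 1.5 for x in xs]

scores = {
    name: r_squared([fx(x) for x in xs], [fy(y) for y in ys])
    for name, (fx, fy) in MODELS.items()
}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:12s} R^2 = {score:.4f}")
print("best:", best)
```

Each R-squared here is computed on that model's transformed scale, matching the comparison described above.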
Archived Projects
Old project files including machine learning, visualization analysis, & database formation.