Model that solves Hypothesis Tests based on an inserted word problem¶
The goal is to build a model that can take in hypothesis-test word problems and solve them using the correct setup.
To Do list:¶
1) Create a dataset of hypothesis testing word problems to train the model on all chosen types of tests¶
Train it on many similar questions with different numbers; for example, vary n around 30 often enough that the model learns the threshold for choosing a T test.
- The current dataset contains only 20 problems.
2) Pick an NLP classifier (or several, taking the majority vote when choosing a test) and train it on the dataset we create¶
What the NLP component needs to do: test-type prediction, alternative-hypothesis direction, pooled-variance decisions, and alpha detection.
- Classification: given a word problem, decide whether it's a 1-prop Z test, chi-square test, 2-sample T test, etc.
- Classification: determine the direction of the test from the alternative hypothesis: greater, less, or two-sided?
- Determine whether pooled variance is needed and which Z or T test to use; the model can learn from phrases like "assume equal variances."
- Extract alpha (the significance level). The model has to detect the different ways alpha can be phrased ("significant at ...", "90% confidence", etc.); a rule-based sketch appears at the end of this section.
We can hard-code core statistical rules instead of relying purely on machine learning: when the test type is "mean" and n < 30, use a T test; when sigma is known, use a Z test; if the general test type is "variance", use the chi-square variance test; if the hypothesis is a variance ratio, use an F test. Balancing ML against deterministic logic protects against our lack of data (since we are writing the dataset ourselves).
- We used TfidfVectorizer to vectorize the problem text and logistic regression for the classification.
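Alpha detection in particular lends itself to the rule-based side of that balance. The sketch below is illustrative only (detect_alpha and its phrase list are ours, not part of the trained pipeline) and shows how a few common phrasings could map to a significance level:

import re

def detect_alpha(text, default=0.05):
    """Illustrative rule-based alpha detection (not the trained model)."""
    text = text.lower()
    # "at 0.10 level of significance", "significance level of 0.05", "alpha = 0.01"
    m = re.search(r'(0?\.\d+)\s*(?:level of significance|significance level)', text)
    if m:
        return float(m.group(1))
    m = re.search(r'(?:significance level|alpha)\D*(0?\.\d+)', text)
    if m:
        return float(m.group(1))
    # "90% confidence" -> alpha = 0.10
    m = re.search(r'(\d{1,2})\s*%\s*confidence', text)
    if m:
        return round(1 - int(m.group(1)) / 100, 4)
    return default

print(detect_alpha("... at 0.10 level of significance?"))  # 0.1
print(detect_alpha("... with 90% confidence."))            # 0.1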
3) Extract key values from the word problems¶
Regex is too fragile to rely on, so we train the model to extract the values from the dataset instead. This led to some issues, and we will need a revised approach in the future.
- The model predicts every value column in our data sheet, even when a value is not given in the word problem. To fix this, we could train a separate value model for each test type and route to it with the test-type classifier; a sketch of this idea follows.
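A minimal sketch of the per-test-type idea, assuming the same df, Problem_Text and Test_Type columns, and numerical_fields used in the training code below (train_per_type_value_models and predict_values are hypothetical helper names, not part of the current notebook):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline

def train_per_type_value_models(df, numerical_fields):
    # One TF-IDF + regression pipeline per test type, trained only on that type's rows
    models = {}
    for test_type, group in df.groupby('Test_Type'):
        pipe = Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('reg', MultiOutputRegressor(LinearRegression()))
        ])
        pipe.fit(group['Problem_Text'], group[numerical_fields])
        models[test_type] = pipe
    return models

def predict_values(problem_text, predicted_type, per_type_models, numerical_fields):
    # Route to the value model for the predicted test type; empty dict if the type is unseen
    model = per_type_models.get(predicted_type)
    if model is None:
        return {}
    return dict(zip(numerical_fields, model.predict([problem_text])[0]))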
4) Build a rule-based Hypothesis builder¶
Take the previous results, map greater-than, less-than, and two-sided relations to symbols, and output the null and alternative hypotheses.
5) Define math functions to correctly perform the chosen test (choosing the right test is a separate step)¶
1-sample Z for mean ✔️ 1-sample T for mean ✔️ 2-sample T (Pooled) ✔️ 2-sample T (Welch’s) ✔️ 1-prop Z ✔️ 2-prop Z ✔️ Chi-square variance ✔️ F-test (var ratio) ✔️ Chi-square independence ✔️ Chi-square goodness of fit ✔️
- For simplicity, we did not include chi-square independence or goodness-of-fit problems in our dataset.
6) create a "final report" generator###¶
Note: the data for two-tailed tests uses language like "test if the true mean is 81, vs. not 81".
Math Functions Defined:¶
import scipy.stats as stats
import numpy as np
def one_sample_z_test_mean(x_bar, mu_0, sigma, n, alternative='two-sided', alpha=0.05):
z = (x_bar - mu_0) / (sigma / np.sqrt(n))
if alternative == 'two-sided':
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
elif alternative == 'greater':
p_value = 1 - stats.norm.cdf(z)
else: # 'less'
p_value = stats.norm.cdf(z)
conclusion = 'Reject H0' if p_value < alpha else 'Fail to Reject H0'
return z, p_value, conclusion
def one_sample_t_test_mean(x_bar, mu_0, s, n, alternative='two-sided', alpha=0.05):
t = (x_bar - mu_0) / (s / np.sqrt(n))
df = n - 1
if alternative == 'two-sided':
p_value = 2 * (1 - stats.t.cdf(abs(t), df))
elif alternative == 'greater':
p_value = 1 - stats.t.cdf(t, df)
else: # 'less'
p_value = stats.t.cdf(t, df)
conclusion = 'Reject H0' if p_value < alpha else 'Fail to Reject H0'
return t, p_value, conclusion
def one_proportion_z_test(x, n, p0, alternative='two-sided', alpha=0.05):
p_hat = x / n
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
if alternative == 'two-sided':
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
elif alternative == 'greater':
p_value = 1 - stats.norm.cdf(z)
else: # 'less'
p_value = stats.norm.cdf(z)
conclusion = 'Reject H0' if p_value < alpha else 'Fail to Reject H0'
return z, p_value, conclusion
def two_proportion_z_test(x1, n1, x2, n2, alternative='two-sided', alpha=0.05):
p1 = x1 / n1
p2 = x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
if alternative == 'two-sided':
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
elif alternative == 'greater':
p_value = 1 - stats.norm.cdf(z)
else: # 'less'
p_value = stats.norm.cdf(z)
conclusion = 'Reject H0' if p_value < alpha else 'Fail to Reject H0'
return z, p_value, conclusion
def two_sample_t_test_pooled(x1_bar, x2_bar, s1, s2, n1, n2, alternative='two-sided', alpha=0.05):
s_pooled = np.sqrt(((n1 - 1)*s1**2 + (n2 - 1)*s2**2) / (n1 + n2 - 2))
t = (x1_bar - x2_bar) / (s_pooled * np.sqrt(1/n1 + 1/n2))
df = n1 + n2 - 2
if alternative == 'two-sided':
p_value = 2 * (1 - stats.t.cdf(abs(t), df))
elif alternative == 'greater':
p_value = 1 - stats.t.cdf(t, df)
else: # 'less'
p_value = stats.t.cdf(t, df)
conclusion = 'Reject H0' if p_value < alpha else 'Fail to Reject H0'
return t, p_value, conclusion
def f_test_variance(s1_squared, s2_squared, n1, n2, alternative='two-sided', alpha=0.05):
F = s1_squared / s2_squared
df1, df2 = n1 - 1, n2 - 1
if alternative == 'two-sided':
p_value = 2 * min(stats.f.cdf(F, df1, df2), 1 - stats.f.cdf(F, df1, df2))
elif alternative == 'greater':
p_value = 1 - stats.f.cdf(F, df1, df2)
else: # 'less'
p_value = stats.f.cdf(F, df1, df2)
conclusion = 'Reject H0' if p_value < alpha else 'Fail to Reject H0'
return F, p_value, conclusion
def chi_square_variance_test(s_squared, sigma0_squared, n, alternative='two-sided', alpha=0.05):
chi2 = (n - 1) * s_squared / sigma0_squared
df = n - 1
if alternative == 'two-sided':
p_value = 2 * min(stats.chi2.cdf(chi2, df), 1 - stats.chi2.cdf(chi2, df))
elif alternative == 'greater':
p_value = 1 - stats.chi2.cdf(chi2, df)
else: # 'less'
p_value = stats.chi2.cdf(chi2, df)
conclusion = 'Reject H0' if p_value < alpha else 'Fail to Reject H0'
return chi2, p_value, conclusion
def chi_square_test_independence(observed_table, alpha=0.05):
chi2, p_value, dof, expected = stats.chi2_contingency(observed_table)
conclusion = 'Reject H0' if p_value < alpha else 'Fail to Reject H0'
return chi2, p_value, conclusion
def chi_square_goodness_of_fit(observed, expected_probs, alpha=0.05):
expected = np.array(expected_probs) * sum(observed)
chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
conclusion = 'Reject H0' if p_value < alpha else 'Fail to Reject H0'
return chi2, p_value, conclusion
def two_sample_t_test_unpooled(x1_bar, x2_bar, s1, s2, n1, n2, alternative='two-sided', alpha=0.05): #compares 2 means without assuming equal variances
numerator = x1_bar - x2_bar
denominator = np.sqrt((s1**2 / n1) + (s2**2 / n2))
t_stat = numerator / denominator
# Degrees of freedom (Welch–Satterthwaite equation)
df_num = (s1**2 / n1 + s2**2 / n2) ** 2
df_denom = ((s1**2 / n1) ** 2) / (n1 - 1) + ((s2**2 / n2) ** 2) / (n2 - 1)
df = df_num / df_denom
if alternative == 'two-sided':
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
elif alternative == 'greater':
p_value = 1 - stats.t.cdf(t_stat, df)
else: # 'less'
p_value = stats.t.cdf(t_stat, df)
conclusion = 'Reject H0' if p_value < alpha else 'Fail to Reject H0'
return t_stat, p_value, conclusion
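# Quick sanity check (added as an illustration, not part of the original pipeline):
# the numbers come from the horse-anesthetic example in the run at the end of this notebook.
z, p, conclusion = one_sample_z_test_mean(x_bar=18.86, mu_0=20, sigma=8.6, n=73,
                                          alternative='less', alpha=0.10)
print(z, p, conclusion)  # approximately -1.1326, 0.1287, 'Fail to Reject H0'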
Building the Model¶
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
import joblib
import os
# Load dataset
df = pd.read_csv("Hypothesis_Testing_Dataset.csv")
# Required columns
text = df['Problem_Text']
y_test_type = df['Test_Type']
y_alt = df['Alternative']
# fields to predict using ML
numerical_fields = ['n','n1','n2', 'x','x1','x2', 'p0', 'mu_0','x_bar','x1_bar','x2_bar','s','s1','s2', 'sigma', 'sigma_known','s_squared','s2_squared','sigma0_squared','alpha']
df[numerical_fields] = df[numerical_fields].fillna(0)
df = df.dropna(subset=['Problem_Text', 'Test_Type', 'Alternative'] + numerical_fields)
text = df['Problem_Text']
y_test_type = df['Test_Type']
y_alt = df['Alternative']
# Train/test split (one split with a fixed random_state so the text and every target stay aligned)
(X_train_text, X_test_text,
 y_train_type, y_type_holdout,
 y_train_alt, y_alt_holdout,
 y_train_values, y_values_holdout) = train_test_split(
    text, y_test_type, y_alt, df[numerical_fields],
    test_size=0.2, random_state=42)
X_train_values, X_test_values = X_train_text, X_test_text  # the same text features feed the value extractor
# Test type classifier
test_type_model = Pipeline([
('tfidf', TfidfVectorizer()),
('clf', LogisticRegression(max_iter=1000))
])
test_type_model.fit(X_train_text, y_train_type)
# Alternative hypothesis classifier
alt_model = Pipeline([
('tfidf', TfidfVectorizer()),
('clf', LogisticRegression(max_iter=1000))
])
alt_model.fit(X_train_text, y_train_alt)
# Numerical and boolean field predictor (multi-output regression; sigma_known is rounded to a boolean downstream)
value_model = Pipeline([
('tfidf', TfidfVectorizer()),
('reg', MultiOutputRegressor(LinearRegression()))
])
value_model.fit(X_train_values, y_train_values)
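# Quick look at the held-out split (illustrative addition, not in the original notebook);
# with only ~20 problems these scores are extremely noisy and serve mostly as a smoke test.
print("Test-type accuracy:", test_type_model.score(X_test_text, y_type_holdout))
print("Alternative accuracy:", alt_model.score(X_test_text, y_alt_holdout))
print("Value-extraction R^2:", value_model.score(X_test_values, y_values_holdout))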
# Create model directory
#os.makedirs("models", exist_ok=True)
# Save models
#joblib.dump(test_type_model, 'models/test_type_model.pkl')
#joblib.dump(alt_model, 'models/alt_model.pkl')
#joblib.dump(value_model, 'models/value_extractor_model.pkl')
# Output a sample of the training data used for value extraction
#y_train_values.head()
# Load models
#test_type_model = joblib.load('test_type_model.pkl')
#alt_model = joblib.load('alt_model.pkl')
#value_model = joblib.load('models/value_extractor_model.pkl')
#def ml_predict(problem_text):
# predicted_type = test_type_model.predict([problem_text])[0]
# alternative = alt_model.predict([problem_text])[0]
# values = value_model.predict([problem_text])[0]
# return predicted_type, alternative, values
Define the prediction function:
def ml_predict(problem_text):
predicted_type = test_type_model.predict([problem_text])[0]
alternative = alt_model.predict([problem_text])[0]
values = value_model.predict([problem_text])[0]
return predicted_type, alternative, values
Helper that converts standard deviations to variances (and vice versa)¶
The goal is to handle problems where our equations require a variance but we are given a standard deviation, or vice versa. One caveat: because the ML extractor fills in every field (with 0 as a default), the "key missing" checks below only fire when a key is genuinely absent, so a zero-value check may be needed in a future revision.
import math
def auto_handle_sd_variance_conversions(predicted_type, extracted_values):
# === For Mean Tests (Z-test expects sigma) ===
# === For T-Tests (Sample SD used, but variance may be given) ===
if predicted_type in ['mean', 'mean_diff']:
if 's' not in extracted_values and 's_squared' in extracted_values:
extracted_values['s'] = math.sqrt(extracted_values['s_squared'])
if 's1' not in extracted_values and 's1_squared' in extracted_values:
extracted_values['s1'] = math.sqrt(extracted_values['s1_squared'])
if 's2' not in extracted_values and 's2_squared' in extracted_values:
extracted_values['s2'] = math.sqrt(extracted_values['s2_squared'])
# === For Variance Tests (Chi-Square) ===
if predicted_type == 'variance':
if 's_squared' not in extracted_values and 's' in extracted_values:
extracted_values['s_squared'] = extracted_values['s'] ** 2
# === For F-Test (Variance Ratio) ===
if predicted_type == 'var_ratio':
if 's1_squared' not in extracted_values and 's1' in extracted_values:
extracted_values['s1_squared'] = extracted_values['s1'] ** 2
if 's2_squared' not in extracted_values and 's2' in extracted_values:
extracted_values['s2_squared'] = extracted_values['s2'] ** 2
return extracted_values
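# Example (made-up values): a variance test where only the sample standard deviation was extracted.
vals = auto_handle_sd_variance_conversions('variance', {'s': 3.0, 'n': 20, 'sigma0_squared': 4.0})
print(vals['s_squared'])  # 9.0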
Hypothesis Builder:¶
def build_hypotheses(predicted_type, extracted_values):
alt = extracted_values.get('alternative', 'two-sided')
operator = {'greater': '>', 'less': '<', 'two-sided': '≠'}.get(alt, '≠')
if predicted_type in ['mean', 'mean_diff']:
param = 'μ'
null_value = extracted_values.get('mu_0')
elif predicted_type in ['proportion', 'proportion_diff']:
param = 'p'
null_value = extracted_values.get('p0')
elif predicted_type == 'variance':
param = 'σ²'
null_value = extracted_values.get('sigma0_squared')
elif predicted_type == 'var_ratio':
param = 'σ₁² / σ₂²'
null_value = 1
elif predicted_type == 'goodness_of_fit':
return ("H₀: Data fits the expected distribution.",
"H₁: Data does not fit the expected distribution.")
elif predicted_type == 'independence':
return ("H₀: Variables are independent.",
"H₁: Variables are dependent.")
else:
return ("H₀: Undefined", "H₁: Undefined")
return f"H₀: {param} = {null_value}", f"H₁: {param} {operator} {null_value}"
Deterministic Logic Controller to be applied after the model runs:¶
We might not need these rules if we train the model on enough data, but they help us pick the correct test in well-established cases, such as n ≥ 30 or a known sigma for a one-sample mean Z test.
def decide_statistical_test(predicted_type, extracted_values):
"""
Parameters:
predicted_type (str): Output from ML model (e.g., 'mean', 'proportion', 'variance', 'independence', 'goodness_of_fit', 'var_ratio')
extracted_values (dict): Contains extracted info like n, sigma_known, proportions_list, etc.
Returns:
str: The exact test to run (matches your function names)
"""
# Handle MEAN tests
if predicted_type == 'mean':
n = extracted_values.get('n', 0)
sigma_known = extracted_values.get('sigma_known', False)
if sigma_known or n >= 30:
return 'one_sample_z_test_mean'
else:
return 'one_sample_t_test_mean'
# Handle DIFFERENCE IN MEANS (assuming ML predicts 'mean_diff')
if predicted_type == 'mean_diff':
pooled = extracted_values.get('pooled_variance', False)
if pooled:
return 'two_sample_t_test_pooled'
else:
            return 'two_sample_t_test_unpooled'  # Welch's test, defined above
# Handle PROPORTION tests
if predicted_type == 'proportion':
return 'one_proportion_z_test'
if predicted_type == 'proportion_diff':
return 'two_proportion_z_test'
# Handle VARIANCE tests
if predicted_type == 'variance':
return 'chi_square_variance_test'
# Handle VARIANCE RATIO (var/var)
if predicted_type == 'var_ratio':
return 'f_test_variance'
# Handle CHI-SQUARE INDEPENDENCE
if predicted_type == 'independence':
return 'chi_square_test_independence'
# Handle GOODNESS OF FIT
if predicted_type == 'goodness_of_fit':
proportions = extracted_values.get('expected_probs', [])
if len(proportions) > 0:
return 'chi_square_goodness_of_fit'
else:
raise ValueError("Goodness of fit test requires expected proportions.")
raise ValueError("Unknown test type or insufficient data.")
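# Examples of the rule logic (made-up values):
print(decide_statistical_test('mean', {'n': 25, 'sigma_known': False}))  # one_sample_t_test_mean
print(decide_statistical_test('mean', {'n': 25, 'sigma_known': True}))   # one_sample_z_test_mean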
This Full Pipeline function ties it all together:¶
def hypothesis_test_pipeline(predicted_type, extracted_values):
test = decide_statistical_test(predicted_type, extracted_values)
if test == 'one_proportion_z_test':
return one_proportion_z_test(
x=extracted_values['x'],
n=extracted_values['n'],
p0=extracted_values['p0'],
alternative=extracted_values['alternative'],
alpha=extracted_values['alpha']
)
    elif test == 'two_proportion_z_test':
        return two_proportion_z_test(
            x1=extracted_values['x1'],
            n1=extracted_values['n1'],
            x2=extracted_values['x2'],
            n2=extracted_values['n2'],
            alternative=extracted_values['alternative'],
            alpha=extracted_values['alpha']
        )
elif test == 'one_sample_z_test_mean':
return one_sample_z_test_mean(
x_bar=extracted_values['x_bar'],
mu_0=extracted_values['mu_0'],
sigma=extracted_values['sigma'],
n=extracted_values['n'],
alternative=extracted_values['alternative'],
alpha=extracted_values['alpha']
)
elif test == 'one_sample_t_test_mean':
return one_sample_t_test_mean(
x_bar=extracted_values['x_bar'],
mu_0=extracted_values['mu_0'],
s=extracted_values['s'],
n=extracted_values['n'],
alternative=extracted_values['alternative'],
alpha=extracted_values['alpha']
)
    elif test == 'two_sample_t_test_pooled':
        return two_sample_t_test_pooled(
            x1_bar=extracted_values['x1_bar'],
            x2_bar=extracted_values['x2_bar'],
            s1=extracted_values['s1'],
            s2=extracted_values['s2'],
            n1=extracted_values['n1'],
            n2=extracted_values['n2'],
            alternative=extracted_values['alternative'],
            alpha=extracted_values['alpha']
        )
    elif test == 'two_sample_t_test_unpooled':
        return two_sample_t_test_unpooled(
            x1_bar=extracted_values['x1_bar'],
            x2_bar=extracted_values['x2_bar'],
            s1=extracted_values['s1'],
            s2=extracted_values['s2'],
            n1=extracted_values['n1'],
            n2=extracted_values['n2'],
            alternative=extracted_values['alternative'],
            alpha=extracted_values['alpha']
        )
    elif test == 'chi_square_variance_test':
        return chi_square_variance_test(
            s_squared=extracted_values['s_squared'],
            sigma0_squared=extracted_values['sigma0_squared'],
            n=extracted_values['n'],
            alternative=extracted_values['alternative'],
            alpha=extracted_values['alpha']
        )
    elif test == 'f_test_variance':
        return f_test_variance(
            s1_squared=extracted_values['s1_squared'],
            s2_squared=extracted_values['s2_squared'],
            n1=extracted_values['n1'],
            n2=extracted_values['n2'],
            alternative=extracted_values['alternative'],
            alpha=extracted_values['alpha']
        )
elif test == 'chi_square_test_independence':
return chi_square_test_independence(
observed_table=extracted_values['observed_table'],
alpha=extracted_values['alpha']
)
elif test == 'chi_square_goodness_of_fit':
return chi_square_goodness_of_fit(
observed=extracted_values['observed'],
expected_probs=extracted_values['expected_probs'],
alpha=extracted_values['alpha']
)
return (0, 1, 'No valid test run')
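# The dispatcher can also be exercised with hand-filled values, bypassing the ML extractor
# (illustrative only; numbers taken from the worked example at the end of this notebook).
vals = {'x_bar': 18.86, 'mu_0': 20, 'sigma': 8.6, 'n': 73, 'sigma_known': True,
        'alternative': 'less', 'alpha': 0.10}
print(hypothesis_test_pipeline('mean', vals))  # approximately (-1.1326, 0.1287, 'Fail to Reject H0')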
def full_hypothesis_testing_pipeline(problem_text):
predicted_type, alternative, predicted_array = ml_predict(problem_text)
# Normalize predicted_type if needed
type_map = {
'one_sample_t_test_mean': 'mean',
'one_sample_z_test_mean': 'mean',
'two_sample_t_test_unpooled': 'mean_diff',
'two_sample_t_test_pooled': 'mean_diff',
'one_proportion_z_test': 'proportion',
'two_proportion_z_test': 'proportion_diff',
'chi_square_variance_test': 'variance',
'f_test_variance': 'var_ratio',
'chi_square_test_independence': 'independence',
'chi_square_goodness_of_fit': 'goodness_of_fit'
}
predicted_type = type_map.get(predicted_type, predicted_type)
extracted_values = dict(zip(numerical_fields, predicted_array))
# Sanitize numerical predictions
for key in extracted_values:
if key != 'alternative' and isinstance(extracted_values[key], (int, float)):
if key == 'sigma_known':
extracted_values[key] = bool(round(extracted_values[key])) # force bool
elif extracted_values[key] < 0:
extracted_values[key] = 0 # no negative values for σ, α, etc.
extracted_values['alternative'] = alternative
print("\n=== DEBUG INFO ===")
print("Predicted Type:", predicted_type)
print("Alternative:", alternative)
print("Extracted Values:", extracted_values)
extracted_values = auto_handle_sd_variance_conversions(predicted_type, extracted_values)
H0, H1 = build_hypotheses(predicted_type, extracted_values)
test_to_run = decide_statistical_test(predicted_type, extracted_values)
result = hypothesis_test_pipeline(predicted_type, extracted_values)
print(f"\n📊 Problem: {problem_text}\n")
print(f"Hypotheses:\n{H0}\n{H1}")
print(f"\nSelected Test: {test_to_run}")
print(f"Test Statistic: {result[0]:.4f}")
print(f"P-Value: {result[1]:.4f}")
print(f"Conclusion: {result[2]}")
RUN SCRIPT BELOW:¶
# === Run Script ===
if __name__ == "__main__":
problem_text = input("Enter your hypothesis testing problem:\n")
full_hypothesis_testing_pipeline(problem_text)
=== DEBUG INFO ===
Predicted Type: mean
Alternative: less
Extracted Values: {'n': 73.00000258041263, 'n1': 0, 'n2': 1.304862060180767e-05, 'x': 0.0, 'x1': 0, 'x2': 0, 'p0': 0.0, 'mu_0': 20.0000032430679, 'x_bar': 18.860005920380377, 'x1_bar': 0.0005335929517968907, 'x2_bar': 0.0003833274267890374, 's': 0, 's1': 5.98005486835973e-05, 's2': 1.6086698337858252e-06, 'sigma': 8.599997685659842, 'sigma_known': True, 's_squared': 5.532176710065784e-07, 's2_squared': 0, 'sigma0_squared': 2.079422757095273e-06, 'alpha': 0.09999996142928147, 'alternative': 'less'}

📊 Problem: Minor surgery on horses under field conditions requires a reliable short-term anesthetic producing good muscle relaxation, minimal cardiovascular and respiratory changes, and a quick, smooth recovery with minimal after effects so that horses can be left unattended. A study reports for a sample of 73 horses to which the medicine is administered, the sample average recumbency time was 18.86 minutes. The recumbency time is know nto be normally distributed with a standard deviation of 8.6 minutes. Does this data suggest that true average lateral recumbency time under these conditions is less than 20 minutes at 0.10 level of significance?

Hypotheses:
H₀: μ = 20.0000032430679
H₁: μ < 20.0000032430679

Selected Test: one_sample_z_test_mean
Test Statistic: -1.1326
P-Value: 0.1287
Conclusion: Fail to Reject H0
Results¶
The model extracted n = 73, alpha ≈ 0.0999, sigma ≈ 8.6, x̄ = 18.86, and a null mean of 20, and it chose the correct test and the correct hypotheses. It worked decently on this question (which was included in the dataset), but it does not perform as well on new questions. Much more data is needed to determine whether this approach could be genuinely useful for solving these problems.