Contextual Advertising: Supervised Text Classification¶
Imagine you're a media buying company, Chrishare, and you have a new client, Theragun.
Theragun knows that consumers who value health and wellness are more likely to consider, and ultimately buy, their product. So they'd like to find health and wellness news around the web to advertise on. The goal of their media campaign is to identify as many news articles that mention health and wellness as possible, though we will also compare model results against a money-saving oriented goal.
This is called contextual advertising: finding the URLs that match the context in which you'd like your ad to be shown.
Our challenge is to build a deep learning model that predicts the probability that a news story is about health and wellness, using ktrain.
Dataset¶
This dataset contains around 200k news headlines from 2012 to 2018, obtained from HuffPost. A model trained on this dataset could be used to identify tags for untracked news articles or to characterize the type of language used in different news articles. Each news headline has a corresponding category (Health, Wellness, Entertainment, Politics, Sports, etc.).
To prepare for machine learning, we created the classes "Health" and "Not_Health", where "Health" contains entries labeled WELLNESS or HEALTHY LIVING, and "Not_Health" contains everything else. This produced a split of about 176k articles for Not_Health and about 24k for Health.
Methodology¶
To detect whether a news article belongs to the health/wellness category, we fine-tuned a transformer-based text classifier using ktrain, a wrapper over TensorFlow/Keras and Hugging Face transformers.
For the input representation, each article was represented by concatenating the headline and short_description, separated by a [SEP] token. Including the short_description should provide richer context than using headlines alone. Before training, missing descriptions were replaced with empty strings, and duplicate/null rows were removed.
Model¶
A BERT-based classifier built with ktrain's text.text_classifier("bert", ...) was used. BERT embeddings capture semantic meaning at the subword and sentence level, making them well suited for contextual classification. The model was configured with a maximum sequence length of 96 tokens, which should balance capturing multi-sentence inputs against keeping computation manageable.
Class Imbalance¶
The dataset is highly imbalanced (about 176k Not_Health vs. 24k Health entries). To counter this, class weights were computed with sklearn's compute_class_weight: the minority class was upweighted and the majority class downweighted. During training, this rescales the loss function so that misclassifying Health articles is penalized more heavily, preventing the model from defaulting to predicting the majority class.
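As a rough sanity check, the "balanced" weighting formula (n_samples / (n_classes * n_samples_in_class)) can be reproduced by hand from the class counts above; the values differ slightly from the weights printed later because compute_class_weight runs on the 80% training split after deduplication.
# Sketch: deriving "balanced" class weights from the full-dataset label counts
n_health, n_not_health = 24_521, 176_332
n_total, n_classes = n_health + n_not_health, 2
w_health = n_total / (n_classes * n_health)          # ~4.10 (minority upweighted)
w_not_health = n_total / (n_classes * n_not_health)  # ~0.57 (majority downweighted)
print({"Health": round(w_health, 4), "Not_Health": round(w_not_health, 4)})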
Training Strategy¶
The one-cycle learning rate policy (fast convergence) was used with a maximum learning rate of 2e-5 for 3 epochs, a standard setup for fine-tuning BERT on medium-sized text classification tasks. The batch size (the number of samples propagated through the network per step) was raised to 96, and early stopping was not implemented since training runs for only 3 epochs.
Evaluation¶
The dataset was split 80/20 into training and validation sets. Model performance was evaluated on the validation set using ktrain's built-in validate method, which reports precision, recall, F1-score, and support for each class. This provides insight into overall accuracy and the trade-off between catching more Health articles (recall) and minimizing false positives (precision). A confusion matrix was included to better track false negatives (missed opportunities) and false positives (wasted ad spend).
Threshold Tuning: Wasted Ad Dollars vs. Missed Opportunities¶
Tuning the probability threshold can save ad dollars by serving ads only to articles classified as Health with high confidence, but doing so reduces the total number of real Health articles captured. The threshold can also be tuned in the other direction to catch nearly all Health articles, at the cost of wasted ads.
The default probability threshold is p(health) >= 0.5: a probability of 0.51 for p(health) is enough to classify an article as "Health".
This cutoff can be raised to require more confident "Health" classifications. This option can be useful for Theragun because ads will only be shown on high-confidence health pages; maximizing precision saves ad dollars.
The threshold can also be lowered to increase recall, maximizing the number of Health pages we reach, at the expense of more misplaced ads. A small sketch of how the cutoff changes predictions is shown below.
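A minimal sketch of thresholded classification, assuming p_health holds the model's per-article probabilities for the Health class (the array below is illustrative only, not model output):
import numpy as np
# Illustrative probabilities only; real values come from learner.model.predict
p_health = np.array([0.51, 0.97, 0.30, 0.04])
for threshold in (0.05, 0.50, 0.97):
    pred = np.where(p_health >= threshold, "Health", "Not_Health")
    print(threshold, pred.tolist())
# Raising the cutoff keeps only high-confidence Health predictions (higher precision,
# lower recall); lowering it flags almost everything as Health (higher recall, lower precision).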
Results¶
After training a supervised BERT-based classifier on the HuffPost headlines and short descriptions, performance was evaluated on a held-out validation set (20% of the ~200k total entries). The task was binary: Health (articles tagged WELLNESS or HEALTHY LIVING) vs. Not_Health (all other labeled categories).
Threshold¶
The validation set has 4914 Health and 35159 Not_Health entries.
The higher the threshold, the less coverage of the data we get (actual Health articles are rejected due to low confidence). At a threshold of 0.97, the model reached a precision of about 0.89 while keeping about 88.4% coverage of the data.
While about 11.6% of the total data was rejected by this threshold, the Health class shrank by about 26% (roughly 26% of Health articles were rejected due to low confidence), in exchange for far fewer false positives, maximizing precision and reducing wasted ad dollars.
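These percentages follow directly from the support counts in the 0.97-threshold report shown later:
# Sketch: coverage and Health-rejection rates implied by the 0.97-threshold report
kept_total, all_total = 35_422, 40_073
kept_health, all_health = 3_621, 4_914
print(f"data rejected:   {1 - kept_total / all_total:.1%}")    # ~11.6%
print(f"health rejected: {1 - kept_health / all_health:.1%}")  # ~26.3%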
To instead maximize the total number of Health articles classified, without regard to wasted ads, the threshold can be lowered. Lowering the threshold to 0.05 misses only 224 of the 4914 Health articles, but we would be sending out 3365 wasted ads to get 4690 well-placed ones. Compared to the base threshold, which sends out 1708 wasted ads to get 4443 well-placed ones, I do not think lowering the threshold is worth it.
Raising the threshold to 0.97 means sending out 426 wasted ads to get 3433 well-placed ones. More information on ad profitability is required to pick the best threshold.
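As a purely hypothetical illustration (the per-ad cost and value below are made-up assumptions, not campaign data), the true-positive and false-positive counts above could feed a simple break-even comparison once real profitability numbers are known:
# Hypothetical break-even sketch: cost_per_ad and value_per_good_placement are
# assumptions for illustration only; the TP/FP counts come from the results below.
counts = {
    0.05: {"tp": 4690, "fp": 3365},
    0.50: {"tp": 4443, "fp": 1708},
    0.97: {"tp": 3433, "fp": 426},
}
cost_per_ad = 1.00               # assumed cost of serving one ad
value_per_good_placement = 1.50  # assumed value of an ad on a true Health page
for t, c in counts.items():
    spend = (c["tp"] + c["fp"]) * cost_per_ad
    value = c["tp"] * value_per_good_placement
    print(f"threshold {t}: spend={spend:.0f}, value={value:.0f}, net={value - spend:.0f}")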
Classification Report & Confusion Matrix¶
Precision measures how many of the pages flagged as “Health” were truly Health. A precision of 0.8896 means that 88.96% of the ad placements would be on wellness-related pages (low wasted spend).
Recall measures how many of the actual Health pages were successfully caught. A recall of 0.9481 means detection of 94.81% of the available Health content (few missed opportunities). Articles rejected at the 0.97 threshold are not factored into this recall calculation.
F1-score is the harmonic mean of precision and recall, balancing the two into a single number.
Support shows how many validation samples belonged to each class, which helps interpret the scores.
Macro Avg averages the per-class scores equally, regardless of how rare each class is.
Weighted Avg averages the per-class scores weighted by their support (number of samples). A short worked example of these calculations follows.
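To make the definitions concrete, the Health-class scores in the 0.50-threshold report below can be recovered directly from its confusion matrix (tp=4443, fn=471, fp=1708):
# Sketch: recomputing the Health-class metrics from the 0.50-threshold confusion matrix
tp, fn, fp = 4443, 471, 1708
precision = tp / (tp + fp)                           # ~0.7223
recall = tp / (tp + fn)                              # ~0.9042
f1 = 2 * precision * recall / (precision + recall)   # ~0.8031
print(round(precision, 4), round(recall, 4), round(f1, 4))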
**Threshold: 0.05**
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Health | 0.58 | 0.95 | 0.72 | 4914 |
| Not_Health | 0.99 | 0.90 | 0.95 | 35159 |
| Accuracy |  |  | 0.91 | 40073 |
| Macro Avg | 0.79 | 0.93 | 0.83 | 40073 |
| Weighted Avg | 0.94 | 0.91 | 0.92 | 40073 |
|  | Pred Health | Pred Not_Health |
|---|---|---|
| True Health | 4690 | 224 |
| True Not_Health | 3365 | 31794 |
[4690, 224] -> Health articles correctly identified vs misclassified
[3365, 31794] -> Not_Health articles misclassified vs correctly identified
True positives (4690): Health articles correctly flagged
False negatives (224): Health articles mistakenly classified as Not_Health (lost opportunity)
False positives (3365): Not_Health predicted as Health (wasted ads)
True negatives (31794): Not_Health correctly predicted
**Threshold: 0.50**
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Health | 0.72 | 0.90 | 0.80 | 4914 |
| Not_Health | 0.99 | 0.95 | 0.97 | 35159 |
| Accuracy |  |  | 0.95 | 40073 |
| Macro Avg | 0.85 | 0.93 | 0.89 | 40073 |
| Weighted Avg | 0.95 | 0.95 | 0.95 | 40073 |
|  | Pred Health | Pred Not_Health |
|---|---|---|
| True Health | 4443 | 471 |
| True Not_Health | 1708 | 33451 |
[4443, 471] -> Health articles correctly identified vs misclassified
[1708, 33451] -> Not_Health articles misclassified vs correctly identified
True positives (4443): Health articles correctly flagged
False negatives (471): Health articles mistakenly classified as Not_Health (lost opportunity)
False positives (1708): Not_Health predicted as Health (wasted ads)
True negatives (33451): Not_Health correctly predicted
**Threshold: 0.97**
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Health | 0.8896 | 0.9481 | 0.9179 | 3621 |
| Not_Health | 0.9940 | 0.9866 | 0.9903 | 31801 |
| Accuracy |  |  | 0.9827 | 35422 |
| Macro Avg | 0.9418 | 0.9673 | 0.9541 | 35422 |
| Weighted Avg | 0.9834 | 0.9827 | 0.9829 | 35422 |
|  | Pred Health | Pred Not_Health |
|---|---|---|
| True Health | 3433 | 188 |
| True Not_Health | 426 | 31375 |
[3433, 188] -> Health articles correctly identified vs misclassified
[426, 31375] -> Not_Health articles misclassified vs correctly identified
True positives (3433): Health articles correctly flagged
False negatives (188): Health articles mistakenly classified as Not_Health (lost opportunity)
False positives (426): Not_Health predicted as Health (wasted ads)
True negatives (31375): Not_Health correctly predicted
Takeaway¶
At the 0.97 threshold, the model prioritizes precision (minimizing wasted ads) while maintaining strong recall (capturing the majority of wellness content), but it rejects about 26% of the Health articles in our validation data due to low confidence. This trade-off may be well aligned with Theragun's campaign goals: ensuring their ads appear in high-quality health contexts without overspending on irrelevant inventory.
At the 0.50 threshold, we identify roughly 1000 additional true Health articles (about a 30% increase) at the cost of more false positives. There are 4x more false positives, which equates to roughly 1300 more wasted ads compared to the 0.97 threshold, and that could be costly.
At the 0.05 threshold, only about 250 additional Health articles are identified compared to 0.50, at the cost of 3365 total false positives (roughly 1650 more wasted ads than at 0.50). More information on the profitability of the ads is required to know which approach is best.
Future Research & Improvements¶
Future research could look into the low-confidence articles specifically, get to the root cause of why the model is not confident about them, and adjust parameters from there.
If there is some probability that misclassified ads still convert into sales, then misclassifying a proportion of ads may be more tolerable. After running the campaign, the misclassified ads can be analyzed to check whether any converted. From there, the company could expand the target domain to include other categories.
Possible improvements to increase recall and maximize the number of well-placed health ads: increasing the weight on the Health class, undersampling Not_Health or oversampling Health, increasing the token length to 128, trying a different model backbone (RoBERTa-base or DistilBERT), or supplementing the current model with a simpler model (TF-IDF features with logistic regression or an SVM) that flags extra health candidates and taking the union of the positives (sketched below).
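A minimal sketch of that last idea, assuming a lightweight TF-IDF + logistic regression model is trained separately and its positive predictions are unioned with BERT's (texts, labels, and bert_flags below are placeholders, not project data):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
# Placeholder training data; in practice this would be the same headline + description text
texts = ["New study links sleep quality to heart health", "Senate passes the budget bill"]
labels = ["Health", "Not_Health"]
aux = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000, class_weight="balanced"))
aux.fit(texts, labels)
bert_flags = np.array([True, False])            # placeholder: BERT's Health predictions
aux_flags = aux.predict(texts) == "Health"      # auxiliary model's Health predictions
union_flags = bert_flags | aux_flags            # serve an ad if either model says Health
print(union_flags)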
import json
import pandas as pd
path = "news_category_trainingdata.json"
with open(path, "r", encoding="utf-8") as f:
data = json.load(f)
# Each top-level key is a column, each value is a dict of {row_index: value}
# Rebuild rows by zipping values across columns
df = pd.DataFrame({col: pd.Series(vals) for col, vals in data.items()})
# Reset index and drop NAs
df = df.reset_index(drop=True)
df = df.dropna(subset=["category", "headline"])
print(df.shape)
print(df.head())
print(df["category"].value_counts().head())
(200853, 6)
category headline \
0 CRIME There Were 2 Mass Shootings In Texas Last Week...
1 ENTERTAINMENT Will Smith Joins Diplo And Nicky Jam For The 2...
2 ENTERTAINMENT Hugh Grant Marries For The First Time At Age 57
3 ENTERTAINMENT Jim Carrey Blasts 'Castrato' Adam Schiff And D...
4 ENTERTAINMENT Julianna Margulies Uses Donald Trump Poop Bags...
authors link \
0 Melissa Jeltsen https://www.huffingtonpost.com/entry/texas-ama...
1 Andy McDonald https://www.huffingtonpost.com/entry/will-smit...
2 Ron Dicker https://www.huffingtonpost.com/entry/hugh-gran...
3 Ron Dicker https://www.huffingtonpost.com/entry/jim-carre...
4 Ron Dicker https://www.huffingtonpost.com/entry/julianna-...
short_description date
0 She left her husband. He killed their children... 2018-05-26
1 Of course it has a song. 2018-05-26
2 The actor and his longtime girlfriend Anna Ebe... 2018-05-26
3 The actor gives Dems an ass-kicking for not fi... 2018-05-26
4 The "Dietland" actress said using the bags is ... 2018-05-26
category
POLITICS 32739
WELLNESS 17827
ENTERTAINMENT 16058
TRAVEL 9887
STYLE & BEAUTY 9649
Name: count, dtype: int64
# Add Labels for Health/Wellness
health_cats = {"WELLNESS", "HEALTHY LIVING"}
df["label_str"] = df["category"].apply(lambda x: "Health" if x in health_cats else "Not_Health")
print(df["label_str"].value_counts())
label_str
Not_Health    176332
Health         24521
Name: count, dtype: int64
import ktrain
from ktrain import text
import warnings
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
warnings.filterwarnings("ignore")
#Build a stronger input field: headline + short_description
df["short_description"] = df["short_description"].fillna("")
df["text"] = (df["headline"].astype(str).str.strip()
+ " [SEP] "
+ df["short_description"].astype(str).str.strip())
# Clean/dedup
df = df.dropna(subset=["text", "label_str"])
df = df[df["text"].str.len() > 0].drop_duplicates(subset=["text"]).reset_index(drop=True)
# Class names must match labels
class_names = ["Health", "Not_Health"]
# Split into train/val and preprocess with BERT (96 tokens)
(x_train, y_train), (x_val, y_val), preproc = text.texts_from_array(
x_train = df["text"].values,
y_train = df["label_str"].values,
class_names = class_names,
val_pct = 0.2,
maxlen = 96,
preprocess_mode = "bert",
random_state = 42
)
# Class-weighted loss to counter imbalance
classes = np.arange(len(preproc.get_classes())) # e.g., [0,1]
y_train_ids = y_train.argmax(1) if getattr(y_train, "ndim", 1) == 2 else y_train
cw = compute_class_weight(class_weight="balanced", classes=classes, y=y_train_ids)
class_weight = {int(i): float(w) for i, w in zip(classes, cw)}
print("Class weights:", class_weight)
# Build classifier
model = text.text_classifier("bert", (x_train, y_train), preproc=preproc)
# Wrap in learner
learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_val, y_val), batch_size=96)
# Train (3 epochs is usually enough for BERT on this dataset; no early stopping needed with only 3 epochs)
learner.fit_onecycle(2e-5, 3, class_weight=class_weight)
# Evaluate
actual_classes = preproc.get_classes()
print("ktrain class order:", actual_classes)
learner.validate(val_data=(x_val, y_val), class_names=actual_classes)
preprocessing train... language: en
Is Multi-Label? False preprocessing test... language: en
task: text classification
Class weights: {0: 4.089916309450908, 1: 0.5696395064536305}
Is Multi-Label? False
maxlen is 96
done.
begin training using onecycle policy with max lr of 2e-05...
Epoch 1/3
5010/5010 [==============================] - 31246s 6s/step - loss: 0.2216 - accuracy: 0.9074 - val_loss: 0.2146 - val_accuracy: 0.9109
Epoch 2/3
5010/5010 [==============================] - 28989s 6s/step - loss: 0.1566 - accuracy: 0.9315 - val_loss: 0.1772 - val_accuracy: 0.9300
Epoch 3/3
5010/5010 [==============================] - 28980s 6s/step - loss: 0.0803 - accuracy: 0.9644 - val_loss: 0.1639 - val_accuracy: 0.9456
ktrain class order: ['Health', 'Not_Health']
1253/1253 [==============================] - 2123s 2s/step
precision recall f1-score support
Health 0.72 0.90 0.80 4914
Not_Health 0.99 0.95 0.97 35159
accuracy 0.95 40073
macro avg 0.85 0.93 0.89 40073
weighted avg 0.95 0.95 0.95 40073
array([[ 4443, 471],
[ 1708, 33451]], dtype=int64)
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
# Use the tensors returned by texts_from_array
x_inputs = x_val # tuple/list of arrays for transformers
y_true = y_val.argmax(1) if y_val.ndim == 2 else y_val
# Model probabilities -> class ids
probs = learner.model.predict(x_inputs, verbose=0)
y_pred = probs.argmax(1)
# Class names from the preprocessor (source of truth)
names = preproc.get_classes()
print(classification_report(y_true, y_pred, target_names=names, digits=4))
print(confusion_matrix(y_true, y_pred))
precision recall f1-score support
Health 0.7223 0.9042 0.8031 4914
Not_Health 0.9861 0.9514 0.9685 35159
accuracy 0.9456 40073
macro avg 0.8542 0.9278 0.8858 40073
weighted avg 0.9538 0.9456 0.9482 40073
[[ 4443 471]
[ 1708 33451]]
# setting a threshold of 0.75 for both classes, rejecting the rest, rechecking validation
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
def evaluate_with_reject_from_probs(probs, y_val, classes, threshold=0.75):
"""
Evaluate with a reject option using precomputed class probabilities.
probs: np.ndarray, shape (N, 2) -> model probabilities per class
y_val: one-hot or int labels
classes: list like ['Health', 'Not_Health']
"""
# true labels as ints
y_true = y_val.argmax(1) if getattr(y_val, "ndim", 1) == 2 else y_val
# class indices
h_idx = classes.index('Health')
nh_idx = classes.index('Not_Health')
p_health = probs[:, h_idx]
p_not = probs[:, nh_idx]
# apply thresholds with reject (to not reduce recall)
y_pred = np.full_like(y_true, fill_value=-1)
y_pred[p_health >= threshold] = h_idx
y_pred[(p_not >= threshold) & (y_pred == -1)] = nh_idx
# keep confident only
mask = (y_pred != -1)
kept_true, kept_pred = y_true[mask], y_pred[mask]
print(f"Threshold = {threshold}")
print(f"Kept {mask.sum()}/{len(y_pred)} samples ({100*mask.mean():.1f}% coverage)\n")
if mask.sum() == 0:
print("No samples met the threshold.")
return
print("Classification Report (confident samples only):")
print(classification_report(kept_true, kept_pred, target_names=classes, digits=4))
print("Confusion Matrix:")
print(confusion_matrix(kept_true, kept_pred))
# get probs from the model (since x_val is tokenized tensors)
probs = learner.model.predict(x_val, verbose=0) # shape (N, 2)
classes = preproc.get_classes() # ['Health','Not_Health']
# run evaluation with reject threshold=0.75
evaluate_with_reject_from_probs(probs, y_val, classes, threshold=0.75)
Threshold = 0.75
Kept 38738/40073 samples (96.7% coverage)
Classification Report (confident samples only):
precision recall f1-score support
Health 0.7762 0.9253 0.8442 4603
Not_Health 0.9897 0.9640 0.9767 34135
accuracy 0.9594 38738
macro avg 0.8829 0.9446 0.9104 38738
weighted avg 0.9643 0.9594 0.9609 38738
Confusion Matrix:
[[ 4259 344]
[ 1228 32907]]
print(" run evaluation with reject threshold=0.80")
evaluate_with_reject_from_probs(probs, y_val, classes, threshold=0.80)
print(" run evaluation with reject threshold=0.85")
evaluate_with_reject_from_probs(probs, y_val, classes, threshold=0.85)
print(" run evaluation with reject threshold=0.90")
evaluate_with_reject_from_probs(probs, y_val, classes, threshold=0.90)
print(" run evaluation with reject threshold=0.95")
evaluate_with_reject_from_probs(probs, y_val, classes, threshold=0.95)
print(" run evaluation with reject threshold 0.97")
evaluate_with_reject_from_probs(probs, y_val, classes, threshold=0.97)
print(" run evaluation with reject threshold 0.98")
evaluate_with_reject_from_probs(probs, y_val, classes, threshold=0.98)
print(" run evaluation with reject threshold 0.99")
evaluate_with_reject_from_probs(probs, y_val, classes, threshold=0.99)
run evaluation with reject threshold=0.80
Threshold = 0.8
Kept 38374/40073 samples (95.8% coverage)
Classification Report (confident samples only):
precision recall f1-score support
Health 0.7909 0.9297 0.8547 4509
Not_Health 0.9904 0.9673 0.9787 33865
accuracy 0.9629 38374
macro avg 0.8907 0.9485 0.9167 38374
weighted avg 0.9670 0.9629 0.9641 38374
Confusion Matrix:
[[ 4192 317]
[ 1108 32757]]
run evaluation with reject threshold=0.85
Threshold = 0.85
Kept 37948/40073 samples (94.7% coverage)
Classification Report (confident samples only):
precision recall f1-score support
Health 0.8074 0.9339 0.8661 4400
Not_Health 0.9911 0.9708 0.9809 33548
accuracy 0.9665 37948
macro avg 0.8993 0.9523 0.9235 37948
weighted avg 0.9698 0.9665 0.9675 37948
Confusion Matrix:
[[ 4109 291]
[ 980 32568]]
run evaluation with reject threshold=0.90
Threshold = 0.9
Kept 37342/40073 samples (93.2% coverage)
Classification Report (confident samples only):
precision recall f1-score support
Health 0.8289 0.9387 0.8804 4228
Not_Health 0.9920 0.9753 0.9836 33114
accuracy 0.9711 37342
macro avg 0.9105 0.9570 0.9320 37342
weighted avg 0.9736 0.9711 0.9719 37342
Confusion Matrix:
[[ 3969 259]
[ 819 32295]]
run evaluation with reject threshold=0.95
Threshold = 0.95
Kept 36292/40073 samples (90.6% coverage)
Classification Report (confident samples only):
precision recall f1-score support
Health 0.8641 0.9428 0.9017 3917
Not_Health 0.9930 0.9821 0.9875 32375
accuracy 0.9778 36292
macro avg 0.9285 0.9624 0.9446 36292
weighted avg 0.9791 0.9778 0.9782 36292
Confusion Matrix:
[[ 3693 224]
[ 581 31794]]
run evaluation with reject threshold 0.97
Threshold = 0.97
Kept 35422/40073 samples (88.4% coverage)
Classification Report (confident samples only):
precision recall f1-score support
Health 0.8896 0.9481 0.9179 3621
Not_Health 0.9940 0.9866 0.9903 31801
accuracy 0.9827 35422
macro avg 0.9418 0.9673 0.9541 35422
weighted avg 0.9834 0.9827 0.9829 35422
Confusion Matrix:
[[ 3433 188]
[ 426 31375]]
run evaluation with reject threshold 0.98
Threshold = 0.98
Kept 34607/40073 samples (86.4% coverage)
Classification Report (confident samples only):
precision recall f1-score support
Health 0.9096 0.9507 0.9297 3324
Not_Health 0.9947 0.9900 0.9923 31283
accuracy 0.9862 34607
macro avg 0.9522 0.9703 0.9610 34607
weighted avg 0.9866 0.9862 0.9863 34607
Confusion Matrix:
[[ 3160 164]
[ 314 30969]]
run evaluation with reject threshold 0.99
Threshold = 0.99
Kept 33082/40073 samples (82.6% coverage)
Classification Report (confident samples only):
precision recall f1-score support
Health 0.9435 0.9487 0.9461 2709
Not_Health 0.9954 0.9949 0.9952 30373
accuracy 0.9911 33082
macro avg 0.9694 0.9718 0.9706 33082
weighted avg 0.9912 0.9911 0.9912 33082
Confusion Matrix:
[[ 2570 139]
[ 154 30219]]
print(" run evaluation with reject threshold 0.05")
evaluate_with_reject_from_probs(probs, y_val, classes, threshold=0.05)
run evaluation with reject threshold 0.05
Threshold = 0.05
Kept 40073/40073 samples (100.0% coverage)
Classification Report (confident samples only):
precision recall f1-score support
Health 0.5822 0.9544 0.7233 4914
Not_Health 0.9930 0.9043 0.9466 35159
accuracy 0.9104 40073
macro avg 0.7876 0.9294 0.8349 40073
weighted avg 0.9426 0.9104 0.9192 40073
Confusion Matrix:
[[ 4690 224]
[ 3365 31794]]
first_description = df['short_description'].iloc[2]
first_headline = df['headline'].iloc[2]
print(first_headline)
print(first_description)
Hugh Grant Marries For The First Time At Age 57 The actor and his longtime girlfriend Anna Eberstein tied the knot in a civil ceremony.