Disaster Tweets Classification Using DistilBERT¶
Introduction¶
This project tackles the Natural Language Processing with Disaster Tweets Kaggle competition, where the goal is to classify tweets as real disaster-related (1) or not (0). We employed a state-of-the-art Transformer-based language model, DistilBERT, to learn and generalize from text data with minimal preprocessing.
Data Overview & EDA¶
The dataset consists of 7,613 labeled tweets, with each sample containing:
text: the content of the tweet
keyword and location: additional metadata (not used in our model)
target: binary label (1 = disaster, 0 = non-disaster)
EDA Observations:
Class imbalance is mild (57% non-disaster, 43% disaster)
Tweet lengths are mostly under 100 characters
The keyword and location columns contain missing values (61 and 2,533 respectively), so we dropped both columns
The data can be downloaded from Kaggle: https://www.kaggle.com/competitions/nlp-getting-started/data
Preprocessing & Tokenization¶
We used Hugging Face’s DistilBertTokenizerFast to:
Truncate/pad tweets to 128 tokens
Convert text into model-ready token IDs
Retain attention masks for downstream modeling (a brief sketch of the tokenizer output follows)
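For illustration, here is a minimal sketch of what the tokenizer returns for a single tweet-length string (the example text is hypothetical); the dictionary keys and the 0/1 attention-mask convention are the parts the model consumes downstream:

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
sample = tokenizer("Wildfire evacuation ordered downtown",  # hypothetical example text
                   truncation=True, padding="max_length", max_length=128)
print(list(sample.keys()))            # ['input_ids', 'attention_mask']
print(sample["input_ids"][:10])       # first few token IDs
print(sample["attention_mask"][:10])  # 1 = real token, 0 = padding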
Model Architecture¶
We fine-tuned DistilBertForSequenceClassification with a binary classification head:
Pretrained on distilbert-base-uncased
Final layer outputs two logits, one for class 0 and one for class 1 (see the sketch below)
Model trained with AdamW optimizer (lr=5e-5)
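As a minimal sanity-check sketch (not part of the original notebook), the classification head maps each input sequence to two logits, and an argmax over them yields the predicted class; the random dummy input below is purely for illustration:

import torch
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
dummy_ids = torch.randint(0, model.config.vocab_size, (1, 128))  # fake token IDs, illustration only
with torch.no_grad():
    logits = model(input_ids=dummy_ids).logits
print(logits.shape)  # torch.Size([1, 2]); argmax over dim 1 gives the predicted class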
Training Procedure¶
3 training epochs using a manual PyTorch loop
Batch size: 16 (train), 64 (validation)
Model trained on GPU via CUDA
Validation accuracy tracked each epoch
Training Results:
Epoch 1: Train Loss = 0.4402, Val Accuracy = 0.8319
Epoch 2: Train Loss = 0.3039, Val Accuracy = 0.8181
Epoch 3: Train Loss = 0.1903, Val Accuracy = 0.8188
Note: slight overfitting is observed from Epoch 2 onward (training loss keeps falling while validation accuracy plateaus), possibly due to the small validation set or lack of regularization.
Test-Time Inference¶
We encoded the 3,263 unlabeled test tweets with the same tokenizer, ran the fine-tuned model in inference mode to generate predictions, and saved them to submission.csv in the Kaggle submission format.
Kaggle Result¶
Leaderboard Score: 0.81244 (mean F1, the competition's evaluation metric). This result is consistent with a well-initialized but lightly tuned transformer baseline and clearly exceeds random or naive classifiers.
Discussion & Next Steps¶
The result suggests that transformer-based architectures can outperform traditional RNN or TF-IDF approaches even with minimal tuning. However, several improvements are possible:
Longer training with early stopping
Learning rate scheduling (see the sketch after this list)
Using auxiliary data (keyword, location)
Advanced techniques like ensembling or threshold tuning
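As a sketch of the first two ideas, a linear warmup/decay schedule could be layered on the AdamW optimizer from the training cell in the notebook code below; the step counts and 10% warmup fraction are assumptions, not values used in this project:

from transformers import get_linear_schedule_with_warmup

num_epochs = 3
num_training_steps = num_epochs * len(train_loader)   # train_loader as defined in the notebook below
scheduler = get_linear_schedule_with_warmup(
    optimizer,                                         # the AdamW optimizer from the notebook below
    num_warmup_steps=int(0.1 * num_training_steps),    # 10% warmup is an assumed choice
    num_training_steps=num_training_steps,
)
# Inside the training loop, call scheduler.step() right after optimizer.step().
# For early stopping, keep the model state with the best validation accuracy
# and stop once it has not improved for a set number of epochs.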
# Install dependencies (if needed)
# !pip install transformers datasets seaborn scikit-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments
# Load data
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
# === EDA ===
print(train_df.info())
print(train_df['target'].value_counts())
sns.countplot(data=train_df, x='target')
plt.title("Class Distribution")
plt.show()
train_df['text_length'] = train_df['text'].apply(len)
sns.histplot(train_df['text_length'], bins=50)
plt.title("Tweet Length Distribution")
plt.show()
# Check missing data
print(train_df.isnull().sum())
# Drop unused columns
train_df = train_df[['text', 'target']]
test_ids = test_df['id']
test_df = test_df[['text']]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        7613 non-null   int64
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
None
target
0    4342
1    3271
Name: count, dtype: int64
id                0
keyword          61
location       2533
text              0
target            0
text_length       0
dtype: int64
# 1b. Train/validation split (an assumed 80/20 stratified split; the exact split is not shown in the original)
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_df['text'].tolist(), train_df['target'].tolist(),
    test_size=0.2, random_state=42, stratify=train_df['target'])
# 2. Tokenize
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=128)
# 3. Dataset class
class TweetDataset(Dataset):
def __init__(self, encodings, labels):
self.encodings = encodings
self.labels = labels
def __getitem__(self, idx):
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
item['labels'] = torch.tensor(self.labels[idx])
return item
def __len__(self):
return len(self.labels)
train_dataset = TweetDataset(train_encodings, train_labels)
val_dataset = TweetDataset(val_encodings, val_labels)
# 4. Model setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)  # two-logit classification head
model.to(device)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# 5. Manual training loop with per-epoch validation
for epoch in range(3):
    model.train()
    total_loss = 0
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Train Loss: {total_loss/len(train_loader):.4f}")

    # Evaluate on the validation set at the end of each epoch
    model.eval()
    preds, true_labels = [], []
    with torch.no_grad():
        for batch in val_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            logits = outputs.logits
            preds.extend(torch.argmax(logits, dim=1).cpu().numpy())
            true_labels.extend(batch['labels'].cpu().numpy())
    acc = accuracy_score(true_labels, preds)
    print(f"Validation Accuracy: {acc:.4f}")
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1, Train Loss: 0.4402
Validation Accuracy: 0.8319
Epoch 2, Train Loss: 0.3039
Validation Accuracy: 0.8181
Epoch 3, Train Loss: 0.1903
Validation Accuracy: 0.8188
# === Kaggle Submission ===
# 1. Load test data
test_df = pd.read_csv("test.csv")
# 2. Tokenize
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
test_encodings = tokenizer(
test_df["text"].tolist(),
truncation=True,
padding=True,
max_length=128,
return_tensors="pt"
)
# 3. Dataset class
class TweetTestDataset(Dataset):
def __init__(self, encodings):
self.encodings = encodings
def __len__(self):
return len(self.encodings["input_ids"])
def __getitem__(self, idx):
return {key: val[idx] for key, val in self.encodings.items()}
test_dataset = TweetTestDataset(test_encodings)
# 4. Inference
model.eval()
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32)
all_preds = []
with torch.no_grad():
for batch in test_loader:
batch = {k: v.to(device) for k, v in batch.items()}
outputs = model(**batch)
logits = outputs.logits
preds = torch.argmax(logits, dim=1)
all_preds.extend(preds.cpu().numpy())
# 5. Format for Kaggle
submission = pd.DataFrame({
"id": test_df["id"],
"target": all_preds
})
submission.to_csv("submission.csv", index=False)
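As a final optional check before uploading (an assumed step, not in the original notebook), the submission file can be verified for shape and format:

print(submission.shape)    # expected (3263, 2): one row per test tweet
print(submission.head())   # columns: id, target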