Disaster Tweets Classification Using DistilBERT

Introduction

This project tackles the Kaggle competition Natural Language Processing with Disaster Tweets, where the goal is to classify each tweet as describing a real disaster (1) or not (0). We fine-tuned DistilBERT, a compact Transformer-based language model distilled from BERT, to learn from the tweet text with minimal preprocessing.

Data Overview & EDA

The dataset consists of 7,613 labeled tweets, with each sample containing:

text: the content of the tweet

keyword and location: additional metadata (not used in our model)

target: binary label (1 = disaster, 0 = non-disaster)

EDA Observations:

Class imbalance is mild (57% non-disaster, 43% disaster)

Tweet lengths are mostly under 100 characters

Some values are missing in keyword (61) and location (2,533); we dropped both columns
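The class shares quoted above follow from the label counts printed by value_counts() in the EDA cell (4,342 and 3,271). As a quick check, the sketch below recomputes them and derives the accuracy an always-predict-the-majority classifier would reach on the training labels. The variable names are illustrative and do not come from the notebook.

```python
# Label counts taken from the notebook's value_counts() output.
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# A classifier that always predicts the majority class is correct
# exactly as often as that class occurs.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```

This gives a floor to compare the validation accuracies against: any useful model should clear the majority-class rate by a comfortable margin.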

Data can be downloaded on Kaggle here: https://www.kaggle.com/competitions/nlp-getting-started/data

Preprocessing & Tokenization

We used Hugging Face’s DistilBertTokenizerFast to:

Truncate tweets to at most 128 tokens and pad shorter ones to the longest sequence in the batch (padding=True)

Convert text into model-ready token IDs

Retain attention masks for downstream modeling
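To make the padding and attention-mask step concrete, here is a minimal pure-Python sketch of what happens to a batch of token IDs. It is not the Hugging Face implementation: the function name, the toy token IDs, and pad_id=0 are assumptions for illustration, and a real tokenizer also keeps the closing special token when it truncates, whereas this sketch simply slices.

```python
def pad_and_truncate(batch_token_ids, max_length=128, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch_token_ids]
    target_len = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Toy token IDs, made up for illustration.
batch = [[101, 7, 8, 102], [101, 9, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=128)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model treat padded positions as absent, so short and long tweets can share one rectangular tensor.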

Model Architecture

We fine-tuned DistilBertForSequenceClassification with a binary classification head:

Initialized from the pretrained distilbert-base-uncased checkpoint

Final layer outputs two logits for class 0 and 1

Model trained with AdamW optimizer (lr=5e-5)
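The two logits from the classification head become a prediction by taking the larger one; a softmax turns them into class probabilities when a confidence value is wanted. The sketch below shows that step in plain Python. The logit values are hypothetical and the helper names are not from the notebook, which uses torch.argmax directly.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (predicted_class, probabilities) for one example."""
    probs = softmax(logits)
    predicted_class = max(range(len(probs)), key=probs.__getitem__)
    return predicted_class, probs


example_logits = [-0.8, 1.3]  # hypothetical logits for class 0 and class 1
predicted_class, probs = predict(example_logits)
print(predicted_class, [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits and of the probabilities gives the same class.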

Training Procedure

3 training epochs using a manual PyTorch loop

Batch size: 16 (train), 64 (validation)

Model trained on GPU via CUDA

Validation accuracy tracked each epoch

Training Results:

Epoch 1: Train Loss = 0.4402, Val Accuracy = 0.8319

Epoch 2: Train Loss = 0.3039, Val Accuracy = 0.8181

Epoch 3: Train Loss = 0.1903, Val Accuracy = 0.8188

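Note: Validation accuracy peaked at Epoch 1 and dipped in Epochs 2 and 3 while training loss kept falling, which points to mild overfitting, possibly from under-regularization. With a small validation set, differences of this size may also be noise.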
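One way to act on this pattern is to keep the checkpoint from the best validation epoch and stop once the score stops improving. The sketch below applies that rule to the three validation accuracies reported above. The function name and the patience value are assumptions for illustration; the notebook itself trains for a fixed three epochs and saves no checkpoints.

```python
def best_epoch_with_patience(val_scores, patience=1):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once the score has failed to improve for
    `patience` consecutive epochs, or when the scores run out.
    """
    best_score, best_epoch, bad_epochs = float("-inf"), 0, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_scores)


# Validation accuracies from the training log above.
val_accuracy = [0.8319, 0.8181, 0.8188]
best_epoch, stop_epoch = best_epoch_with_patience(val_accuracy, patience=1)
print(f"best epoch: {best_epoch}, training would stop after epoch {stop_epoch}")
```

With these numbers the rule selects Epoch 1 as the checkpoint to keep, whatever the patience setting.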

Test-Time Inference

We encoded the 3,263 unlabeled test tweets with the same tokenizer and ran the fine-tuned model in evaluation mode to generate predictions. These were saved to submission.csv in Kaggle's submission format (id, target).

Kaggle Result

Leaderboard Score: 0.81244 (Kaggle scores this competition by F1, not accuracy)

This result is consistent with a well-initialized but lightly tuned transformer baseline, and it exceeds random and naive baseline classifiers.
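Kaggle evaluates this competition with the F1 score of the positive (disaster) class, so the leaderboard number is not directly comparable to the validation accuracy tracked during training. The sketch below computes both metrics on a small made-up set of labels to show how they can differ. The labels and the function name are illustrative only.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration, not notebook outputs.
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_score_binary(y_true, y_pred)
print(f"accuracy = {accuracy:.4f}, F1 = {f1:.4f}")
```

Tracking F1 on the validation set (for example with scikit-learn's f1_score) would make the per-epoch numbers line up with the leaderboard metric.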

Discussion & Next Steps

The result suggests that a transformer-based model can reach a competitive score with minimal tuning. We did not train RNN or TF-IDF baselines here, so a direct comparison with those methods remains to be done. Several improvements are possible:

Longer training with early stopping

Learning rate scheduling

Using auxiliary data (keyword, location)

Advanced techniques like ensembling or threshold tuning
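As an illustration of the learning rate scheduling item, the sketch below computes a linear warmup followed by linear decay, starting from the base rate of 5e-5 used in this notebook. The function name, the step counts, and the warmup length are hypothetical. In practice the transformers library provides get_linear_schedule_with_warmup, which applies the same shape to a PyTorch optimizer.

```python
def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step` (0-indexed): linear warmup, then linear decay to 0."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run: 1,000 optimizer steps with 100 warmup steps.
total_steps, warmup_steps = 1000, 100
for step in (0, 50, 100, 550, 1000):
    lr = linear_schedule_lr(step, total_steps, warmup_steps=warmup_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The rate ramps up to the base value at the end of warmup and then falls to zero at the final step, which tends to stabilise fine-tuning compared with a constant rate.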

In [3]:
# Install dependencies (if needed)
# !pip install transformers datasets seaborn scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

# Load data
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# === EDA ===
print(train_df.info())
print(train_df['target'].value_counts())
sns.countplot(data=train_df, x='target')
plt.title("Class Distribution")
plt.show()

train_df['text_length'] = train_df['text'].apply(len)
sns.histplot(train_df['text_length'], bins=50)
plt.title("Tweet Length Distribution")
plt.show()

# Check missing data
print(train_df.isnull().sum())

# Drop unused columns
train_df = train_df[['text', 'target']]
test_ids = test_df['id']
test_df = test_df[['text']]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
None
target
0    4342
1    3271
Name: count, dtype: int64
[Figure: Class Distribution (count plot of target)]
[Figure: Tweet Length Distribution (histogram of text_length, 50 bins)]
id                0
keyword          61
location       2533
text              0
target            0
text_length       0
dtype: int64
In [6]:
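# 1. Train/validation split
# NOTE: this step is missing from the exported cell. The test_size, random_state,
# and stratify settings below are assumptions, not the settings behind the reported results.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_df['text'].tolist(),
    train_df['target'].tolist(),
    test_size=0.2,
    random_state=42,
    stratify=train_df['target'],
)

# 2. Tokenize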
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=128)

# 3. Dataset class
class TweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TweetDataset(train_encodings, train_labels)
val_dataset = TweetDataset(val_encodings, val_labels)

# 4. Model setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
model.to(device)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# 5. Manual training loop
for epoch in range(3):
    model.train()
    total_loss = 0
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1}, Train Loss: {total_loss/len(train_loader):.4f}")

    # Evaluation
    model.eval()
    preds, true_labels = [], []
    with torch.no_grad():
        for batch in val_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            logits = outputs.logits
            preds.extend(torch.argmax(logits, dim=1).cpu().numpy())
            true_labels.extend(batch['labels'].cpu().numpy())

    acc = accuracy_score(true_labels, preds)
    print(f"Validation Accuracy: {acc:.4f}")
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1, Train Loss: 0.4402
Validation Accuracy: 0.8319
Epoch 2, Train Loss: 0.3039
Validation Accuracy: 0.8181
Epoch 3, Train Loss: 0.1903
Validation Accuracy: 0.8188
In [12]:
# === Kaggle Submission ===

# 1. Load test data
test_df = pd.read_csv("test.csv")

# 2. Tokenize
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
test_encodings = tokenizer(
    test_df["text"].tolist(),
    truncation=True,
    padding=True,
    max_length=128,
    return_tensors="pt"
)

# 3. Dataset class
class TweetTestDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}

test_dataset = TweetTestDataset(test_encodings)

# 4. Inference
model.eval()
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32)
all_preds = []

with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)
        all_preds.extend(preds.cpu().numpy())

# 5. Format for Kaggle
submission = pd.DataFrame({
    "id": test_df["id"],
    "target": all_preds
})
submission.to_csv("submission.csv", index=False)