Binary Classification of Histopathologic Cancer Images using ResNet18
1. Introduction
In this project, we approached the binary classification problem of histopathologic scans for detecting metastatic cancer in lymph node sections using the PCam dataset. The dataset consists of image patches labeled as either containing tumor tissue (label 1) or not (label 0). Our goal was to develop a convolutional neural network (CNN) model that classifies unseen test images accurately and produces a Kaggle competition submission.
2. Dataset Description
Training images: 220,025 RGB .tif files, each of size 96x96, with corresponding labels provided in train_labels.csv.
Test images: 57,456 unlabeled .tif files.
Labels: Binary classification — 0 (no tumor), 1 (tumor).
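Train/Validation split: We used an 80/20 random split of the training set (random_state=42). The split call does not pass a stratify argument, so the class ratios of the two subsets match only approximately.
Source: Kaggle, https://www.kaggle.com/competitions/histopathologic-cancer-detection/data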
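If an explicitly stratified split is wanted, a minimal sketch could look like the following. The helper name stratified_split and the toy DataFrame are illustrative assumptions and are not part of the original notebook; only train_test_split and its stratify argument come from scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df, label_col="label", val_fraction=0.2, seed=42):
    """Split df so that train and validation keep the same class ratio."""
    train_df, val_df = train_test_split(
        df,
        test_size=val_fraction,
        random_state=seed,
        stratify=df[label_col],  # this argument is what makes the split stratified
    )
    return train_df.reset_index(drop=True), val_df.reset_index(drop=True)


# Toy stand-in for train_labels.csv (hypothetical ids, 40% positives)
toy_df = pd.DataFrame({
    "id": [f"img_{i}" for i in range(1000)],
    "label": [1] * 400 + [0] * 600,
})
toy_train, toy_val = stratified_split(toy_df)
print("positive rate (train):", toy_train["label"].mean())
print("positive rate (val):  ", toy_val["label"].mean())
```

With the real labels, the same call would be applied to the DataFrame loaded from train_labels.csv.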
3. Data Preparation & EDA
We created a custom PyTorch Dataset class (PCamDataset) that loads the image file from the disk and maps it to its corresponding label using the CSV. A separate TestPCamDataset was used for the unlabeled test images.
Transformations: The baseline pipeline only converts each image to a tensor (ToTensor), with no augmentation, to keep training fast.
For EDA, we plotted the label distribution and displayed five sample images with their labels.
4. Model Architecture
We first built two CNNs from scratch, a simple baseline (SimpleCNN) and an improved version (SmartCNN), but in the end switched to a pretrained ResNet18, which scored better on the validation set.
We used the pretrained ResNet18 model from torchvision.models and replaced its final fully connected layer with a two-output layer to match our binary task.
This transfer learning approach lets us reuse ImageNet feature extractors. Note that no layers are frozen in our code: the optimizer receives all model parameters, so the whole network is fine-tuned, not only the new classifier head.
5. Training Setup
Loss function: CrossEntropyLoss
Optimizer: Adam with learning rate 0.0005
Epochs: 10
Device: CUDA if available, else CPU
Batch size: 64
No early stopping or LR scheduling used
6. Evaluation Metrics
Accuracy, Precision, Recall, F1-Score on validation set
Confusion Matrix
ROC Curve and AUC
Results on validation set (sample output):
Accuracy: 0.9625
F1-score (macro avg): 0.9612
However, on Kaggle test set:
Public Score: 0.7348
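The validation figures above are accuracy and F1, whereas the Kaggle leaderboard for this competition is, as far as we understand its rules, scored by area under the ROC curve. The toy example below uses made-up scores (not results from this project) to show that the two metrics can differ on the same predictions, and that thresholding probabilities into hard labels discards ranking information that AUC rewards.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels and class-1 probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

# Hard labels at the usual 0.5 threshold
hard_labels = (probs >= 0.5).astype(int)

toy_accuracy = accuracy_score(y_true, hard_labels)
auc_from_probs = roc_auc_score(y_true, probs)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(f"accuracy:             {toy_accuracy:.4f}")
print(f"AUC on probabilities: {auc_from_probs:.4f}")
print(f"AUC on hard labels:   {auc_from_labels:.4f}")
```

In this toy case the AUC computed from probabilities is higher than the AUC computed from thresholded labels, which is one reason to compare like with like when reading the leaderboard score.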
7. Test Set Predictions
The final model was run on the unlabeled test set using a simplified transform (no augmentation), and predictions were generated as follows:
outputs = model(images)
_, preds = torch.max(outputs, 1)
# `predictions` holds the (id, label) pairs collected over all test batches
submission_df = pd.DataFrame(predictions, columns=['id', 'label'])
submission_df.to_csv("submission.csv", index=False)
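A fuller version of the prediction step might look like the sketch below. It assumes a loader that yields (images, image_ids) batches and writes class-1 probabilities instead of argmax labels, which is a design choice we have not verified against the leaderboard. The function name build_submission, the tiny stand-in model, and the fake batches are all hypothetical.

```python
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_submission(model, loader, device):
    """Collect class-1 probabilities for every test image into a DataFrame."""
    model.eval()
    ids, scores = [], []
    with torch.no_grad():
        for images, image_ids in loader:
            logits = model(images.to(device))
            probs = F.softmax(logits, dim=1)[:, 1]  # probability of the tumor class
            ids.extend(image_ids)
            scores.extend(probs.cpu().tolist())
    return pd.DataFrame({"id": ids, "label": scores})


# Stand-ins so the sketch runs without the real model or data
demo_device = torch.device("cpu")
tiny_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 96 * 96, 2))
fake_loader = [
    (torch.zeros(2, 3, 96, 96), ["a", "b"]),
    (torch.zeros(1, 3, 96, 96), ["c"]),
]
demo_submission = build_submission(tiny_model, fake_loader, demo_device)
print(demo_submission)
# demo_submission.to_csv("submission.csv", index=False)
```

With the real pipeline, the fine-tuned ResNet18 and a DataLoader over the test folder would take the place of the stand-ins.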
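8. Discussion & Limitations
Although the model performed well on validation data (96% accuracy), its Kaggle public score was only 0.7348. The two numbers are not measured on the same scale: to our understanding the leaderboard metric is area under the ROC curve, while 96% is validation accuracy. Even so, the gap is large and suggests potential overfitting or data leakage, or possibly: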
Insufficient augmentation or regularization
Train/val split mismatch vs. Kaggle hidden test distribution
Model not capturing subtle texture patterns without deeper architecture
9. Conclusion
We demonstrated a working ResNet18 classifier using PyTorch for cancer detection on PCam data. While the model generalized moderately to the test set, future work should include:
Stratified k-fold cross-validation
More aggressive augmentation
Advanced architectures (e.g., EfficientNet, DenseNet)
Training controls such as early stopping and learning-rate scheduling
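As one possible shape for the early stopping mentioned above, here is a minimal patience-based sketch. The EarlyStopping class, the epochs_run helper, and the simulated loss values are illustrative assumptions and were not part of the training runs reported here.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience=3):
    """Return how many epochs would run before the stopper fires."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(val_losses)


# Simulated validation losses (made up): best at epoch 3, then three worse epochs
simulated_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
stopped_at = epochs_run(simulated_losses, patience=3)
print("training would stop after epoch", stopped_at)
```

In the training loop, stopper.step(val_loss) would be called once per epoch after validation. For learning-rate scheduling, torch.optim.lr_scheduler.ReduceLROnPlateau follows a similar patience-based idea.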
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="torchvision.models._utils")
import os
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
from torchvision import transforms
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
import seaborn as sns
# Paths
train_labels_path = 'train_labels.csv'
train_folder = 'train'
test_folder = 'test'
# Load training labels
df = pd.read_csv(train_labels_path)
# Custom Dataset Class
class PCamDataset(Dataset):
    def __init__(self, dataframe, image_folder, transform=None):
        self.dataframe = dataframe
        self.image_folder = image_folder
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        image_id = self.dataframe.iloc[idx, 0]
        label = self.dataframe.iloc[idx, 1]
        img_path = os.path.join(self.image_folder, image_id + '.tif')
        image = Image.open(img_path)
        if self.transform:
            image = self.transform(image)
        return image, label
# Transforms
transform = transforms.Compose([
    transforms.ToTensor()
])
# Train-validation split
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
# Datasets
train_dataset = PCamDataset(train_df, train_folder, transform=transform)
val_dataset = PCamDataset(val_df, train_folder, transform=transform)
# DataLoaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64)
# EDA: Show label distribution
label_counts = df['label'].value_counts()
sns.barplot(x=label_counts.index, y=label_counts.values)
plt.title('Label Distribution')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()
sample_images = df.sample(5, random_state=42)
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, (idx, row) in zip(axes, sample_images.iterrows()):
    image_id = row['id']
    label = row['label']
    image_path = os.path.join(train_folder, image_id + '.tif')
    image = Image.open(image_path)
    ax.imshow(image)
    ax.set_title(f"Label: {label}")
    ax.axis('off')
plt.tight_layout()
plt.show()
import torch.nn.functional as F
# Define the CNN architecture
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # Dummy forward to infer flattened size
        dummy_input = torch.zeros(1, 3, 96, 96)  # Adjust this to match your actual image size
        x = self.pool(F.relu(self.conv1(dummy_input)))
        x = self.pool(F.relu(self.conv2(x)))
        self.flattened_size = x.view(1, -1).shape[1]
        self.fc1 = nn.Linear(self.flattened_size, 128)
        self.fc2 = nn.Linear(128, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
# Initialize model, loss, optimizer
model = SimpleCNN()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
train_losses, val_losses = [], []
train_accuracies, val_accuracies = [], []
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
        _, preds = torch.max(outputs, 1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    train_losses.append(running_loss / total)
    train_accuracies.append(correct / total)
    # Validation
    model.eval()
    val_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            val_loss += loss.item() * images.size(0)
            _, preds = torch.max(outputs, 1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    val_losses.append(val_loss / total)
    val_accuracies.append(correct / total)
# Plot training and validation accuracy/loss over time
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
axs[0].plot(train_losses, label="Train Loss")
axs[0].plot(val_losses, label="Val Loss")
axs[0].set_title("Loss over Epochs")
axs[0].set_xlabel("Epoch")
axs[0].set_ylabel("Loss")
axs[0].legend()
axs[1].plot(train_accuracies, label="Train Accuracy")
axs[1].plot(val_accuracies, label="Val Accuracy")
axs[1].set_title("Accuracy over Epochs")
axs[1].set_xlabel("Epoch")
axs[1].set_ylabel("Accuracy")
axs[1].legend()
plt.tight_layout()
plt.show()
Below we implement an improved CNN (SmartCNN) that adds a third convolutional layer, batch normalization, dropout, and global average pooling.
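As a quick sanity check on feature-map sizes, the sketch below traces how 2x2 max pooling with stride 2 shrinks a 96x96 input. The helper pooled_sizes is hypothetical. It relies on the fact that the 3x3 convolutions with padding=1 used in both networks leave the spatial size unchanged, so only the pooling layers matter.

```python
def pooled_sizes(size, num_pools, kernel=2, stride=2):
    """Spatial size after each of num_pools max-pooling layers (no padding)."""
    sizes = [size]
    for _ in range(num_pools):
        size = (size - kernel) // stride + 1
        sizes.append(size)
    return sizes


# SmartCNN applies three pooling layers before global average pooling
smart_sizes = pooled_sizes(96, num_pools=3)

# SimpleCNN applies two pooling layers, then flattens 64 channels
simple_sizes = pooled_sizes(96, num_pools=2)
simple_flattened = 64 * simple_sizes[-1] ** 2

print("SmartCNN sizes:", smart_sizes)
print("SimpleCNN flattened features:", simple_flattened)
```

For 96x96 patches this gives 48, 24 and 12 after the three pooling stages, so the global average pool in SmartCNN reduces a 12x12 map to 1x1.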
import torch.nn as nn
import torch.nn.functional as F
class SmartCNN(nn.Module):
    def __init__(self):
        super(SmartCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool = nn.MaxPool2d(2, 2)
        self.dropout = nn.Dropout(0.4)
        # Global Average Pool to reduce to (batch_size, 128, 1, 1)
        self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
        # Fully connected layer: 128 → 2 (binary classification)
        self.fc = nn.Linear(128, 2)

    def forward(self, x):
        # Spatial sizes below assume 96x96 input patches
        x = self.pool(F.relu(self.bn1(self.conv1(x))))  # (96x96 → 48x48)
        x = self.pool(F.relu(self.bn2(self.conv2(x))))  # (48x48 → 24x24)
        x = self.pool(F.relu(self.bn3(self.conv3(x))))  # (24x24 → 12x12)
        x = self.global_pool(x)  # (12x12 → 1x1)
        x = x.view(x.size(0), -1)  # flatten to (batch_size, 128)
        x = self.dropout(x)
        x = self.fc(x)  # final logits
        return x
# Initialize model, loss, optimizer
model = SmartCNN()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
train_losses, val_losses = [], []
train_accuracies, val_accuracies = [], []
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
        _, preds = torch.max(outputs, 1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    train_losses.append(running_loss / total)
    train_accuracies.append(correct / total)
    # Validation
    model.eval()
    val_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            val_loss += loss.item() * images.size(0)
            _, preds = torch.max(outputs, 1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    val_losses.append(val_loss / total)
    val_accuracies.append(correct / total)
# Plot training and validation accuracy/loss over time
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
axs[0].plot(train_losses, label="Train Loss")
axs[0].plot(val_losses, label="Val Loss")
axs[0].set_title("Loss over Epochs")
axs[0].set_xlabel("Epoch")
axs[0].set_ylabel("Loss")
axs[0].legend()
axs[1].plot(train_accuracies, label="Train Accuracy")
axs[1].plot(val_accuracies, label="Val Accuracy")
axs[1].set_title("Accuracy over Epochs")
axs[1].set_xlabel("Epoch")
axs[1].set_ylabel("Accuracy")
axs[1].legend()
plt.tight_layout()
plt.show()
Below we plot the confusion matrix of the SmartCNN on the validation set.
from sklearn.metrics import confusion_matrix
import seaborn as sns
# Gather all validation predictions and true labels
all_preds, all_labels = [], []
model.eval()
with torch.no_grad():
    for images, labels in val_loader:
        outputs = model(images.to(device))
        _, preds = torch.max(outputs, 1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.numpy())
cm = confusion_matrix(all_labels, all_preds)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()
Below we print the classification report for the SmartCNN on the validation set.
from sklearn.metrics import classification_report
print(classification_report(all_labels, all_preds, digits=4))
              precision    recall  f1-score   support
           0     0.9141    0.9473    0.9304     26177
           1     0.9183    0.8693    0.8931     17828
    accuracy                         0.9157     44005
   macro avg     0.9162    0.9083    0.9118     44005
weighted avg     0.9158    0.9157    0.9153     44005
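The macro and weighted averages in this report can be reproduced from the per-class rows. The short check below does this for the F1 column, using only the numbers printed above; the variable names are ours.

```python
# Per-class F1 and support copied from the SmartCNN classification report
f1_per_class = {0: 0.9304, 1: 0.8931}
support = {0: 26177, 1: 17828}

# Macro average: unweighted mean over classes
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: mean over classes weighted by the number of samples per class
total_support = sum(support.values())
weighted_f1 = sum(f1_per_class[c] * support[c] for c in f1_per_class) / total_support

print(f"macro F1:    {macro_f1:.5f}")
print(f"weighted F1: {weighted_f1:.5f}")
```

Both values agree with the report to within rounding. The weighted average sits above the macro average because class 0 is both the larger class and the better-scoring one.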
Below we plot the ROC curve and report the AUC on the validation set.
from sklearn.metrics import roc_curve, auc
# For ROC, need probability scores (for class 1)
probs = []
model.eval()
with torch.no_grad():
    for images, _ in val_loader:
        outputs = model(images.to(device))
        probs.extend(F.softmax(outputs, dim=1)[:,1].cpu().numpy())
fpr, tpr, _ = roc_curve(all_labels, probs)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.4f}")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC Curve")
plt.legend()
plt.grid(True)
plt.show()
Below we display five sample validation images with their predicted and true labels.
import random
model.eval()
with torch.no_grad():
    batch = next(iter(val_loader))
    images, labels = batch
    outputs = model(images.to(device))
    _, preds = torch.max(outputs, 1)
fig, axs = plt.subplots(1, 5, figsize=(15, 3))
for i in range(5):
    axs[i].imshow(images[i].permute(1, 2, 0))
    axs[i].set_title(f"Pred: {preds[i].item()} | True: {labels[i].item()}")
    axs[i].axis('off')
plt.tight_layout()
plt.show()
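Next, we fine-tune a pretrained ResNet18 for 10 epochs. An augmentation pipeline is also defined in this cell, but it only takes effect if the datasets and loaders are rebuilt with it; the loaders created earlier still use the ToTensor-only transform.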
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
from torchvision import transforms
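# Use pretrained ResNet18 and replace the classifier
# (weights=... replaces the deprecated pretrained=True and loads the same ImageNet weights)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 2) # Binary classification
# Data augmentation for training (no normalization is applied here)
# Note: train_loader was built earlier with the ToTensor-only transform;
# rebuild the datasets and loaders with this transform for it to take effect
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor()
])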
# Initialize model, loss, optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
# Training loop
train_losses, val_losses = [], []
train_accuracies, val_accuracies = [], []
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
        _, preds = torch.max(outputs, 1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    train_losses.append(running_loss / total)
    train_accuracies.append(correct / total)
    # Validation
    model.eval()
    val_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            val_loss += loss.item() * images.size(0)
            _, preds = torch.max(outputs, 1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    val_losses.append(val_loss / total)
    val_accuracies.append(correct / total)
# Plot training and validation accuracy/loss over time
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
axs[0].plot(train_losses, label="Train Loss")
axs[0].plot(val_losses, label="Val Loss")
axs[0].set_title("Loss over Epochs")
axs[0].set_xlabel("Epoch")
axs[0].set_ylabel("Loss")
axs[0].legend()
axs[1].plot(train_accuracies, label="Train Accuracy")
axs[1].plot(val_accuracies, label="Val Accuracy")
axs[1].set_title("Accuracy over Epochs")
axs[1].set_xlabel("Epoch")
axs[1].set_ylabel("Accuracy")
axs[1].legend()
plt.tight_layout()
plt.show()
# Gather all validation predictions and true labels for confusion matrix and classification report
all_preds, all_labels = [], []
model.eval()
with torch.no_grad():
    for images, labels in val_loader:
        outputs = model(images.to(device))
        _, preds = torch.max(outputs, 1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.numpy())
# Classification report
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
print(classification_report(all_labels, all_preds, digits=4))
cm = confusion_matrix(all_labels, all_preds)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()
#ROC
# For ROC, need probability scores (for class 1)
probs = []
model.eval()
with torch.no_grad():
    for images, _ in val_loader:
        outputs = model(images.to(device))
        probs.extend(F.softmax(outputs, dim=1)[:,1].cpu().numpy())
fpr, tpr, _ = roc_curve(all_labels, probs)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.4f}")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC Curve")
plt.legend()
plt.grid(True)
plt.show()
              precision    recall  f1-score   support
           0     0.9690    0.9681    0.9685     26177
           1     0.9532    0.9545    0.9538     17828
    accuracy                         0.9625     44005
   macro avg     0.9611    0.9613    0.9612     44005
weighted avg     0.9626    0.9625    0.9626     44005