Fine-Tuning BERT for NLP Tasks: GLUE and SQuAD Code Examples

Fine-tuning BERT for GLUE and SQuAD tasks requires adapting pre-trained models to specific datasets. The SST-2 task in GLUE has 67,349 training examples, while SQuAD contains 87,599 training samples for question answering.

Why This Matters

BERT’s pre-training alone does not guarantee good performance on a specific task. Fine-tuning adds a task-specific head and adapts the model to the downstream task, but a tokenizer that is poorly aligned with the pre-trained model, or pre-training on too small a corpus (e.g., WikiText-2), can lead to suboptimal results. In the SQuAD example, initial validation accuracy is just 4.5%, which illustrates the cost of inadequate pre-training.
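To make a figure like 4.5% concrete, here is a minimal sketch of one way accuracy could be computed for SQuAD-style span predictions. It assumes a prediction counts as correct only when both the start and end token positions match the gold span. The source does not say which metric it used, and the name `span_accuracy` and the toy spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_token_index, end_token_index), both inclusive


def span_accuracy(predicted: List[Span], gold: List[Span]) -> float:
    """Fraction of examples whose predicted span equals the gold span exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Toy data (hypothetical): only the first of four predictions matches exactly.
gold_spans = [(3, 5), (10, 10), (7, 9), (0, 2)]
pred_spans = [(3, 5), (10, 11), (6, 9), (1, 2)]
accuracy = span_accuracy(pred_spans, gold_spans)
print(f"exact-span accuracy: {accuracy:.2%}")
```

Under this strict definition, a near miss such as (10, 11) against (10, 10) scores zero, which is one reason early numbers on span tasks can look very low.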

Key Insights

“SST-2 task in GLUE has 67,349 training examples”: https://machinelearningmastery.com/fine-tuning-a-bert-model/
“SQuAD dataset contains 87,599 training samples”: https://machinelearningmastery.com/fine-tuning-a-bert-model/
“Hugging Face’s datasets library used for GLUE/SQuAD loading”: https://huggingface.co/docs/datasets/

Working Example

from datasets import load_dataset
from tokenizers import Tokenizer
import torch
import torch.nn as nn
import torch.optim as optim

# Load GLUE dataset (SST-2)
task = "sst2"
dataset = load_dataset("glue", task)
num_labels = 2

# Tokenizer setup: load the WordPiece tokenizer file (assumed to be the one used during pre-training)
tokenizer = Tokenizer.from_file("wikitext-2_wordpiece.json")

# Collate function for batching
def collate(batch, tokenizer, max_len):
    # [Implementation from context]
    pass

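# BERT model for sequence classification
# Note: BertModel and BertConfig are not defined in this snippet; they are assumed
# to come from the pre-training code that produced bert_model.pth.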
class BertForSequenceClassification(nn.Module):
    def __init__(self, config, num_labels):
        super().__init__()
        self.bert = BertModel(config)
        self.classifier = nn.Linear(config.hidden_size, num_labels)

    def forward(self, input_ids, pad_id=0):
        # [Implementation from context]
        pass

# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = BertConfig()
model = BertForSequenceClassification(config, num_labels).to(device)
model.bert.load_state_dict(torch.load("bert_model.pth", map_location=device))

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
num_epochs = 3

for epoch in range(num_epochs):
    model.train()
    # [Training loop from context]
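The collate and training-loop placeholders above can be illustrated with a small, self-contained sketch. This is not the source's implementation: `pad_batch`, `TinyEncoderClassifier`, and the toy token IDs are hypothetical stand-ins, and a mean-pooled embedding layer replaces the pre-trained BERT encoder so the example runs without a checkpoint. It shows variable-length sequences being padded to a common length, followed by the same AdamW and cross-entropy update pattern.

```python
import torch
import torch.nn as nn
import torch.optim as optim


def pad_batch(sequences, labels, max_len, pad_id=0):
    """Truncate to max_len, then right-pad each sequence to the longest one in the batch."""
    truncated = [seq[:max_len] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    padded = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    return (
        torch.tensor(padded, dtype=torch.long),
        torch.tensor(labels, dtype=torch.long),
    )


class TinyEncoderClassifier(nn.Module):
    """Hypothetical stand-in for a BERT encoder plus a classification head."""

    def __init__(self, vocab_size, hidden_size, num_labels, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_id)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        # Mask is 1.0 for real tokens and 0.0 for padding.
        mask = (input_ids != self.pad_id).unsqueeze(-1).float()
        summed = (self.embed(input_ids) * mask).sum(dim=1)
        pooled = summed / mask.sum(dim=1).clamp(min=1.0)  # mean over non-pad tokens
        return self.classifier(pooled)


torch.manual_seed(0)

# Toy token IDs and sentiment labels (hypothetical, not taken from SST-2).
sequences = [[5, 8, 2], [7, 3], [4, 9, 6, 1, 2], [3]]
labels = [1, 0, 1, 0]
input_ids, targets = pad_batch(sequences, labels, max_len=4)

model = TinyEncoderClassifier(vocab_size=16, hidden_size=8, num_labels=2)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=2e-5)

losses = []
for epoch in range(3):
    model.train()
    optimizer.zero_grad()
    logits = model(input_ids)        # shape: (batch, num_labels)
    loss = loss_fn(logits, targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

With a real dataset, `pad_batch` would be called once per mini-batch (for example, as a `DataLoader` collate function after tokenization), and the loop body would run once per batch rather than once per epoch.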

Practical Applications

Use Case: Sentiment analysis with GLUE (e.g., the SST-2 task)
Pitfall: Using a BERT model pre-trained on a small corpus (e.g., WikiText-2) leads to poor performance on complex tasks
