Analyzing Representation Transfer and Attention in Facial Expression Recognition
This project studies how representation transfer changes both accuracy and attention behavior in facial expression recognition (FER). By comparing CNNs trained from scratch, transfer-learning pipelines, and Vision Transformers on FER2013, the work asks not only which model performs best, but also which facial regions each model actually relies on when predicting emotion.
Beyond standard FER benchmarking, the project introduces a quantitative attention analysis framework and an attention-guided training objective to measure and improve interpretability. The main goal is to connect recognition performance with semantically meaningful visual evidence rather than treating explanation quality as an afterthought.
🔗 Project Repository:
https://github.com/RohitPoduval1/csci5527-project
Problem Motivation
Facial expression recognition models often achieve strong classification performance, but it remains unclear:
- Which facial regions models rely on
- How pretrained representations influence attention
- Whether models attend to semantically meaningful facial features
This project studies the relationship between:
- representation transfer
- model architecture
- attention behavior
- recognition performance
Key questions:
- Does pretraining improve emotion-specific representations?
- Do transformers attend more globally than CNNs?
- Can attention regularization improve interpretability?
Dataset
We use the FER2013 dataset, a widely used benchmark for facial expression recognition.
Dataset Statistics
- 35,887 images
- 7 emotion classes
Emotion categories:
- Angry
- Disgust
- Fear
- Happy
- Sad
- Surprise
- Neutral
Image Characteristics
- grayscale facial images
- resolution: 48 × 48
- noisy labels
- large variations in lighting, occlusion, and facial pose
Data Preprocessing
Image Processing
Input images are converted and resized to match pretrained model requirements.
48x48 grayscale
→ convert to RGB
→ resize to 224x224
Normalization
Images are normalized using ImageNet statistics:
\[x' = \frac{x - \mu}{\sigma}\]
Data Augmentation
To improve generalization:
- random horizontal flip
- random rotation
- random crop
- color jitter
- Gaussian noise
Optional robustness techniques:
- label smoothing
- MixUp augmentation
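Of the optional techniques, MixUp is the less standard one; a minimal sketch (with an assumed `alpha` and the 7 FER classes) blends random pairs of images together with their one-hot labels:

```python
import torch

def mixup(x, y, num_classes=7, alpha=0.2):
    """MixUp: convexly blend random pairs of images and their one-hot labels.

    x: batch of images (B, C, H, W); y: integer labels (B,).
    alpha controls the Beta distribution the mixing weight is drawn from.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    y_onehot = torch.nn.functional.one_hot(y, num_classes).float()
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed
```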
Facial Landmark Detection
To analyze attention behavior, facial landmarks are detected using:
- MediaPipe
- dlib
The face is segmented into semantic regions:
- eyes
- eyebrows
- mouth
- nose
- face contour
- background
Binary masks are generated for each region, enabling quantitative measurement of attention distributions.
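Given landmark coordinates from MediaPipe or dlib, the per-region binary masks can be sketched as follows. This simplified version fills the bounding box of a region's landmarks; a polygon fill over the landmark contour would be tighter but needs an extra dependency such as OpenCV.

```python
import numpy as np

def region_mask(landmarks, shape):
    """Binary mask covering the bounding box of one region's landmarks.

    landmarks: list of (x, y) pixel coordinates for a facial region
    (e.g. the mouth), as produced by a landmark detector.
    shape: (height, width) of the image.
    """
    mask = np.zeros(shape, dtype=np.uint8)
    xs = [int(x) for x, _ in landmarks]
    ys = [int(y) for _, y in landmarks]
    mask[min(ys):max(ys) + 1, min(xs):max(xs) + 1] = 1
    return mask

# The background mask is everything not covered by any facial region:
# background = 1 - np.clip(sum(all_region_masks), 0, 1)
```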
Model Architectures
We train four model types to study the impact of representation transfer.
Model 1 — CNN Baseline
A simple CNN trained from scratch.
Architecture:
Conv → ReLU → MaxPool
Conv → ReLU → MaxPool
Conv → ReLU → MaxPool
Fully Connected
Softmax
Purpose:
- establish baseline performance
- demonstrate limitations of training from scratch
Expected accuracy:
60–65%
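The baseline sketch above maps directly to a small PyTorch module. Channel widths here are illustrative choices, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class BaselineCNN(nn.Module):
    """Three Conv -> ReLU -> MaxPool stages followed by a fully connected
    classifier, matching the architecture sketch above."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 12 -> 6
        )
        self.classifier = nn.Linear(128 * 6 * 6, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))  # raw logits; softmax lives in the loss
```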
Model 2 — CNN with ImageNet Transfer
We evaluate transfer learning using pretrained CNN backbones.
Example architectures:
- ResNet50
- EfficientNet
Training procedure:
- load pretrained network
- replace the final classification layer (FC → 7 emotion classes)
- train classifier head
- fine-tune upper layers
Expected accuracy:
~70%
Model 3 — VGGFace Transfer
We test domain-specific transfer learning using models pretrained on face recognition.
Example backbone:
- VGGFace
Hypothesis:
Face-recognition pretraining may suppress expression features because it focuses on identity rather than emotion.
Model 4 — Vision Transformer
We compare CNNs with transformer-based vision models.
Example model:
timm.create_model("vit_base_patch16_224", pretrained=True)
Training Setup
Training configuration:
optimizer: Adam
batch size: 64
epochs: 30
learning rate: 1e-4
Classification loss:
\[L = - \sum_i y_i \log(p_i)\]
Regularization techniques:
- dropout
- weight decay
- label smoothing
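The training configuration and regularization above fit in a few lines. The weight-decay and smoothing coefficients are illustrative assumptions; dropout lives inside the model definitions themselves:

```python
import torch
import torch.nn as nn

def make_training_objects(model):
    """Optimizer and loss matching the configuration above."""
    # Adam at lr=1e-4; weight decay supplies L2 regularization.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
    # Cross-entropy L = -sum_i y_i log(p_i), with label smoothing.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    return optimizer, criterion
```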
Explainability Analysis
To understand model attention behavior, we apply explainability methods.
CNN Models
- Grad-CAM
- Guided Grad-CAM
These methods produce spatial heatmaps highlighting image regions influencing predictions.
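Grad-CAM can be sketched with two hooks on a chosen convolutional layer: the class-score gradients are averaged over space and used to weight that layer's activations. This is a minimal version; library implementations add smoothing and upsampling to input resolution.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Minimal Grad-CAM: weight the target layer's activations by the
    spatially averaged gradients of the class score, then apply ReLU.

    image: (1, C, H, W) tensor; returns an (H', W') heatmap in [0, 1]
    at the target layer's spatial resolution.
    """
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(image)
        model.zero_grad()
        logits[0, class_idx].backward()
    finally:
        h1.remove()
        h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # (1, K, 1, 1)
    cam = F.relu((weights * acts["a"]).sum(dim=1))[0]    # (H', W')
    return cam / (cam.max() + 1e-8)                      # normalize to [0, 1]
```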
Transformer Models
For Vision Transformers we analyze:
- attention rollout
- self-attention maps
Self-attention is computed as:
\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V\]
These visualizations allow direct comparison between CNN and transformer attention behavior.
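Attention rollout accumulates attention across layers by averaging over heads, adding the identity to account for residual connections, and multiplying the per-layer matrices. A NumPy sketch:

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout: average attention over heads, mix in the identity
    for residual connections, renormalize rows, and multiply the per-layer
    matrices from input to output.

    attentions: list of (heads, tokens, tokens) arrays, one per layer.
    Returns a (tokens, tokens) matrix of accumulated attention.
    """
    tokens = attentions[0].shape[-1]
    rollout = np.eye(tokens)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)                # average over heads
        a = 0.5 * a + 0.5 * np.eye(tokens)         # residual connection
        a = a / a.sum(axis=-1, keepdims=True)      # renormalize rows
        rollout = a @ rollout
    return rollout
```

For a ViT, the first row (the CLS token) of the result, reshaped over the patch grid, gives the spatial heatmap compared against the CNN Grad-CAM maps.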
Quantitative Attention Analysis (Novel Component)
Instead of relying solely on visual heatmaps, we introduce a quantitative attention metric.
For each facial region:
\[\mathrm{Attention}_{region} = \frac{\sum_{pixels} \mathrm{Heatmap} \times \mathrm{Mask}}{\sum \mathrm{Heatmap}}\]
Regions analyzed:
- eyes
- mouth
- eyebrows
- background
Example comparison:
| Model | Mouth | Eyes | Background |
|---|---|---|---|
| CNN baseline | 28% | 19% | 41% |
| CNN ImageNet | 35% | 27% | 25% |
| VGGFace | 12% | 40% | 26% |
| ViT | 30% | 30% | 15% |
This provides objective evaluation of model interpretability.
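The per-region metric reduces to a few lines; a small epsilon is added here purely for numerical safety:

```python
import numpy as np

def region_attention(heatmap, mask):
    """Fraction of total heatmap mass inside a binary region mask:
    Attention_region = sum(heatmap * mask) / sum(heatmap)."""
    heatmap = np.asarray(heatmap, dtype=float)
    return float((heatmap * mask).sum() / (heatmap.sum() + 1e-8))
```

Evaluating this for each region mask of each model yields rows like those in the table above.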
Attention-Guided Training (Novel Component)
We further propose an attention regularization loss encouraging models to focus on relevant facial regions.
Modified training objective:
\[L = L_{cls} + \lambda L_{attention}\]
where
\[L_{attention} = \sum_{background} \mathrm{Heatmap}\]
Purpose:
- penalize background attention
- encourage focus on meaningful facial features
This improves both interpretability and robustness.
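The combined objective can be sketched as below. Normalizing each heatmap first (an implementation choice, not dictated by the formula) keeps the penalty a fraction in [0, 1]; `lam` is a tunable weight:

```python
import torch

def attention_guided_loss(logits, labels, heatmap, background_mask, lam=0.1):
    """L = L_cls + lambda * L_attention, where L_attention is the
    (normalized) heatmap mass falling on background pixels.

    heatmap: (B, H, W) attention or Grad-CAM maps;
    background_mask: (B, H, W) binary masks marking non-facial pixels.
    """
    l_cls = torch.nn.functional.cross_entropy(logits, labels)
    total = heatmap.flatten(1).sum(dim=1, keepdim=True).clamp_min(1e-8)
    l_attn = (heatmap.flatten(1) / total
              * background_mask.flatten(1)).sum(dim=1).mean()
    return l_cls + lam * l_attn
```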
Optional Experiment — Multi-Scale Input
We also evaluate whether multi-scale inputs improve FER performance.
Architecture:
Face crop
+
Whole image
→ concatenated feature representation
→ classifier
Goal:
Capture both:
- local facial expression cues
- global facial context
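The two-branch design above can be sketched as follows; the tiny convolutional encoders here are placeholders for real backbones, and the feature dimension is an arbitrary choice:

```python
import torch
import torch.nn as nn

class MultiScaleFER(nn.Module):
    """Two-branch sketch: one encoder for the tight face crop, one for the
    whole image; features are concatenated before the classifier."""
    def __init__(self, num_classes=7, feat_dim=64):
        super().__init__()
        def branch():
            # Placeholder encoder; pooling makes it input-size agnostic.
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(16, feat_dim),
            )
        self.crop_branch = branch()
        self.full_branch = branch()
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, face_crop, whole_image):
        feats = torch.cat([self.crop_branch(face_crop),
                           self.full_branch(whole_image)], dim=1)
        return self.classifier(feats)
```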
Evaluation Metrics
Classification Metrics
- Accuracy
- Precision
- Recall
- F1 Score
- Confusion Matrix
Interpretability Metrics
- attention distribution across facial regions
- background attention ratio
Experimental Analysis
Domain Transfer
Key question:
Does face-recognition pretraining help or hinder emotion recognition?
Expected observations:
- VGGFace emphasizes identity-related features
- ImageNet pretrained models generalize better for emotion recognition
Architecture Differences
CNNs
- strong local feature extraction
Transformers
- global attention modeling
- holistic understanding of facial expressions
Attention-Guided Training
Hypothesis:
Encouraging attention on facial regions improves
- robustness
- interpretability
- classification performance
Results Summary
The experiments demonstrate:
- pretrained models significantly improve FER accuracy
- transformer architectures exhibit more global attention patterns
- quantitative attention metrics reveal meaningful differences between models
- attention-guided training improves interpretability and robustness
Key Contributions
- Quantitative attention analysis framework for FER models
- Systematic comparison of CNN, transfer learning, and transformers
- Attention-guided training objective improving interpretability
- Experimental analysis of representation transfer effects in emotion recognition
Skills & Technologies
- PyTorch
- Computer Vision
- Transfer Learning
- Vision Transformers
- Explainable AI (Grad-CAM, Attention Maps)
- Facial Landmark Detection
- Deep Learning Experiment Design
Repository
Full implementation and experiments are available here:
https://github.com/RohitPoduval1/csci5527-project