
Image Classification: Comparing CNNs and Vision Transformers

May 2025 • Machine Learning

Tags: PyTorch, Torchvision, PEFT (LoRA), NumPy, CUDA

Background & Objective

This project is a comparative analysis of three neural network architectures for image classification on CIFAR-10: a baseline CNN, a DenseNet, and a pre-trained Vision Transformer (ViT). The primary goal was to demonstrate the effectiveness of parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA) on a state-of-the-art vision transformer, benchmarked against traditional convolutional approaches.

Technical Stack

Framework: PyTorch, Torchvision
Fine-Tuning: PEFT library with LoRA
Hardware: NVIDIA Tesla V100 GPU with CUDA

Implementation

The DINOv2-Small ViT was fine-tuned with LoRA, updating only 8.06% of the model's 22.06M parameters. Building on pre-trained weights from self-supervised learning, parameter-efficient fine-tuning sharply reduced training time and compute cost while maintaining high accuracy. All three architectures were trained and evaluated on CIFAR-10 under a common benchmarking protocol.
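The training code itself isn't shown on this page, but the core idea behind LoRA can be sketched in NumPy. This is an illustrative stand-in, not the PEFT implementation: the layer width matches DINOv2-Small's hidden size (384), while the rank `r` and scaling `alpha` are assumed values not stated above.

```python
import numpy as np

# LoRA in one layer: keep the pre-trained weight W frozen and learn a
# low-rank update (alpha / r) * B @ A with far fewer parameters.
rng = np.random.default_rng(0)

d_out, d_in = 384, 384   # DINOv2-Small hidden width
r, alpha = 8, 16         # assumed rank and scaling factor (hyperparameters)

W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

# Effective weight at inference; equals W at initialization since B is zero.
W_adapted = W + (alpha / r) * B @ A

full_params = W.size             # what a full fine-tune would update
lora_params = A.size + B.size    # what LoRA actually trains
print(f"full: {full_params}, LoRA: {lora_params} "
      f"({100 * lora_params / full_params:.1f}% of the layer)")
```

Because `B` starts at zero, the adapted model is identical to the pre-trained one before training begins, which is what makes LoRA fine-tuning stable; the fraction of trainable parameters per layer (here about 4%) is what drives the 8.06% figure reported above.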

Results

| Model | Test Accuracy | Test Loss | Total Parameters | Fine-Tuned Params |
|---|---|---|---|---|
| Shallow CNN | 83.46% | 0.5299 | ~4.8M | 100% |
| DenseNet | 91.22% | 0.3692 | ~7.4M | 100% |
| DINOv2-Small (ViT) | 95.95% | 0.1243 | 22.06M | 8.06% |
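As a quick sanity check on the table, the 8.06% fine-tuned fraction implies roughly 1.78M updated parameters, fewer than either fully fine-tuned CNN despite the ViT's larger total size:

```python
# Arithmetic on the reported figures: 8.06% of DINOv2-Small's 22.06M parameters.
total_params = 22.06e6
tuned_fraction = 0.0806
tuned_params = total_params * tuned_fraction
print(f"{tuned_params / 1e6:.2f}M fine-tuned parameters")  # ~1.78M
```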

The LoRA-fine-tuned ViT achieved 95.95% test accuracy, demonstrating the strength of modern transformer architectures and transfer learning for image classification while training only a small fraction of the model's parameters.