
Image Classification: Comparing CNNs and Vision Transformers

May 2025 • Machine Learning

Tags: PyTorch, Torchvision, PEFT (LoRA), NumPy, CUDA

Background & Objective

This project is a comparative analysis of three neural network architectures for image classification on CIFAR-10: a baseline CNN, a DenseNet, and a pre-trained Vision Transformer (ViT). The primary goal was to demonstrate the effectiveness of parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA) on a state-of-the-art vision transformer, benchmarked against traditional convolutional approaches.

Technical Stack

Framework: PyTorch, Torchvision
Fine-Tuning: PEFT library with LoRA
Hardware: NVIDIA Tesla V100 GPU with CUDA

Implementation

The DINOv2-Small ViT was fine-tuned with LoRA, updating only 8.06% of the model's 22.06M parameters. Building on pre-trained weights from self-supervised learning, parameter-efficient fine-tuning sharply reduced training time and compute cost while maintaining high accuracy. All three architectures were trained and evaluated on CIFAR-10 under a common benchmarking protocol.
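The training code itself isn't shown on this page, but the core idea behind LoRA can be sketched in NumPy. This is an illustrative stand-in, not the PEFT implementation: the layer width matches DINOv2-Small's hidden size (384), while the rank `r` and scaling `alpha` are assumed values not stated above.

```python
import numpy as np

# LoRA in one layer: keep the pre-trained weight W frozen and learn a
# low-rank update (alpha / r) * B @ A with far fewer parameters.
rng = np.random.default_rng(0)

d_out, d_in = 384, 384   # DINOv2-Small hidden width
r, alpha = 8, 16         # assumed rank and scaling factor (hyperparameters)

W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

# Effective weight at inference; equals W at initialization since B is zero.
W_adapted = W + (alpha / r) * B @ A

full_params = W.size             # what a full fine-tune would update
lora_params = A.size + B.size    # what LoRA actually trains
print(f"full: {full_params}, LoRA: {lora_params} "
      f"({100 * lora_params / full_params:.1f}% of the layer)")
```

Because `B` starts at zero, the adapted model is identical to the pre-trained one before training begins, which is what makes LoRA fine-tuning stable; the fraction of trainable parameters per layer (here about 4%) is what drives the 8.06% figure reported above.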

Results

| Model | Test Accuracy | Test Loss | Total Parameters | Fine-Tuned Params |
|---|---|---|---|---|
| Shallow CNN | 83.46% | 0.5299 | ~4.8M | 100% |
| DenseNet | 91.22% | 0.3692 | ~7.4M | 100% |
| DINOv2-Small (ViT) | 95.95% | 0.1243 | 22.06M | 8.06% |
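As a quick sanity check on the table, the 8.06% fine-tuned fraction implies roughly 1.78M updated parameters, fewer than either fully fine-tuned CNN despite the ViT's larger total size:

```python
# Arithmetic on the reported figures: 8.06% of DINOv2-Small's 22.06M parameters.
total_params = 22.06e6
tuned_fraction = 0.0806
tuned_params = total_params * tuned_fraction
print(f"{tuned_params / 1e6:.2f}M fine-tuned parameters")  # ~1.78M
```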

The LoRA-fine-tuned ViT achieved 95.95% test accuracy, demonstrating the strength of modern transformer architectures and transfer learning for image classification while training only a small fraction of the model's parameters.