Training Imagenet on Ubuntu 18.04

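This post describes how to set up Ubuntu 18.04 to train a CNN model on ImageNet. The steps are

1. Install CUDA
2. Install CUDNN
3. Install pytorch
4. Prepare imagenet dataset
5. Training

To save time, start with step 4 so that the ImageNet dataset downloads while you work through the other steps.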

Install CUDA

There is no need to update the graphics card driver separately; the driver is updated as part of the CUDA installation. Partially following this blog post, my commands are

$ wget
$ sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
$ sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/
$ sudo apt-get update
$ sudo apt-get -y install cuda

Then I add

# CUDA Config - ~/.bashrc
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

to ~/.bashrc, and run source ~/.bashrc to apply the changes.
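As a quick check of the two export lines, here is a minimal sketch of what the `${PATH:+:${PATH}}` expansion does, followed by a test of whether the CUDA tools now resolve. The helper names `prepend_path` and `report_tool` are made up for illustration and are not part of CUDA; the check only reports whether `nvcc` and `nvidia-smi` are found on PATH, nothing more.

```shell
# Prepend a directory to a colon-separated list. The ${var:+...} form adds
# the ":" separator only when the existing list is set and non-empty.
prepend_path() {
  dir=$1
  current=$2
  printf '%s\n' "${dir}${current:+:${current}}"
}

# Report whether a command can be found on the current PATH.
report_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found at $(command -v "$1")"
  else
    echo "$1: not found on PATH"
  fi
}

prepend_path /usr/local/cuda-10.1/bin ""          # prints /usr/local/cuda-10.1/bin
prepend_path /usr/local/cuda-10.1/bin /usr/bin    # prints /usr/local/cuda-10.1/bin:/usr/bin
report_tool nvcc
report_tool nvidia-smi
```

If either tool is reported as not found after running source ~/.bashrc, the PATH line is the first thing to re-check.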

Install CUDNN

Go to the NVIDIA cuDNN download page, register, and download all 3 .deb files for Ubuntu 18.04: the runtime library, the developer library, and the code samples.

Go to your download folder and install them in the same order:

$ sudo dpkg -i libcudnn7_7.5.0.56-1+cuda10.0_amd64.deb	# the runtime library
$ sudo dpkg -i libcudnn7-dev_7.5.0.56-1+cuda10.0_amd64.deb	# the developer library
$ sudo dpkg -i libcudnn7-doc_7.5.0.56-1+cuda10.0_amd64.deb	# the code samples
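To confirm that the three packages registered with dpkg, something along these lines may help. It is only a sketch: the function name `list_cudnn_packages` is hypothetical, and it simply lists whatever packages match `libcudnn*` together with their versions.

```shell
# List installed packages whose names start with "libcudnn", with versions.
list_cudnn_packages() {
  if ! command -v dpkg-query >/dev/null 2>&1; then
    echo "dpkg-query not available on this system"
    return 0
  fi
  # dpkg-query exits non-zero when nothing matches the pattern.
  dpkg-query -W -f='${Package} ${Version}\n' 'libcudnn*' 2>/dev/null ||
    echo "no libcudnn packages found"
}

list_cudnn_packages
```

The version suffix in the output shows which CUDA release each package was built for, which is worth comparing against the CUDA version installed earlier.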

Install pytorch

Ubuntu 18.04 ships with Python 3.6 preinstalled, so we will work with Python 3.6.

create virtual environment

$ sudo apt-get install -y python3-venv
$ mkdir environments
$ cd environments
$ python3 -m venv pytorch-env
$ source pytorch-env/bin/activate	# use environment

on success, we should see

(pytorch-env) user-name@user-pc:~/environments$

install pytorch

if pip3 is not installed, install it with

$ sudo apt-get install -y python3-pip


$ pip3 install torch torchvision
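A quick way to check the install is to ask PyTorch for its version and whether it can see the GPU. The sketch below assumes it is run in the same shell where the virtual environment was activated; the wrapper name `check_torch` is only for illustration.

```shell
# Show the active virtual environment, then the PyTorch version and
# whether CUDA is usable from Python.
check_torch() {
  echo "virtual env: ${VIRTUAL_ENV:-none active}"
  if python3 -c "import torch" >/dev/null 2>&1; then
    python3 -c "import torch; print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())"
  else
    echo "torch could not be imported by python3"
  fi
}

check_torch
```

If CUDA is reported as unavailable, the driver and CUDA steps above are the likely place to look before starting a multi-hour training run.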

Prepare imagenet dataset

The official ImageNet site goes down periodically, so the fastest and easiest way to download the ILSVRC2012 dataset is from an alternative source. There are 1.2 million images in 1000 categories, totalling around 150GB.
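Before starting a download of this size, it can be worth checking free disk space. The following is a rough sketch; the helper name `free_gib` and the 150 GiB threshold (taken from the approximate size quoted above) are assumptions, and extracting the archives needs additional room unless they are deleted as they are unpacked.

```shell
# Print free space, in whole GiB, on the filesystem holding the given directory.
free_gib() {
  df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

needed_gib=150   # approximate archive size quoted in the post
have_gib=$(free_gib .)
if [ "$have_gib" -ge "$needed_gib" ]; then
  echo "free space: ${have_gib} GiB, enough for the archives"
else
  echo "free space: ${have_gib} GiB, less than the ${needed_gib} GiB needed"
fi
```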

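To batch-extract the files and move them into the correct folders, use a combination of the following scripts

# extract each archive into a folder named after it
for a in *.tar; do
	mkdir -p "${a%.*}"
	tar -xvf "$a" -C "${a%.*}"
done

# extract each archive, then delete it once it has been unpacked
for a in *.tar; do
	mkdir -p "${a%.*}"
	tar -xvf "$a" -C "${a%.*}"
	rm "$a"
done

# delete the archives
for a in *.tar; do
	rm "$a"
done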

This script helps to put validation images into folders. The resulting folder structure should have one subfolder per category, with that category's images inside.
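The idea of sorting validation images into per-category folders can be sketched as follows. This is not the script referred to above: it assumes a hypothetical mapping file with one `<image file name> <category ID>` pair per line, and the function name `sort_val_images` is made up for illustration.

```shell
# Move each validation image into a subfolder named after its category ID.
# $1: folder holding the validation images
# $2: mapping file with lines "<image file name> <category ID>" (assumed format)
sort_val_images() {
  val_dir=$1
  mapping=$2
  while read -r image class_id; do
    # Skip blank or incomplete lines and images that are missing or already moved.
    [ -n "$image" ] && [ -n "$class_id" ] && [ -f "$val_dir/$image" ] || continue
    mkdir -p "$val_dir/$class_id"
    mv "$val_dir/$image" "$val_dir/$class_id/"
  done < "$mapping"
}

# Usage (file names here are placeholders):
#   sort_val_images val val_labels.txt
```

Running it twice is harmless, since images that have already been moved are skipped. The mapping file should end with a newline so that the last entry is read.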

Training

Following this post, clone the repository and run the training script under the imagenet/ folder:
python3 -a resnet18 ILSVRC2012_folder -b 32 |& tee train_log.txt
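The `|& tee train_log.txt` part of the command is bash shorthand for `2>&1 | tee train_log.txt`: it shows the output on screen and writes both stdout and stderr to the log file. A small sketch of the same pattern, using a stand-in function (`fake_training`, hypothetical) instead of the real training script:

```shell
# Stand-in for a training script: one line on stdout, one on stderr.
fake_training() {
  echo "Epoch: [0][0/10] progress line on stdout"
  echo "warning line on stderr" >&2
}

log_file=$(mktemp)
# Portable spelling of "|& tee": merge stderr into stdout, then duplicate
# the stream to the terminal and the log file.
fake_training 2>&1 | tee "$log_file"
grep -c . "$log_file"   # prints 2: both lines reached the log
rm -f "$log_file"
```

Without the `2>&1`, warnings and Python tracebacks would appear on screen but be missing from the log.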

The GTX 1080 Ti cannot handle a batch size greater than 32. On this hardware, each epoch takes 7.37 hours. After 34 epochs, the accuracy is

Training: Acc@1 44.70 Acc@5 69.38

Validation: Acc@1 51.64 Acc@5 77.35

The best model is stored in model_best.pth.tar. This script only creates one extra checkpoint.
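The epoch time can be sanity-checked with a little arithmetic. The sketch below assumes the standard ILSVRC2012 split sizes (1,281,167 training and 50,000 validation images, which are not stated in this post) and uses the running-average batch time of 0.646 s from the training log in this post; the helper names are illustrative.

```shell
# Number of batches needed to cover a dataset (ceiling division).
iters_per_epoch() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Hours for one pass, given the batch count and average seconds per batch.
pass_hours() {
  awk -v n="$1" -v t="$2" 'BEGIN { printf "%.2f\n", n * t / 3600 }'
}

train_iters=$(iters_per_epoch 1281167 32)   # assumed training set size
val_iters=$(iters_per_epoch 50000 32)       # assumed validation set size
echo "train iterations per epoch: $train_iters"                          # 40037
echo "val iterations per epoch:   $val_iters"                            # 1563
echo "training pass: $(pass_hours "$train_iters" 0.646) hours"           # 7.18
```

The iteration counts match the 40037 and 1563 shown in the log, and the training pass alone comes to about 7.18 hours. The remaining 0.19 hours or so of the 7.37-hour epoch is plausibly the validation pass, although the post does not break this down.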

Here’s an example training log

Epoch: [34][39930/40037]	Time  2.353 ( 0.646)	Data  2.326 ( 0.604)	Loss 2.3454e+00 (2.5339e+00)	Acc@1  50.00 ( 44.70)	Acc@5  71.88 ( 69.39)
Epoch: [34][39940/40037]	Time  0.049 ( 0.646)	Data  0.000 ( 0.604)	Loss 2.7097e+00 (2.5339e+00)	Acc@1  43.75 ( 44.70)	Acc@5  68.75 ( 69.39)
Epoch: [34][39950/40037]	Time  2.296 ( 0.646)	Data  2.272 ( 0.604)	Loss 2.1355e+00 (2.5339e+00)	Acc@1  59.38 ( 44.70)	Acc@5  75.00 ( 69.39)
Epoch: [34][39960/40037]	Time  0.050 ( 0.646)	Data  0.000 ( 0.604)	Loss 3.1379e+00 (2.5339e+00)	Acc@1  40.62 ( 44.70)	Acc@5  59.38 ( 69.39)
Epoch: [34][39970/40037]	Time  2.179 ( 0.646)	Data  2.152 ( 0.604)	Loss 2.8691e+00 (2.5340e+00)	Acc@1  46.88 ( 44.70)	Acc@5  56.25 ( 69.39)
Epoch: [34][39980/40037]	Time  0.050 ( 0.646)	Data  0.000 ( 0.604)	Loss 1.9133e+00 (2.5339e+00)	Acc@1  53.12 ( 44.70)	Acc@5  81.25 ( 69.39)
Epoch: [34][39990/40037]	Time  2.397 ( 0.646)	Data  2.374 ( 0.604)	Loss 2.7276e+00 (2.5340e+00)	Acc@1  50.00 ( 44.70)	Acc@5  56.25 ( 69.39)
Epoch: [34][40000/40037]	Time  0.050 ( 0.646)	Data  0.000 ( 0.604)	Loss 1.9445e+00 (2.5339e+00)	Acc@1  53.12 ( 44.70)	Acc@5  78.12 ( 69.39)
Epoch: [34][40010/40037]	Time  0.894 ( 0.646)	Data  0.869 ( 0.604)	Loss 2.4231e+00 (2.5340e+00)	Acc@1  43.75 ( 44.70)	Acc@5  68.75 ( 69.39)
Epoch: [34][40020/40037]	Time  0.507 ( 0.646)	Data  0.485 ( 0.604)	Loss 2.1194e+00 (2.5339e+00)	Acc@1  56.25 ( 44.70)	Acc@5  75.00 ( 69.39)
Epoch: [34][40030/40037]	Time  0.960 ( 0.646)	Data  0.922 ( 0.604)	Loss 2.4572e+00 (2.5340e+00)	Acc@1  40.62 ( 44.70)	Acc@5  75.00 ( 69.38)
Test: [   0/1563]	Time 13.309 (13.309)	Loss 1.4311e+00 (1.4311e+00)	Acc@1  65.62 ( 65.62)	Acc@5  90.62 ( 90.62)
Test: [  10/1563]	Time  0.016 ( 1.437)	Loss 1.8978e+00 (1.5550e+00)	Acc@1  34.38 ( 55.40)	Acc@5  90.62 ( 88.35)
Test: [  20/1563]	Time  0.016 ( 0.883)	Loss 7.2487e-01 (1.3800e+00)	Acc@1  84.38 ( 63.39)	Acc@5  93.75 ( 88.69)
Test: [  30/1563]	Time  0.016 ( 0.728)	Loss 5.2000e-01 (1.2296e+00)	Acc@1  81.25 ( 67.24)	Acc@5  93.75 ( 89.62)
Test: [  40/1563]	Time  0.018 ( 0.626)	Loss 1.3063e+00 (1.2341e+00)	Acc@1  68.75 ( 67.30)	Acc@5  93.75 ( 89.63)
Test: [  50/1563]	Time  0.018 ( 0.568)	Loss 3.4426e+00 (1.3470e+00)	Acc@1  34.38 ( 66.67)	Acc@5  62.50 ( 88.11)
Test: [  60/1563]	Time  0.018 ( 0.516)	Loss 2.0711e+00 (1.5121e+00)	Acc@1  50.00 ( 63.27)	Acc@5  78.12 ( 85.91)
Test: [  70/1563]	Time  0.016 ( 0.493)	Loss 2.5851e+00 (1.6045e+00)	Acc@1  53.12 ( 61.14)	Acc@5  68.75 ( 84.24)
Test: [  80/1563]	Time  0.539 ( 0.464)	Loss 1.9009e+00 (1.6847e+00)	Acc@1  65.62 ( 60.07)	Acc@5  75.00 ( 83.18)
Test: [  90/1563]	Time  0.016 ( 0.452)	Loss 2.1300e+00 (1.7213e+00)	Acc@1  46.88 ( 58.76)	Acc@5  78.12 ( 82.97)
Test: [ 100/1563]	Time  0.254 ( 0.446)	Loss 2.6598e+00 (1.8422e+00)	Acc@1  37.50 ( 56.40)	Acc@5  65.62 ( 81.16)
Test: [ 110/1563]	Time  0.393 ( 0.437)	Loss 1.4576e+00 (1.8606e+00)	Acc@1  56.25 ( 55.97)	Acc@5  84.38 ( 80.86)
Test: [ 120/1563]	Time  0.321 ( 0.425)	Loss 8.7615e-01 (1.8282e+00)	Acc@1  75.00 ( 56.53)	Acc@5  96.88 ( 81.40)
Test: [ 130/1563]	Time  0.033 ( 0.413)	Loss 5.4003e-01 (1.7993e+00)	Acc@1  84.38 ( 57.25)	Acc@5 100.00 ( 81.66)
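With a log this long, it helps to pull out just the running averages. The sketch below relies only on the line format shown above, where each value is followed by its running average in parentheses; the function name `last_train_avg` is hypothetical.

```shell
# Print the running-average Acc@1 and Acc@5 from the last "Epoch:" line
# of a training log in the format shown above.
last_train_avg() {
  grep '^Epoch:' "$1" | tail -n 1 |
    sed -n 's/.*Acc@1 *[0-9.]* *( *\([0-9.]*\)).*Acc@5 *[0-9.]* *( *\([0-9.]*\)).*/Acc@1 \1 Acc@5 \2/p'
}

# Usage:
#   last_train_avg train_log.txt
```

Applied to the log excerpt above, this gives Acc@1 44.70 Acc@5 69.38, the training accuracy quoted earlier.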

MNIST is a handwritten-digit dataset, and training on it is much easier. Go to the mnist folder and run the training script there with python3; the dataset will be downloaded automatically. 10 epochs are trained in about 2 minutes, and the validation set accuracy is 98.99%.