# Training ImageNet on Ubuntu 18.04

This post describes how to set up Ubuntu 18.04 to train a CNN model on ImageNet. The steps are:

1. Install CUDA
2. Install cuDNN
3. Install PyTorch
4. Prepare the ImageNet dataset
5. Train

To save time, start with step 4 and get the ImageNet download going first.

## Install CUDA

There is no need to update the graphics card driver first; it is updated as part of the CUDA installation. Partially following this blog post, my commands are

```bash
$ wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
$ sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub
$ sudo apt-get update
$ sudo apt-get -y install cuda
```

Then I add

```bash
# CUDA Config - ~/.bashrc
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

to `~/.bashrc` and run `source ~/.bashrc` to apply the changes.
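
To verify the installation, `nvidia-smi` should report the updated driver and `nvcc` the toolkit version:

```bash
$ nvidia-smi       # driver version and visible GPUs
$ nvcc --version   # should report CUDA 10.1
```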

## Install cuDNN

Go to https://developer.nvidia.com/rdp/cudnn-download, register, and download all 3 .deb files for Ubuntu 18.04: the runtime library, the developer library, and the code samples.

Go to your download folder and install them in this order:

```bash
$ sudo dpkg -i libcudnn7_7.5.0.56-1+cuda10.0_amd64.deb       # the runtime library
$ sudo dpkg -i libcudnn7-dev_7.5.0.56-1+cuda10.0_amd64.deb   # the developer library
$ sudo dpkg -i libcudnn7-doc_7.5.0.56-1+cuda10.0_amd64.deb   # the code samples
```
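
As a sanity check, NVIDIA's cuDNN install guide suggests building the bundled mnistCUDNN sample. Assuming the code samples landed in the default location for these packages, /usr/src/cudnn_samples_v7:

```bash
$ cp -r /usr/src/cudnn_samples_v7/ $HOME
$ cd $HOME/cudnn_samples_v7/mnistCUDNN
$ make clean && make
$ ./mnistCUDNN    # should end with "Test passed!"
```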

## Install PyTorch

Ubuntu 18.04 has Python 3.6 preinstalled, so we will work with Python 3.6.

### Create a virtual environment

```bash
$ sudo apt-get install -y python3-venv
$ mkdir environments
$ cd environments
$ python3 -m venv pytorch-env
$ source pytorch-env/bin/activate    # use the environment
```

On success, the prompt should look like

```
(pytorch-env) user-name@user-pc:~/environments$
```

### Install PyTorch

If pip3 is not installed, install it with

```bash
$ sudo apt-get install -y python3-pip
```

then

```bash
$ pip3 install torch torchvision
```
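
To confirm that the installed build can see the GPU, a quick check with the stock torch API:

```bash
$ python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```

This should print the version and True.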

## Prepare the ImageNet dataset

The official ImageNet site goes down periodically; the fastest and easiest way to download the ILSVRC2012 dataset is from http://academictorrents.com/browse.php?search=imagenet. There are 1.2 million images in 1000 categories, totalling around 150 GB.

To batch-extract the files and move them into the correct folders, use a combination of the following scripts (a usage example follows them).

`extract.sh`

```bash
# explode each per-class tarball into a folder of the same name
for a in *.tar;
do
	mkdir "${a%.*}";
	tar -xvf "$a" -C "${a%.*}";
done
```

`extractAndRemove.sh`

```bash
# same as extract.sh, but remove each tarball after extracting it
for a in *.tar;
do
	mkdir "${a%.*}";
	tar -xvf "$a" -C "${a%.*}";
	rm "$a";
done
```

`removeTars.sh`

```bash
# remove the leftover tarballs
for a in *.tar;
do
	rm "$a";
done
```
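
For the training set, a typical sequence looks like the following; ILSVRC2012_img_train.tar is the standard archive name from the torrent, so adjust the path if yours differs:

```bash
$ mkdir -p ILSVRC2012_folder/train && cd ILSVRC2012_folder/train
$ tar -xvf /path/to/ILSVRC2012_img_train.tar   # yields 1000 per-class tarballs
$ bash extractAndRemove.sh                     # one folder per synset; tarballs removed
```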

The validation images come in one flat folder, and a separate script is needed to move them into per-class folders (a sketch of the idea follows the tree below). The resulting folder structure should look like

```
ILSVRC2012_folder
	train
		n01440764
		n01443537
		...
	val
		n01440764
		n01443537
		...
```
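
Here is a minimal sketch of that step in Python. It assumes a hypothetical mapping file val_synsets.txt with one "filename synset" pair per line; the real filename-to-synset mapping ships with the ILSVRC2012 devkit.

```python
# valprep_sketch.py - move flat validation images into per-class folders.
# Assumes a hypothetical mapping file "val_synsets.txt" with lines like:
#   ILSVRC2012_val_00000001.JPEG n01751748
# (the real mapping comes from the ILSVRC2012 devkit).
import os
import shutil

VAL_DIR = "ILSVRC2012_folder/val"

with open("val_synsets.txt") as f:
    for line in f:
        if not line.strip():
            continue
        filename, synset = line.split()
        os.makedirs(os.path.join(VAL_DIR, synset), exist_ok=True)
        shutil.move(os.path.join(VAL_DIR, filename),
                    os.path.join(VAL_DIR, synset, filename))
```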
## Training
Following [this post](https://zhuanlan.zhihu.com/p/67919205), clone https://github.com/pytorch/examples and run the training script in the imagenet/ folder:
```bash
python3 main.py -a resnet18 ILSVRC2012_folder -b 32 |& tee train_log.txt
```
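
If training is interrupted, it should be possible to resume from the last checkpoint with the script's --resume flag (flags may vary between versions of the examples repo; check `python3 main.py --help`):

```bash
python3 main.py -a resnet18 ILSVRC2012_folder -b 32 --resume checkpoint.pth.tar |& tee -a train_log.txt
```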

A GTX 1080 Ti cannot handle a batch size greater than 32. On this hardware, each epoch takes 7.37 hours. After 34 epochs, the accuracy is

```
Training:   Acc@1 44.70  Acc@5 69.38
Validation: Acc@1 51.64  Acc@5 77.35
```

The best model is stored in model_best.pth.tar. The script only creates one extra checkpoint, checkpoint.pth.tar, which is overwritten each epoch.
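
To use the saved weights later, here is a minimal loading sketch. It assumes the checkpoint layout used by this examples script (weights under a 'state_dict' key, with the 'module.' prefix that nn.DataParallel adds):

```python
import torch
from torchvision import models

ckpt = torch.load("model_best.pth.tar", map_location="cpu")
# strip the "module." prefix added by nn.DataParallel, if present
state = {k[len("module."):] if k.startswith("module.") else k: v
         for k, v in ckpt["state_dict"].items()}

model = models.resnet18()  # same architecture as training (-a resnet18)
model.load_state_dict(state)
model.eval()  # switch to inference mode
```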

Here's an example training log:

```
Epoch: [34][39930/40037]	Time  2.353 ( 0.646)	Data  2.326 ( 0.604)	Loss 2.3454e+00 (2.5339e+00)	Acc@1  50.00 ( 44.70)	Acc@5  71.88 ( 69.39)
Epoch: [34][39940/40037]	Time  0.049 ( 0.646)	Data  0.000 ( 0.604)	Loss 2.7097e+00 (2.5339e+00)	Acc@1  43.75 ( 44.70)	Acc@5  68.75 ( 69.39)
Epoch: [34][39950/40037]	Time  2.296 ( 0.646)	Data  2.272 ( 0.604)	Loss 2.1355e+00 (2.5339e+00)	Acc@1  59.38 ( 44.70)	Acc@5  75.00 ( 69.39)
Epoch: [34][39960/40037]	Time  0.050 ( 0.646)	Data  0.000 ( 0.604)	Loss 3.1379e+00 (2.5339e+00)	Acc@1  40.62 ( 44.70)	Acc@5  59.38 ( 69.39)
Epoch: [34][39970/40037]	Time  2.179 ( 0.646)	Data  2.152 ( 0.604)	Loss 2.8691e+00 (2.5340e+00)	Acc@1  46.88 ( 44.70)	Acc@5  56.25 ( 69.39)
Epoch: [34][39980/40037]	Time  0.050 ( 0.646)	Data  0.000 ( 0.604)	Loss 1.9133e+00 (2.5339e+00)	Acc@1  53.12 ( 44.70)	Acc@5  81.25 ( 69.39)
Epoch: [34][39990/40037]	Time  2.397 ( 0.646)	Data  2.374 ( 0.604)	Loss 2.7276e+00 (2.5340e+00)	Acc@1  50.00 ( 44.70)	Acc@5  56.25 ( 69.39)
Epoch: [34][40000/40037]	Time  0.050 ( 0.646)	Data  0.000 ( 0.604)	Loss 1.9445e+00 (2.5339e+00)	Acc@1  53.12 ( 44.70)	Acc@5  78.12 ( 69.39)
Epoch: [34][40010/40037]	Time  0.894 ( 0.646)	Data  0.869 ( 0.604)	Loss 2.4231e+00 (2.5340e+00)	Acc@1  43.75 ( 44.70)	Acc@5  68.75 ( 69.39)
Epoch: [34][40020/40037]	Time  0.507 ( 0.646)	Data  0.485 ( 0.604)	Loss 2.1194e+00 (2.5339e+00)	Acc@1  56.25 ( 44.70)	Acc@5  75.00 ( 69.39)
Epoch: [34][40030/40037]	Time  0.960 ( 0.646)	Data  0.922 ( 0.604)	Loss 2.4572e+00 (2.5340e+00)	Acc@1  40.62 ( 44.70)	Acc@5  75.00 ( 69.38)
Test: [   0/1563]	Time 13.309 (13.309)	Loss 1.4311e+00 (1.4311e+00)	Acc@1  65.62 ( 65.62)	Acc@5  90.62 ( 90.62)
Test: [  10/1563]	Time  0.016 ( 1.437)	Loss 1.8978e+00 (1.5550e+00)	Acc@1  34.38 ( 55.40)	Acc@5  90.62 ( 88.35)
Test: [  20/1563]	Time  0.016 ( 0.883)	Loss 7.2487e-01 (1.3800e+00)	Acc@1  84.38 ( 63.39)	Acc@5  93.75 ( 88.69)
Test: [  30/1563]	Time  0.016 ( 0.728)	Loss 5.2000e-01 (1.2296e+00)	Acc@1  81.25 ( 67.24)	Acc@5  93.75 ( 89.62)
Test: [  40/1563]	Time  0.018 ( 0.626)	Loss 1.3063e+00 (1.2341e+00)	Acc@1  68.75 ( 67.30)	Acc@5  93.75 ( 89.63)
Test: [  50/1563]	Time  0.018 ( 0.568)	Loss 3.4426e+00 (1.3470e+00)	Acc@1  34.38 ( 66.67)	Acc@5  62.50 ( 88.11)
Test: [  60/1563]	Time  0.018 ( 0.516)	Loss 2.0711e+00 (1.5121e+00)	Acc@1  50.00 ( 63.27)	Acc@5  78.12 ( 85.91)
Test: [  70/1563]	Time  0.016 ( 0.493)	Loss 2.5851e+00 (1.6045e+00)	Acc@1  53.12 ( 61.14)	Acc@5  68.75 ( 84.24)
Test: [  80/1563]	Time  0.539 ( 0.464)	Loss 1.9009e+00 (1.6847e+00)	Acc@1  65.62 ( 60.07)	Acc@5  75.00 ( 83.18)
Test: [  90/1563]	Time  0.016 ( 0.452)	Loss 2.1300e+00 (1.7213e+00)	Acc@1  46.88 ( 58.76)	Acc@5  78.12 ( 82.97)
Test: [ 100/1563]	Time  0.254 ( 0.446)	Loss 2.6598e+00 (1.8422e+00)	Acc@1  37.50 ( 56.40)	Acc@5  65.62 ( 81.16)
Test: [ 110/1563]	Time  0.393 ( 0.437)	Loss 1.4576e+00 (1.8606e+00)	Acc@1  56.25 ( 55.97)	Acc@5  84.38 ( 80.86)
Test: [ 120/1563]	Time  0.321 ( 0.425)	Loss 8.7615e-01 (1.8282e+00)	Acc@1  75.00 ( 56.53)	Acc@5  96.88 ( 81.40)
Test: [ 130/1563]	Time  0.033 ( 0.413)	Loss 5.4003e-01 (1.7993e+00)	Acc@1  84.38 ( 57.25)	Acc@5 100.00 ( 81.66)
```

MNIST is a handwritten-digit dataset and is much easier to train on. Go to the mnist/ folder and run `python3 main.py`; the dataset is downloaded automatically. Training 10 epochs takes about 2 minutes, and the validation set accuracy is 98.99%.