# Training ImageNet on Ubuntu 18.04

This post describes how to set up Ubuntu 18.04 to train a CNN model on ImageNet. The steps are:

1. Install CUDA
2. Install cuDNN
3. Install PyTorch
4. Prepare the ImageNet dataset
5. Train

To save time, start with step 4 and get the ImageNet download going first.

## Install CUDA

There is no need to update the graphics card driver first; it is updated as part of the CUDA installation. Partially following this blog post, my commands are

```bash
$ wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
$ sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub
$ sudo apt-get update
$ sudo apt-get -y install cuda
```

Then I add

```bash
# CUDA Config - ~/.bashrc
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

to `~/.bashrc` and run `source ~/.bashrc` to apply the changes.
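
To verify the installation, `nvidia-smi` should report the updated driver and `nvcc` the toolkit version:

```bash
$ nvidia-smi       # driver version and visible GPUs
$ nvcc --version   # should report CUDA 10.1
```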

## Install cuDNN

Go to https://developer.nvidia.com/rdp/cudnn-download, register, and download all 3 .deb files for Ubuntu 18.04: the runtime library, the developer library, and the code samples.

Go to your download folder and install them in this order:

```bash
$ sudo dpkg -i libcudnn7_7.5.0.56-1+cuda10.0_amd64.deb       # the runtime library
$ sudo dpkg -i libcudnn7-dev_7.5.0.56-1+cuda10.0_amd64.deb   # the developer library
$ sudo dpkg -i libcudnn7-doc_7.5.0.56-1+cuda10.0_amd64.deb   # the code samples
```
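
As a sanity check, NVIDIA's cuDNN install guide suggests building the bundled mnistCUDNN sample. Assuming the code samples landed in the default location for these packages, /usr/src/cudnn_samples_v7:

```bash
$ cp -r /usr/src/cudnn_samples_v7/ $HOME
$ cd $HOME/cudnn_samples_v7/mnistCUDNN
$ make clean && make
$ ./mnistCUDNN    # should end with "Test passed!"
```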

## Install PyTorch

Ubuntu 18.04 has Python 3.6 preinstalled, so we will work with Python 3.6.

### Create a virtual environment

```bash
$ sudo apt-get install -y python3-venv
$ mkdir environments
$ cd environments
$ python3 -m venv pytorch-env
$ source pytorch-env/bin/activate    # use the environment
```

On success, the prompt should look like

```
(pytorch-env) user-name@user-pc:~/environments$
```

### Install PyTorch

If pip3 is not installed, install it with

```bash
$ sudo apt-get install -y python3-pip
```

then

```bash
$ pip3 install torch torchvision
```
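
To confirm that the installed build can see the GPU, a quick check with the stock torch API:

```bash
$ python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```

This should print the version and True.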

## Prepare the ImageNet dataset

The official ImageNet site goes down periodically; the fastest and easiest way to download the ILSVRC2012 dataset is from http://academictorrents.com/browse.php?search=imagenet. There are 1.2 million images in 1000 categories, totalling around 150 GB.

To batch-extract the files and move them into the correct folders, use a combination of the following scripts (a usage example follows them).

`extract.sh`

```bash
# explode each per-class tarball into a folder of the same name
for a in *.tar;
do
	mkdir "${a%.*}";
	tar -xvf "$a" -C "${a%.*}";
done
```

`extractAndRemove.sh`

```bash
# same as extract.sh, but remove each tarball after extracting it
for a in *.tar;
do
	mkdir "${a%.*}";
	tar -xvf "$a" -C "${a%.*}";
	rm "$a";
done
```

`removeTars.sh`

```bash
# remove the leftover tarballs
for a in *.tar;
do
	rm "$a";
done
```
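
For the training set, a typical sequence looks like the following; ILSVRC2012_img_train.tar is the standard archive name from the torrent, so adjust the path if yours differs:

```bash
$ mkdir -p ILSVRC2012_folder/train && cd ILSVRC2012_folder/train
$ tar -xvf /path/to/ILSVRC2012_img_train.tar   # yields 1000 per-class tarballs
$ bash extractAndRemove.sh                     # one folder per synset; tarballs removed
```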

The validation images come in one flat folder, and a separate script is needed to move them into per-class folders (a sketch of the idea follows the tree below). The resulting folder structure should look like

```
ILSVRC2012_folder
	train
		n01440764
		n01443537
		...
	val
		n01440764
		n01443537
		...
```
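
Here is a minimal sketch of that step in Python. It assumes a hypothetical mapping file val_synsets.txt with one "filename synset" pair per line; the real filename-to-synset mapping ships with the ILSVRC2012 devkit.

```python
# valprep_sketch.py - move flat validation images into per-class folders.
# Assumes a hypothetical mapping file "val_synsets.txt" with lines like:
#   ILSVRC2012_val_00000001.JPEG n01751748
# (the real mapping comes from the ILSVRC2012 devkit).
import os
import shutil

VAL_DIR = "ILSVRC2012_folder/val"

with open("val_synsets.txt") as f:
    for line in f:
        if not line.strip():
            continue
        filename, synset = line.split()
        os.makedirs(os.path.join(VAL_DIR, synset), exist_ok=True)
        shutil.move(os.path.join(VAL_DIR, filename),
                    os.path.join(VAL_DIR, synset, filename))
```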
## Training
Following [this post](https://zhuanlan.zhihu.com/p/67919205), clone https://github.com/pytorch/examples and run the training script in the imagenet/ folder:
```bash
python3 main.py -a resnet18 ILSVRC2012_folder -b 32 |& tee train_log.txt
```
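
If training is interrupted, it should be possible to resume from the last checkpoint with the script's --resume flag (flags may vary between versions of the examples repo; check `python3 main.py --help`):

```bash
python3 main.py -a resnet18 ILSVRC2012_folder -b 32 --resume checkpoint.pth.tar |& tee -a train_log.txt
```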

A GTX 1080 Ti cannot handle a batch size greater than 32. On this hardware, each epoch takes 7.37 hours. After 34 epochs, the accuracy is

```
Training:   Acc@1 44.70  Acc@5 69.38
Validation: Acc@1 51.64  Acc@5 77.35
```

The best model is stored in model_best.pth.tar. The script only creates one extra checkpoint, checkpoint.pth.tar, which is overwritten each epoch.
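
To use the saved weights later, here is a minimal loading sketch. It assumes the checkpoint layout used by this examples script (weights under a 'state_dict' key, with the 'module.' prefix that nn.DataParallel adds):

```python
import torch
from torchvision import models

ckpt = torch.load("model_best.pth.tar", map_location="cpu")
# strip the "module." prefix added by nn.DataParallel, if present
state = {k[len("module."):] if k.startswith("module.") else k: v
         for k, v in ckpt["state_dict"].items()}

model = models.resnet18()  # same architecture as training (-a resnet18)
model.load_state_dict(state)
model.eval()  # switch to inference mode
```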

Here's an example training log:

```
Epoch: [34][39930/40037]	Time  2.353 ( 0.646)	Data  2.326 ( 0.604)	Loss 2.3454e+00 (2.5339e+00)	Acc@1  50.00 ( 44.70)	Acc@5  71.88 ( 69.39)
Epoch: [34][39940/40037]	Time  0.049 ( 0.646)	Data  0.000 ( 0.604)	Loss 2.7097e+00 (2.5339e+00)	Acc@1  43.75 ( 44.70)	Acc@5  68.75 ( 69.39)
Epoch: [34][39950/40037]	Time  2.296 ( 0.646)	Data  2.272 ( 0.604)	Loss 2.1355e+00 (2.5339e+00)	Acc@1  59.38 ( 44.70)	Acc@5  75.00 ( 69.39)
Epoch: [34][39960/40037]	Time  0.050 ( 0.646)	Data  0.000 ( 0.604)	Loss 3.1379e+00 (2.5339e+00)	Acc@1  40.62 ( 44.70)	Acc@5  59.38 ( 69.39)
Epoch: [34][39970/40037]	Time  2.179 ( 0.646)	Data  2.152 ( 0.604)	Loss 2.8691e+00 (2.5340e+00)	Acc@1  46.88 ( 44.70)	Acc@5  56.25 ( 69.39)
Epoch: [34][39980/40037]	Time  0.050 ( 0.646)	Data  0.000 ( 0.604)	Loss 1.9133e+00 (2.5339e+00)	Acc@1  53.12 ( 44.70)	Acc@5  81.25 ( 69.39)
Epoch: [34][39990/40037]	Time  2.397 ( 0.646)	Data  2.374 ( 0.604)	Loss 2.7276e+00 (2.5340e+00)	Acc@1  50.00 ( 44.70)	Acc@5  56.25 ( 69.39)
Epoch: [34][40000/40037]	Time  0.050 ( 0.646)	Data  0.000 ( 0.604)	Loss 1.9445e+00 (2.5339e+00)	Acc@1  53.12 ( 44.70)	Acc@5  78.12 ( 69.39)
Epoch: [34][40010/40037]	Time  0.894 ( 0.646)	Data  0.869 ( 0.604)	Loss 2.4231e+00 (2.5340e+00)	Acc@1  43.75 ( 44.70)	Acc@5  68.75 ( 69.39)
Epoch: [34][40020/40037]	Time  0.507 ( 0.646)	Data  0.485 ( 0.604)	Loss 2.1194e+00 (2.5339e+00)	Acc@1  56.25 ( 44.70)	Acc@5  75.00 ( 69.39)
Epoch: [34][40030/40037]	Time  0.960 ( 0.646)	Data  0.922 ( 0.604)	Loss 2.4572e+00 (2.5340e+00)	Acc@1  40.62 ( 44.70)	Acc@5  75.00 ( 69.38)
Test: [   0/1563]	Time 13.309 (13.309)	Loss 1.4311e+00 (1.4311e+00)	Acc@1  65.62 ( 65.62)	Acc@5  90.62 ( 90.62)
Test: [  10/1563]	Time  0.016 ( 1.437)	Loss 1.8978e+00 (1.5550e+00)	Acc@1  34.38 ( 55.40)	Acc@5  90.62 ( 88.35)
Test: [  20/1563]	Time  0.016 ( 0.883)	Loss 7.2487e-01 (1.3800e+00)	Acc@1  84.38 ( 63.39)	Acc@5  93.75 ( 88.69)
Test: [  30/1563]	Time  0.016 ( 0.728)	Loss 5.2000e-01 (1.2296e+00)	Acc@1  81.25 ( 67.24)	Acc@5  93.75 ( 89.62)
Test: [  40/1563]	Time  0.018 ( 0.626)	Loss 1.3063e+00 (1.2341e+00)	Acc@1  68.75 ( 67.30)	Acc@5  93.75 ( 89.63)
Test: [  50/1563]	Time  0.018 ( 0.568)	Loss 3.4426e+00 (1.3470e+00)	Acc@1  34.38 ( 66.67)	Acc@5  62.50 ( 88.11)
Test: [  60/1563]	Time  0.018 ( 0.516)	Loss 2.0711e+00 (1.5121e+00)	Acc@1  50.00 ( 63.27)	Acc@5  78.12 ( 85.91)
Test: [  70/1563]	Time  0.016 ( 0.493)	Loss 2.5851e+00 (1.6045e+00)	Acc@1  53.12 ( 61.14)	Acc@5  68.75 ( 84.24)
Test: [  80/1563]	Time  0.539 ( 0.464)	Loss 1.9009e+00 (1.6847e+00)	Acc@1  65.62 ( 60.07)	Acc@5  75.00 ( 83.18)
Test: [  90/1563]	Time  0.016 ( 0.452)	Loss 2.1300e+00 (1.7213e+00)	Acc@1  46.88 ( 58.76)	Acc@5  78.12 ( 82.97)
Test: [ 100/1563]	Time  0.254 ( 0.446)	Loss 2.6598e+00 (1.8422e+00)	Acc@1  37.50 ( 56.40)	Acc@5  65.62 ( 81.16)
Test: [ 110/1563]	Time  0.393 ( 0.437)	Loss 1.4576e+00 (1.8606e+00)	Acc@1  56.25 ( 55.97)	Acc@5  84.38 ( 80.86)
Test: [ 120/1563]	Time  0.321 ( 0.425)	Loss 8.7615e-01 (1.8282e+00)	Acc@1  75.00 ( 56.53)	Acc@5  96.88 ( 81.40)
Test: [ 130/1563]	Time  0.033 ( 0.413)	Loss 5.4003e-01 (1.7993e+00)	Acc@1  84.38 ( 57.25)	Acc@5 100.00 ( 81.66)
```

MNIST is a handwritten-digit dataset and is much easier to train on. Go to the mnist/ folder and run `python3 main.py`; the dataset is downloaded automatically. Training 10 epochs takes about 2 minutes, and the validation set accuracy is 98.99%.