Training Imagenet on Ubuntu 18.04
This post walks through setting up Ubuntu 18.04 to train a CNN model on ImageNet. The steps are:
- Install CUDA
- Install CUDNN
- Install pytorch
- Prepare imagenet dataset
- Training
To save time, start the ImageNet download (step 4) first, since it can run while you work through the other steps.
Install CUDA
There is no need to update the graphics card driver beforehand; the CUDA installation updates it. Partially following this blog post, my commands are:
$ wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
$ sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub
$ sudo apt-get update
$ sudo apt-get -y install cuda
Then I add
# CUDA Config - ~/.bashrc
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
to ~/.bashrc, and run source ~/.bashrc to apply the changes.
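To confirm the installation, one option is a quick check like the sketch below. `check_tool` is a helper name made up for this example; `nvcc` comes with the CUDA toolkit and `nvidia-smi` with the driver. The sketch assumes it runs in a new shell, or after ~/.bashrc has been sourced.

```shell
# check_tool NAME: report whether a command is on PATH (hypothetical helper).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# nvcc ships with the CUDA toolkit; its release line should mention 10.1.
if check_tool nvcc; then
    nvcc --version || echo "nvcc is on PATH but failed to run"
fi

# nvidia-smi ships with the driver; it lists the GPU and the driver version.
if check_tool nvidia-smi; then
    nvidia-smi || echo "nvidia-smi could not talk to the driver"
fi
```

If either tool is reported missing, the PATH line in ~/.bashrc is the first thing to recheck.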
Install CUDNN
Go to https://developer.nvidia.com/rdp/cudnn-download, register, and download all three .deb files for Ubuntu 18.04: the runtime library, the developer library, and the code samples.
Go to your download folder and install them in that order:
$ sudo dpkg -i libcudnn7_7.5.0.56-1+cuda10.0_amd64.deb
(the runtime library),
$ sudo dpkg -i libcudnn7-dev_7.5.0.56-1+cuda10.0_amd64.deb
(the developer library),
$ sudo dpkg -i libcudnn7-doc_7.5.0.56-1+cuda10.0_amd64.deb
(the code samples).
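Note that the packages above are tagged cuda10.0, while the toolkit installed earlier is 10.1. cuDNN is published separately for each CUDA version, so it may be worth checking that the two agree. A minimal sketch follows; `cudnn_cuda_version` is a hypothetical helper that only parses the file name.

```shell
# cudnn_cuda_version FILE: print the CUDA version encoded in a cuDNN .deb name.
# Hypothetical helper; it inspects the file name only, not the package itself.
cudnn_cuda_version() {
    v=${1#*+cuda}      # drop everything up to and including "+cuda"
    echo "${v%%_*}"    # drop the "_amd64.deb" suffix
}

cudnn_cuda_version libcudnn7_7.5.0.56-1+cuda10.0_amd64.deb   # prints 10.0

# List the cuDNN packages that are actually installed (Debian/Ubuntu only).
if command -v dpkg >/dev/null 2>&1; then
    dpkg -l | grep libcudnn || echo "no libcudnn packages installed"
fi
```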
Install pytorch
Ubuntu 18.04 ships with Python 3.6, so we will work with that.
Create a virtual environment
$ sudo apt-get install -y python3-venv
$ mkdir environments
$ cd environments
$ python3 -m venv pytorch-env
$ source pytorch-env/bin/activate # use environment
On success, the prompt should look like
(pytorch-env) user-name@user-pc:~/environments$
Install PyTorch
If pip3 is not installed, install it with
$ sudo apt-get install -y python3-pip
then
$ pip3 install torch torchvision
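Before moving on, it can help to check that PyTorch imports and sees the GPU. The sketch below assumes the virtual environment is active; `torch_status` is a hypothetical helper, while `torch.__version__` and `torch.cuda.is_available()` are regular PyTorch calls.

```shell
# torch_status: print the PyTorch version and whether CUDA is usable,
# or a short message if torch cannot be imported (hypothetical helper).
torch_status() {
    if python3 -c "import torch" >/dev/null 2>&1; then
        python3 -c "import torch; print('torch', torch.__version__); print('CUDA available:', torch.cuda.is_available())"
    else
        echo "torch is not importable from this python3"
    fi
}

torch_status
```

If CUDA is reported as unavailable, the driver check with nvidia-smi is a reasonable place to start looking.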
Prepare imagenet dataset
The official ImageNet site goes down periodically, so the fastest and easiest way to download the ILSVRC2012 dataset is from http://academictorrents.com/browse.php?search=imagenet. There are 1.2 million images in 1000 categories, totalling around 150GB.
To batch-extract the archives and move the files into the correct folders, use a combination of the following scripts:
extract.sh
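# Extract every .tar in the current directory into a folder named after it.
for a in *.tar; do
    [ -e "$a" ] || continue   # nothing to do if no .tar files match
    mkdir -p "${a%.*}"
    tar -xvf "$a" -C "${a%.*}"
done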
extractAndRemove.sh
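# Extract every .tar into its own folder, deleting each archive only if
# its extraction succeeded.
for a in *.tar; do
    [ -e "$a" ] || continue   # nothing to do if no .tar files match
    mkdir -p "${a%.*}"
    tar -xvf "$a" -C "${a%.*}" && rm "$a"
done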
removeTars.sh
# Delete every .tar in the current directory.
for a in *.tar; do
    rm -f "$a"
done
This script helps to put validation images into folders. The resultant folder structure should look like
```
ILSVRC2012_folder
    train
        n01440764
        n01443537
        ...
    val
        n01440764
        n01443537
        ...
```
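For illustration, sorting the validation images can be sketched as follows. This is not the script linked above: it assumes a hypothetical mapping file with one `<filename> <wnid>` pair per line, and both `sort_val` and `count_classes` are names made up for this example.

```shell
# sort_val DIR MAPPING: move each image listed in MAPPING into DIR/<wnid>/.
# MAPPING is an assumed text file with one "<filename> <wnid>" pair per line.
sort_val() {
    dir=$1
    mapping=$2
    while read -r file wnid; do
        # skip blank lines and images that are not in DIR
        if [ -z "$file" ] || [ ! -f "$dir/$file" ]; then
            continue
        fi
        mkdir -p "$dir/$wnid"
        mv "$dir/$file" "$dir/$wnid/"
    done < "$mapping"
}

# count_classes DIR: print the number of immediate subdirectories of DIR.
count_classes() {
    n=0
    for d in "$1"/*/; do
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example usage (the mapping file name is hypothetical):
#   sort_val ILSVRC2012_folder/val val_labels.txt
#   count_classes ILSVRC2012_folder/train
#   count_classes ILSVRC2012_folder/val
```

With 1000 categories, both counts should come out as 1000 once everything is in place.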
## Training
Following [this post](https://zhuanlan.zhihu.com/p/67919205), clone https://github.com/pytorch/examples and run the training script in the imagenet/ folder:
```bash
python3 main.py -a resnet18 ILSVRC2012_folder -b 32 |& tee train_log.txt
```
My GTX 1080 Ti could not handle a batch size greater than 32. On this hardware, each epoch takes 7.37 hours. After 34 epochs, the accuracy is:
Training: Acc@1 44.70 Acc@5 69.38
Validation: Acc@1 51.64 Acc@5 77.35
The best model is stored in model_best.pth.tar. The script writes only one other checkpoint file, checkpoint.pth.tar, which it overwrites after every epoch.
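With epochs this long, being able to continue an interrupted run matters. The imagenet example's main.py accepts a `--resume` option that takes a checkpoint path. The sketch below only prints the command to run, and assumes the latest checkpoint is checkpoint.pth.tar in the current directory; `build_train_cmd` is a hypothetical helper.

```shell
# build_train_cmd CHECKPOINT: print the training command, adding --resume
# when the CHECKPOINT file exists (hypothetical helper for illustration).
build_train_cmd() {
    cmd="python3 main.py -a resnet18 ILSVRC2012_folder -b 32"
    if [ -f "$1" ]; then
        cmd="$cmd --resume $1"
    fi
    echo "$cmd"
}

# Show the command for a checkpoint in the current directory.
build_train_cmd checkpoint.pth.tar
```

Printing the command first makes it easy to inspect before committing to another multi-hour run.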
Here’s an example training log
Epoch: [34][39930/40037] Time 2.353 ( 0.646) Data 2.326 ( 0.604) Loss 2.3454e+00 (2.5339e+00) Acc@1 50.00 ( 44.70) Acc@5 71.88 ( 69.39)
Epoch: [34][39940/40037] Time 0.049 ( 0.646) Data 0.000 ( 0.604) Loss 2.7097e+00 (2.5339e+00) Acc@1 43.75 ( 44.70) Acc@5 68.75 ( 69.39)
Epoch: [34][39950/40037] Time 2.296 ( 0.646) Data 2.272 ( 0.604) Loss 2.1355e+00 (2.5339e+00) Acc@1 59.38 ( 44.70) Acc@5 75.00 ( 69.39)
Epoch: [34][39960/40037] Time 0.050 ( 0.646) Data 0.000 ( 0.604) Loss 3.1379e+00 (2.5339e+00) Acc@1 40.62 ( 44.70) Acc@5 59.38 ( 69.39)
Epoch: [34][39970/40037] Time 2.179 ( 0.646) Data 2.152 ( 0.604) Loss 2.8691e+00 (2.5340e+00) Acc@1 46.88 ( 44.70) Acc@5 56.25 ( 69.39)
Epoch: [34][39980/40037] Time 0.050 ( 0.646) Data 0.000 ( 0.604) Loss 1.9133e+00 (2.5339e+00) Acc@1 53.12 ( 44.70) Acc@5 81.25 ( 69.39)
Epoch: [34][39990/40037] Time 2.397 ( 0.646) Data 2.374 ( 0.604) Loss 2.7276e+00 (2.5340e+00) Acc@1 50.00 ( 44.70) Acc@5 56.25 ( 69.39)
Epoch: [34][40000/40037] Time 0.050 ( 0.646) Data 0.000 ( 0.604) Loss 1.9445e+00 (2.5339e+00) Acc@1 53.12 ( 44.70) Acc@5 78.12 ( 69.39)
Epoch: [34][40010/40037] Time 0.894 ( 0.646) Data 0.869 ( 0.604) Loss 2.4231e+00 (2.5340e+00) Acc@1 43.75 ( 44.70) Acc@5 68.75 ( 69.39)
Epoch: [34][40020/40037] Time 0.507 ( 0.646) Data 0.485 ( 0.604) Loss 2.1194e+00 (2.5339e+00) Acc@1 56.25 ( 44.70) Acc@5 75.00 ( 69.39)
Epoch: [34][40030/40037] Time 0.960 ( 0.646) Data 0.922 ( 0.604) Loss 2.4572e+00 (2.5340e+00) Acc@1 40.62 ( 44.70) Acc@5 75.00 ( 69.38)
Test: [ 0/1563] Time 13.309 (13.309) Loss 1.4311e+00 (1.4311e+00) Acc@1 65.62 ( 65.62) Acc@5 90.62 ( 90.62)
Test: [ 10/1563] Time 0.016 ( 1.437) Loss 1.8978e+00 (1.5550e+00) Acc@1 34.38 ( 55.40) Acc@5 90.62 ( 88.35)
Test: [ 20/1563] Time 0.016 ( 0.883) Loss 7.2487e-01 (1.3800e+00) Acc@1 84.38 ( 63.39) Acc@5 93.75 ( 88.69)
Test: [ 30/1563] Time 0.016 ( 0.728) Loss 5.2000e-01 (1.2296e+00) Acc@1 81.25 ( 67.24) Acc@5 93.75 ( 89.62)
Test: [ 40/1563] Time 0.018 ( 0.626) Loss 1.3063e+00 (1.2341e+00) Acc@1 68.75 ( 67.30) Acc@5 93.75 ( 89.63)
Test: [ 50/1563] Time 0.018 ( 0.568) Loss 3.4426e+00 (1.3470e+00) Acc@1 34.38 ( 66.67) Acc@5 62.50 ( 88.11)
Test: [ 60/1563] Time 0.018 ( 0.516) Loss 2.0711e+00 (1.5121e+00) Acc@1 50.00 ( 63.27) Acc@5 78.12 ( 85.91)
Test: [ 70/1563] Time 0.016 ( 0.493) Loss 2.5851e+00 (1.6045e+00) Acc@1 53.12 ( 61.14) Acc@5 68.75 ( 84.24)
Test: [ 80/1563] Time 0.539 ( 0.464) Loss 1.9009e+00 (1.6847e+00) Acc@1 65.62 ( 60.07) Acc@5 75.00 ( 83.18)
Test: [ 90/1563] Time 0.016 ( 0.452) Loss 2.1300e+00 (1.7213e+00) Acc@1 46.88 ( 58.76) Acc@5 78.12 ( 82.97)
Test: [ 100/1563] Time 0.254 ( 0.446) Loss 2.6598e+00 (1.8422e+00) Acc@1 37.50 ( 56.40) Acc@5 65.62 ( 81.16)
Test: [ 110/1563] Time 0.393 ( 0.437) Loss 1.4576e+00 (1.8606e+00) Acc@1 56.25 ( 55.97) Acc@5 84.38 ( 80.86)
Test: [ 120/1563] Time 0.321 ( 0.425) Loss 8.7615e-01 (1.8282e+00) Acc@1 75.00 ( 56.53) Acc@5 96.88 ( 81.40)
Test: [ 130/1563] Time 0.033 ( 0.413) Loss 5.4003e-01 (1.7993e+00) Acc@1 84.38 ( 57.25) Acc@5 100.00 ( 81.66)
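The numbers in the log can be cross-checked with a little arithmetic. The sketch below assumes the ILSVRC2012 training set holds 1,281,167 images (the "1.2 million" mentioned earlier) and uses the 0.646 s average batch time from the log; `last_train_acc1` is a hypothetical helper that reads the running-average Acc@1 from the last training line of a log file.

```shell
# last_train_acc1 LOG: print the running-average Acc@1 (the value in
# parentheses) from the last "Epoch:" line of LOG. Hypothetical helper.
last_train_acc1() {
    grep '^Epoch:' "$1" | tail -n 1 |
        sed -n 's/.*Acc@1 *[0-9.]* *( *\([0-9.]*\)).*/\1/p'
}

# Iterations per epoch at batch size 32, rounding up the last partial batch.
# 1281167 is the assumed number of training images.
iters=$(( (1281167 + 32 - 1) / 32 ))
echo "iterations per epoch: $iters"

# Training time per epoch, from the 0.646 s average batch time in the log.
awk -v n="$iters" 'BEGIN { printf "training hours per epoch: %.2f\n", n * 0.646 / 3600 }'

# Example usage on the log written by tee:
#   last_train_acc1 train_log.txt
```

This gives 40037 iterations, matching the log, and about 7.18 hours of training per epoch; the gap to the 7.37 hours reported above is presumably the validation pass and other overhead.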
MNIST is a handwritten-digit dataset, and training on it is much quicker. Go to the mnist folder and run python3 main.py; the dataset is downloaded automatically. Ten epochs train in about 2 minutes, and the validation set accuracy is 98.99%.