Jetson AGX Xavier에 CNTK 2.7 C++ 라이브러리 설치하기(TX2 추가)

05 Jul 2019

이 포스트에서는 Jetson AGX Xavier(이하 Xavier)에 CNTK 2.7 C++ 라이브러리를 설치하는 방법을 설명하겠습니다. 이 포스트는 공식 CNTK 사이트의 Setup CNTK on Linux 문서를 참조하였습니다. 또한 이 포스트는 Jetson TX1에 CNTK 2.5 설치하기의 후속 포스트입니다.

(2019-07-12 추가) Jetson TX2에서도 본 포스트에 적힌 방식대로 CNTK를 빌드하고 테스트할 수 있습니다. 단, JetPack 4.2 버전이 설치된 TX2에서만 가능합니다. 보다 자세한 내용은 포스트 하단을 참고해주세요.

Docker 이미지를 활용하는 방법 (CPU 버전)

Xavier 용 CNTK 2.7 C++ 라이브러리가 포함된 Docker 이미지를 여기에 올려놓았습니다. 단, 이 이미지에는 CPU 버전만 제공되어 있습니다. GPU 기능이 활성화된 버전을 이용하려면 nvidia-docker를 이용해야 하지만 nvidia-docker는 Jetson 계열이 사용하는 Tegra 플랫폼 GPU를 지원하지 않습니다. 따라서 Xavier에서 GPU 기능이 활성화된 버전을 이용하려면 CNTK를 직접 빌드해야 합니다.

Docker 이미지는 다음과 같이 받을 수 있습니다.

$ docker pull nglee/xavier_cntk:2.7_CPU_JetPack_4.2

Docker 이미지에 대한 추가 정보는 여기를 참고하세요. (Dockerfile은 여기)

직접 빌드하는 방법 (GPU 버전)

먼저 Xavier에 JetPack을 설치합니다. 포스트 작성 시점에서 최신 JetPack 버전인 4.2 버전이 설치되어 있다고 가정하겠습니다.

Xavier가 높은 성능을 낼 수 있도록 전력 소모 제한 옵션을 풀어줍니다.

$ sudo nvpmodel -m 0

apt를 이용해 필요한 라이브러리들을 설치해줍니다.

$ sudo apt update
$ sudo apt install -y cmake libbz2-dev curl gfortran libzip-dev

이제 CNTK 설치를 위한 사전작업으로 다음의 라이브러리들을 설치하겠습니다. Xavier의 저장소 크기가 제한되어 있으므로 각 라이브러리의 설치가 끝나면 라이브러리 소스들을 제거하는 것도 좋은 방법입니다.

OpenBLAS 0.3.6
Boost 1.70.0
CUB 1.8.0
Protobuf 3.1.0

위 라이브러리들을 설치한 후에 CNTK의 configure 스크립트가 위 라이브러리를 잘 찾을 수 있도록 symbolic link를 만드는 과정을 거치도록 하겠습니다. 여기까지 마치면 CNTK를 빌드할 수 있습니다.

OpenBLAS 0.3.6

$ git clone -q --depth 1 -b v0.3.6 https://github.com/xianyi/OpenBLAS.git
$ cd OpenBLAS
$ make TARGET=ARMV8 -j$(nproc)
$ sudo make install TARGET=ARMV8 PREFIX=/usr/local/OpenBLAS-0.3.6

Boost 1.70.0

$ git clone -q --depth 1 -b boost-1.70.0 --recursive -j $(nproc) https://github.com/boostorg/boost.git
$ cd boost
$ ./bootstrap.sh --prefix=/usr/local/boost-1.70.0
$ sudo ./b2 -d0 -j$(nproc) cxxflags="-std=c++14 -fPIC" linkflags="-Wl,-rpath,'\$ORIGIN'" install

CUB 1.8.0

$ wget https://github.com/NVlabs/cub/archive/1.8.0.zip
$ unzip ./1.8.0.zip
$ sudo cp -r cub-1.8.0 /usr/local

Protobuf 3.1.0

$ wget https://github.com/google/protobuf/archive/v3.1.0.tar.gz
$ tar xzf v3.1.0.tar.gz
$ cd protobuf-3.1.0
$ ./autogen.sh
$ ./configure CFLAGS=-fPIC CXXFLAGS=-fPIC --disable-shared --prefix=/usr/local/protobuf-3.1.0
$ make -j$(nproc)
$ sudo make install

추가 설정

$ sudo ln -s /usr/include/mpi/mpi.h /usr/include/mpi.h

$ sudo mkdir /usr/local/cudnn/cuda/include -p
$ sudo ln -s /usr/lib/aarch64-linux-gnu /usr/local/cudnn/lib
$ sudo ln -s /usr/include/aarch64-linux-gnu/cudnn_v7.h /usr/local/cudnn/cuda/include/cudnn.h

CNTK 2.7

이제 CNTK를 빌드하겠습니다.

$ git clone -q --depth 1 -b v2.7 --recursive -j $(nproc) https://github.com/Microsoft/CNTK.git
$ cd CNTK
$ mkdir build/release -p
$ cd build/release
$ ../../configure --asgd=no \
                  --cuda=yes \
                  --with-cuda=/usr/local/cuda \
                  --with-cudnn=/usr/local/cudnn \
                  --with-opencv=/usr \
                  --with-openblas=/usr/local/OpenBLAS-0.3.6 \
                  --with-boost=/usr/local/boost-1.70.0 \
                  --with-cub=/usr/local/cub-1.8.0 \
                  --with-protobuf=/usr/local/protobuf-3.1.0 \
                  --with-mpi=/usr
$ make -C ../../ \
       BUILD_TOP=$PWD \
       SSE_FLAGS='' \
       GENCODE_FLAGS='-gencode arch=compute_72,code=\"sm_72,compute_72\"' \
       all \
       -j $(nproc)

빌드가 완료되면 lib와 bin 디렉토리가 생성되고 그 밑에 CNTK 2.7 라이브러리와 실행파일이 위치하게 됩니다.

테스트

CNTK가 제대로 빌드되었는지 확인하기 위해 여기에 나온 테스트 스텝을 따라할 수 있습니다. 그런데 Jetson 계열이 사용하는 Tegra 플랫폼은 NVML(NVIDIA Management Library) 이용과 관련해서 조심해야 할 부분이 있습니다. 여기에 나와있듯이 NVML은 Tegra 플랫폼을 지원하지 않고, stub 라이브러리만 제공합니다. 테스트에 사용할 cntk라는 바이너리는 NVML을 필요로하기 때문에 stub 라이브러리라도 사용해서 동작하도록 설정하겠습니다.

$ sudo ln -s /usr/local/cuda/targets/aarch64-linux/lib/stubs/libnvidia-ml.so /usr/local/cuda/targets/aarch64-linux/lib/stubs/libnvidia-ml.so.1
$ sudo bash -c "echo /usr/local/cuda/targets/aarch64-linux/lib/stubs >> /etc/ld.so.conf.d/cntk.conf"
$ sudo ldconfig

먼저 다음과 같이 CPU로 CNTK를 돌려보겠습니다. 다음 명령어에서 [CNTK_SOURCE_BASE]는 위에서 git을 통해 받은 CNTK 소스코드 경로를 의미합니다.

$ export PATH=[CNTK_SOURCE_BASE]/build/release/bin:$PATH
$ cd [CNTK_SOURCE_BASE]/Tutorials/HelloWorld-LogisticRegression
$ cntk configFile=lr_bs.cntk makeMode=false

대략 다음과 같은 내용이 화면에 출력되면서 학습이 진행되어야 합니다.

... (앞 부분 생략) ...

##############################################################################
#                                                                            #
# Train command (train action)                                               #
#                                                                            #
##############################################################################


Model has 9 nodes. Using CPU.

Training criterion:   lr = Logistic
Evaluation criterion: err = SquareError

Training 3 parameters in 2 parameter tensors.

Finished Epoch[ 1 of 50]: [Training] lr = 0.31759290 * 1000; err = 0.09908521 * 1000; totalSamplesSeen = 1000; learningRatePerSample = 0.039999999; epochTime=0.323581s
Finished Epoch[ 2 of 50]: [Training] lr = 0.11039351 * 1000; err = 0.02357974 * 1000; totalSamplesSeen = 2000; learningRatePerSample = 0.039999999; epochTime=0.273736s
Finished Epoch[ 3 of 50]: [Training] lr = 0.08720607 * 1000; err = 0.01866767 * 1000; totalSamplesSeen = 3000; learningRatePerSample = 0.039999999; epochTime=0.259632s
Finished Epoch[ 4 of 50]: [Training] lr = 0.07586163 * 1000; err = 0.01674400 * 1000; totalSamplesSeen = 4000; learningRatePerSample = 0.039999999; epochTime=0.23207s
Finished Epoch[ 5 of 50]: [Training] lr = 0.06810059 * 1000; err = 0.01533284 * 1000; totalSamplesSeen = 5000; learningRatePerSample = 0.039999999; epochTime=0.239284s
Finished Epoch[ 6 of 50]: [Training] lr = 0.06305315 * 1000; err = 0.01422650 * 1000; totalSamplesSeen = 6000; learningRatePerSample = 0.039999999; epochTime=0.265162s
Finished Epoch[ 7 of 50]: [Training] lr = 0.06117695 * 1000; err = 0.01446881 * 1000; totalSamplesSeen = 7000; learningRatePerSample = 0.039999999; epochTime=0.235583s
Finished Epoch[ 8 of 50]: [Training] lr = 0.05866559 * 1000; err = 0.01376450 * 1000; totalSamplesSeen = 8000; learningRatePerSample = 0.039999999; epochTime=0.274448s

... (중간 생략) ...

Finished Epoch[49 of 50]: [Training] lr = 0.04173198 * 1000; err = 0.01124937 * 1000; totalSamplesSeen = 49000; learningRatePerSample = 0.039999999; epochTime=0.208398s
Finished Epoch[50 of 50]: [Training] lr = 0.04399341 * 1000; err = 0.01202173 * 1000; totalSamplesSeen = 50000; learningRatePerSample = 0.039999999; epochTime=0.214831s


##############################################################################
#                                                                            #
# Output command (write action)                                              #
#                                                                            #
##############################################################################

Minibatch[0]: ActualMBSize = 500
Written to LR.txt*
Total Samples Evaluated = 500


##############################################################################
#                                                                            #
# DumpNodeInfo command (dumpNode action)                                     #
#                                                                            #
##############################################################################

Warning: node name '__AllNodes__' does not exist in the network. dumping all nodes instead.


##############################################################################
#                                                                            #
# Test command (test action)                                                 #
#                                                                            #
##############################################################################

evalNodeNames are not specified, using all the default evalnodes and training criterion nodes.
Final Results: Minibatch[1-1]: err = 0.00685278 * 500; lr = 0.02953914 * 500

COMPLETED.

GPU로 CNTK를 돌리기 위해서는 다음과 같이 합니다.

$ cntk configFile=lr_bs.cntk makeMode=false deviceId=auto

CPU로 돌리는 예제와 비교해 보면 훨씬 빠르게 학습이 진행됨을 확인할 수 있을 것입니다.

(2019-07-12 추가) Jetson TX2에서 빌드하기

JetPack 4.2 버전이 설치되어 있다면 TX2도 위에서 언급된 Docker 이미지를 이용할 수 있습니다. 본 포스트의 해당 항목대로 동일하게 진행하면 됩니다. 또한 직접 빌드도 Xavier와 거의 동일하게 진행할 수 있습니다. 마지막에 CNTK를 빌드할 때 다음의 두 가지 부분만 고려하면 됩니다. 하나는 CUDA CC 값을 TX2에 맞게 6.2로 주는 것입니다. 다른 하나는 -j 옵션 값 조정인데 TX2에서는 -j 옵션을 $(nproc)으로 주었을 경우 컴파일 과정에서 에러가 납니다. (TX2의 코어 수는 6개) -j 옵션을 4로 낮추었더니 빌드가 잘 되었습니다. 결론적으로 CNTK 빌드 명령어를 다음과 같이 하니 잘 빌드되었습니다.

$ make -C ../../ \
       BUILD_TOP=$PWD \
       SSE_FLAGS='' \
       GENCODE_FLAGS='-gencode arch=compute_62,code=\"sm_62,compute_62\"' \
       all \
       -j 4

Show ChangeLog

Version	Description	Date
1.0	Publish	2019-07-05
1.1	Minor fixes	2019-07-08
1.2	Guide for TX2 added	2019-07-12

```
$ git --version
git version 2.17.1
$ g++ --version | head -1
g++ (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1) 7.4.0
$ docker --version
Docker version 18.06.1-ce, build e68fc7a
$ nvcc --version | tail -1
Cuda compilation tools, release 10.0, V10.0.166
```
```
$ cmake --version | head -1
cmake version 3.10.2
```

Pensieve Books, Programming