# Deep Learning with GPU Cores
This repository is the course material for the GWDG Course Deep Learning with GPU Cores. You can either take our course or go through the material at your own pace.
## What's in this repo?
The repo consists mainly of two folders:

- `slides/` for an introduction to deep learning, GPUs, and profiling
- `code/` for practical examples to be used on the cluster
## Helpful resources
The repository is self-contained. However, you can check out the following material for extra help.
- 🤔 Is this your first time on a cluster? Have a look at our article on Cluster Concepts.
- 🚀 Practical sessions to follow along using recorded videos.
- 👾 For future reference, you can always have a look at the GPU documentation.
- ⏱️ For reference, you can always look at the Slurm scheduler documentation.
The README also contains 💎 sections, which are optional and help you if you get stuck at a particular step. If you are doing this course independently and cannot solve an issue you encounter, please drop us a line in the form of an issue. We are here to help.
## Getting started
If you would like to run this code on the cluster, you need a cluster account.
- Are you taking the course? If you are participating in a course, please follow these instructions and complete 1. Logging in and 2. Setting up the Conda Environment in this repository.
- Doing this on your own? You can get an NHR account as described in our documentation and a KISSKI account at https://kisski.gwdg.de/en/leistungen/documentation/first-steps/. NHR accounts are available free of charge to all researchers based in Germany. KISSKI is open to companies and researchers from sensitive and critical infrastructures. More details on the procedure can be found in the slides.
### Logging in
To log into the cluster, complete the following steps.
- **Create an account**

  Create an account at https://academiccloud.de/. If you are working at a research institute or university, you might already have an account. In this case, you can skip this step.
- **Upload an ssh key and change the default Unix shell**

  - Create an ssh key on your computer (see the GitHub instructions). For security reasons, it is important that you specify a password when creating this key. The generated key consists of a private part and a `.pub` part. You will need the `.pub` part in the next step. A minimal key-creation sketch is shown at the end of this section.
  - Go to https://id.academiccloud.de/ and upload your key as a public ssh key (-> Security -> SSH Public Keys). It will take some time (~10 minutes) until this key is synchronized across the whole system.
  - Now navigate to -> Others -> "Unix default shell" and change the value from `/bin/ksh` to `/bin/bash`. Otherwise, the provided bash scripts and conda commands will not work properly.
- **Test access**

  Now you can test whether you have access to the system. The instructors of the course need to add you to the system, so this step only works once they have done so. If you did not previously have an AcademicCloud account, please write a mail to the course organizers; you need to be added to the project manually. You will receive an e-mail with your user name once you have been added to the project.

  Open a shell and adapt the following command:

  ```bash
  ssh -i .ssh/your-key your-username@glogin-gpu.hpc.gwdg.de
  ```

  - Replace `your-username` with the username you received via mail.
  - Replace `.ssh/your-key` with the file path to your newly created ssh key.

  You are now on the frontend.
💎 If ssh complains that your identity file is not found, double-check that the path you provided exists and contains your generated key.
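As mentioned in the key-creation step above, generating an ssh key could, for example, look like the following sketch; the key type and file name are only examples, any supported type and name will do:

```bash
# generate an ed25519 key pair; you will be prompted for a password
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_academiccloud

# print the public (.pub) part, which is the one you upload at https://id.academiccloud.de/
cat ~/.ssh/id_ed25519_academiccloud.pub
```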
### Setting up the Conda Environment
If you prefer to set up your environment via a container, have a look at our blog article Declutter your Python environments. This approach will also be demonstrated in a separate section of our workshop.
The default way to run Python programs on the cluster is through conda environments. This ensures that your projects do not interfere with each other and that a minimum level of reproducibility is achieved. If you are taking the course with us, please set up this environment in advance, as the installations may take some time.
To set up the conda environment, first log into the frontend. Then create a conda environment and install the required packages listed in `code/requirements.txt` as follows.

Clone this repository and create a new conda environment:

```bash
git clone https://gitlab-ce.gwdg.de/hpc-team-public/deep-learning-with-gpu-cores.git
cd deep-learning-with-gpu-cores/code
module load miniconda3
conda create -n dl-gpu python=3.8
```

This creates an environment called `dl-gpu` with Python version 3.8. If this is your first session on the cluster using miniconda3, you may have to initialize conda via:

```bash
conda init
source ~/.bashrc
```
Now we need to install all packages. They are listed in `code/requirements.txt`.

Install all dependencies in the environment. To do so, activate the environment and install the requirements with these commands:

```bash
source activate dl-gpu
pip install -r requirements.txt
```
💎 If `source activate dl-gpu` does not work, either (1) use `conda activate dl-gpu`, or (2) run `conda init bash`, log out, and log in again; conda should then be initialized and the activation should work.
💎 If you get an error similar to `ERROR: No matching distribution found for ...`, one solution is to relax the requirements. We pinned the exact version of each library. You can open the file with `nano code/requirements.txt` and remove the version number for a package, so that instead of `package==versionnumber` only `package` remains. Not requiring a specific version avoids compatibility issues during installation, but it might also break things later on (some code requires a specific version, so this can take a little experimenting).
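As a sketch, relaxing a pin and checking which version was actually installed could look like this; `somepackage` is a placeholder, not an actual entry of `code/requirements.txt`:

```bash
# in code/requirements.txt, change a line such as
#   somepackage==1.2.3
# to
#   somepackage
nano code/requirements.txt

# re-run the installation and check which version pip picked
pip install -r requirements.txt
pip show somepackage
```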
💎 If there are still problems with your conda setup, you can fall back to a shared environment by using the following line in `submit_train.sh` and `submit_test.sh` instead of `conda activate dl-gpu`:

```bash
conda activate $PROJECT/conda/dl-gpu
```
## How to run the code
- Are you taking the course? If so, please stop your preparations at this point. 🎉 We'll cover everything that follows live!
- Doing this on your own? For the following steps, you can also watch the recorded videos and follow along. We go into more detail in the recorded session, but you will find the most important steps here.
### Run training scripts
All jobs on the cluster have to go through a scheduler. To start the training of the neural network, run the script `submit_train.sh` in the code directory `YOUR_PATH/deep-learning-with-gpu-cores/code/` using the cluster scheduler Slurm:

```bash
cd YOUR_PATH/deep-learning-with-gpu-cores/code/
sbatch --reservation=deep-learning-workshop submit_train.sh
```

An output file as well as an error file will be generated in the `slurm_files` directory.
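For orientation, a submission script of this kind typically looks roughly like the sketch below. It is assembled from directives mentioned elsewhere in this README (partition, GPU request, module loads) and is not a verbatim copy of the repository's `submit_train.sh`; in particular, the time limit and output file names are assumptions.

```bash
#!/bin/bash
#SBATCH --job-name=train-nn-gpu          # job name used in the example outputs below
#SBATCH -p grete:shared                  # GPU partition on Grete
#SBATCH -G A100:1                        # request one A100 GPU
#SBATCH --time=01:00:00                  # assumed time limit
#SBATCH --output=slurm_files/%x-%j.out   # assumed output file in slurm_files/
#SBATCH --error=slurm_files/%x-%j.err    # assumed error file in slurm_files/

module load miniconda3
module load cuda
nvcc -V                                  # log the CUDA compiler version

source activate dl-gpu                   # the conda environment created above
python train.py                          # or train_with_logger.py for profiling (see below)
```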
💎 A more general introduction to Slurm can be found in the HLRN documentation and in the official Slurm scheduler documentation. You can re-read why we use the Slurm scheduler at Cluster Concepts.
### Run inference scripts
To test your models, you can simply run:

```bash
sbatch --reservation=deep-learning-workshop submit_test.sh
```

By default, the loaded PointNet model is

```python
load_model_path = "./model_test.pt"
```

You can change the path and test any model from the `saved_models` directory.
## Container for reliable and reproducible software environments

There is a separate README under `./code/container/README.md`.
## FYI: Where the data lives

Where does the data live that is accessed during training?

Log into the frontend node `glogin9`. The data for the course lives in `/scratch/projects/workshops/gpu-workshop/`. There are two folders:

- `synthetic_trees_full_resolution`: the synthetic data in the original format
- `synthetic_trees_ten_sampled`: a downsampled version of the synthetic data for the neural network

This data is reachable from all Grete GPU nodes and from the login node `glogin9`. It is sufficient for running all example workflows of this repository.
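To get a first look at the data, you can list the directories after logging into `glogin9`, for example:

```bash
# list the course data folders on the scratch file system
ls /scratch/projects/workshops/gpu-workshop/

# peek into the downsampled data set used by the neural network
ls /scratch/projects/workshops/gpu-workshop/synthetic_trees_ten_sampled | head
```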
Our use case is the pre-training of a PointNet[^1] model using synthetic lidar data of forests. The data was generated with SynForest, a workflow developed by us in the project ForestCare. Using SynForest, you can easily generate lidar data of forests that fits your requirements.
## How to use your own data

How can you upload your own data?

When you build your own deep learning workflows on the cluster, you might want to upload data to and download data from the cluster. For this, you can use the tool `rsync`, for instance:

- To copy the folder `deep-learning-with-gpu-cores` from the cluster to the current directory on your local PC, run on your local PC:

  ```bash
  rsync -rp USERNAME@glogin.hlrn.de:/home/USERNAME/deep-learning-with-gpu-cores .
  ```

  where `USERNAME` is your username on Emmy.

- To copy the file `README.md` from your local PC to your home folder on the cluster, run on your local PC:

  ```bash
  scp -rp README.md USERNAME@glogin.hlrn.de:/home/USERNAME/
  ```
💎 If you have not configured your ssh key in your `~/.ssh/config` file, you might get a permission denied error. In this case, pass your ssh key explicitly, e.g. via the `-i path_to_your_ssh_key` flag for `scp` (see the sketch below).
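As a sketch, passing the key explicitly could look like this, assuming the key is stored at `~/.ssh/your-key` (adjust path and username):

```bash
# scp accepts the key directly via -i
scp -rp -i ~/.ssh/your-key README.md USERNAME@glogin.hlrn.de:/home/USERNAME/

# rsync passes the key through the underlying ssh command via -e
rsync -rp -e "ssh -i ~/.ssh/your-key" \
    USERNAME@glogin.hlrn.de:/home/USERNAME/deep-learning-with-gpu-cores .
```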
💎 Make sure that the paths are correct. You can always run `pwd` in the respective folders and copy-and-paste the resulting directories into the command.
## How to run the Profiler
⚠️ This section is no longer part of the "Deep Learning with GPU Cores" course, but is now part of the "Performance Analysis of AI and HPC Workloads" course.
You can profile the code to see in which functions the execution spends its time. In both Slurm scripts, `code/submit_test.sh` and `code/submit_train.sh`, you can uncomment the last line, so that instead of running `train.py` you run `train_with_logger.py`.
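Schematically, the end of a submission script would then change as sketched below; the exact lines in the repository's scripts may differ slightly:

```bash
# default run (now commented out):
# python train.py

# profiled run (uncommented):
python train_with_logger.py
```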
### Set up the Profiler
Please make sure to install the PyTorch Profiler TensorBoard Plugin on your local machine with `pip install torch_tb_profiler`. This is used for visualisation of the profiling results.
### Download and visualize profiling results of an example run

You can download and visualize the profiling results as follows.
```bash
# download example profiling output
rsync -avvH <user>@glogin9.hlrn.de:/scratch/projects/workshops/gpu-workshop/profiling_example ~/Downloads
cd ~/Downloads/profiling_example/train-nn-gpu_4464217

# open the PyTorch Profiler's output with TensorBoard
tensorboard --logdir=./profiler

# open DeepSpeed's output with a text editor
vim slurm-train-nn-gpu-4464217.out
```
### Profiling: further reading
A typical bottleneck for deep learning applications is data loading. An improved approach to data loading in our use case using DASK is described in our blog article Parallel 3D Point Cloud Data analysis with Dask.
## Run on the SCC cluster
This section is only needed if you are running the workflow on the SCC cluster instead of the Grete cluster.
The steps for running on the SCC cluster are similar to those for the Grete cluster. However, minor changes have to be made to `submit_train.sh` in the code directory `YOUR_PATH/deep-learning-with-gpu-cores/code/` (a sketch of the adapted script follows after the list):

- Change `#SBATCH -p grete:shared` to `#SBATCH -p gpu`.
- Change `#SBATCH -G A100:1` to `#SBATCH -G 1`.
- Change `module load cuda` to `source $ANACONDA3_ROOT/etc/profile.d/conda.sh`.
- Change `source activate dl-gpu` to `conda activate dl-gpu`.
- Remove `nvcc -V`.
- Make sure to execute the `submit_train.sh` script while the conda environment is deactivated on the SCC cluster.
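For orientation, the adapted script could start roughly as follows. This is a sketch assembled from the changes listed above, not a verbatim copy of the repository's script; the job name, time limit, and output path are assumptions.

```bash
#!/bin/bash
#SBATCH --job-name=train-nn-gpu          # assumed job name
#SBATCH -p gpu                           # instead of '#SBATCH -p grete:shared'
#SBATCH -G 1                             # instead of '#SBATCH -G A100:1'
#SBATCH --time=01:00:00                  # assumed time limit
#SBATCH --output=slurm_files/%x-%j.out   # assumed output file

# instead of 'module load cuda' (with 'nvcc -V' removed):
source $ANACONDA3_ROOT/etc/profile.d/conda.sh
# instead of 'source activate dl-gpu':
conda activate dl-gpu

python train.py
```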
## Contact and Citation
If you have any questions or encounter any issues, please contact us: hauke.kirchner@gwdg.de.
If you use this material, please cite it as follows:

```bibtex
@software{deeplearningwithgpucores2023,
  author  = {Sommer, Dorothea and Kirchner, Hauke and Meisel, Tino},
  month   = {04},
  title   = {{Deep Learning with GPU Cores}},
  url     = {https://gitlab-ce.gwdg.de/hpc-team-public/deep-learning-with-gpu-cores/},
  version = {1.0},
  year    = {2023}
}
```
[^1]: Qi et al. (2017) PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation (https://github.com/charlesq34/pointnet)