Deep Learning with GPU Cores

This repository is the course material for the GWDG Course Deep Learning with GPU Cores. You can either take our course or go through the material at your own pace.

What's in this repo?

The repo consists mainly of two folders:

  • slides/ for introductions to deep learning, GPUs, and profiling
  • code/ for practical examples to be used on the cluster

Helpful resources

The repository is self-contained. However, you can check out the following material for extra help.

The README also contains 💎 sections, which are optional and meant to help you if you are stuck at a particular step. If you do this course independently and cannot solve an issue you encounter, please drop us a line in the form of an issue. We are here to help.

Getting started

If you would like to run this code on the cluster, you need a cluster account.

  • Are you taking the course? If so, please follow these instructions and complete 1. Logging in and 2. Setting up the Conda Environment in this repository.

  • Doing this on your own? You can get an NHR account as described in our documentation and a KISSKI account at https://kisski.gwdg.de/en/leistungen/documentation/first-steps/. NHR accounts are free for all researchers based in Germany; KISSKI is open to companies and to researchers from sensitive and critical infrastructures. More details on the procedure can be found in the slides.

Logging in

To log into the cluster, complete the following steps.

  1. Create an account

Create an account at https://academiccloud.de/. If you are working at a research institute or university, you might already have an account. In this case you can skip this step.

  2. Upload an SSH key and change the default Unix shell
  • Create an SSH key on your computer (see GitHub's instructions); an example command follows this list. For security reasons, it is important that you specify a passphrase when creating this key. The generated key consists of a private part and a .pub part. You will need the .pub part in the next step.

  • Go to https://id.academiccloud.de/ and upload your key as a public SSH key (-> Security -> SSH Public Keys). It will take some time (~10 minutes) until this key is synchronized across the whole system.

  • Now, navigate to -> Others -> "Unix default shell" and change the value from /bin/ksh to /bin/bash. Otherwise, the provided bash scripts and conda commands will not work properly.
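
A minimal sketch of the key-generation step; the key type and file name below are only examples, not requirements:

# generate a new key pair; set a passphrase when prompted
# the file name ~/.ssh/id_ed25519_hpc is a placeholder -- pick any name you like
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_hpc
# the public part to upload in the next step is the .pub file
cat ~/.ssh/id_ed25519_hpc.pub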

  3. Test access

Now you can test whether you have access to the system. The instructors of the course need to add you to the system, so this step only works once they have done so. If you did not previously have an AcademicCloud account, please write an e-mail to the course organizers so you can be added to the project manually. You will receive an e-mail with your user name once you have been added to the project.

Open a shell and adapt the following command:

ssh -i .ssh/your-key your-username@glogin-gpu.hpc.gwdg.de
  • replace your-username with the username you received via e-mail
  • replace .ssh/your-key with the file path to your newly created SSH key

You are now on the frontend.

💎 If you get an error that the identity file (your key) is not found, double-check that the path you provided exists and contains your generated key.
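
💎 Optionally, you can avoid typing -i on every connection by adding a host entry to your SSH configuration. A minimal sketch, using the placeholder key name from above; adjust the user name and key path to your own:

# append a host entry to ~/.ssh/config (user name and key path are placeholders)
cat >> ~/.ssh/config <<'EOF'
Host glogin-gpu.hpc.gwdg.de
    User your-username
    IdentityFile ~/.ssh/id_ed25519_hpc
EOF
# afterwards, a plain "ssh glogin-gpu.hpc.gwdg.de" is enough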

Setting up the Conda Environment

If you prefer to set up your environment via a container, have a look at our blog article Declutter your Python environments. This approach will also be demonstrated in a separate section of our workshop.

The default way to run Python programs on the cluster is through conda environments. This ensures that your projects do not interfere with each other and that a minimum level of reproducibility is achieved. If you are taking the course with us, please set up this environment in advance, as the installations may take some time.

To set up the conda environment, first log into the frontend. Then create a conda environment and install the required packages listed in code/requirements.txt as follows:

Clone this repository and create a new conda environment.
git clone https://gitlab-ce.gwdg.de/hpc-team-public/deep-learning-with-gpu-cores.git
cd deep-learning-with-gpu-cores/code

module load miniconda3
conda create -n dl-gpu python=3.8

This creates an environment called dl-gpu with Python version 3.8. If this is your first session on the cluster using miniconda3, you may have to initialize conda first:

conda init 
source ~/.bashrc
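
You can check that conda is initialized and that the new environment exists:

# list all conda environments; dl-gpu should appear in the output
conda env list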

Now install all packages; they are listed in code/requirements.txt.

Install all dependencies in the environment.

To do so, activate the environment and install the requirements with these commands:

source activate dl-gpu
pip install -r requirements.txt
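
As a quick sanity check (assuming PyTorch is among the pinned requirements, which the training scripts suggest), verify that the installation can be imported:

# print the installed PyTorch version from inside the activated dl-gpu environment
python -c "import torch; print(torch.__version__)"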

💎 If source activate dl-gpu does not work, either (1) use conda activate dl-gpu, or (2) run conda init bash, log out, and log back in; conda should then be initialized and work.

💎 If you get an error similar to ERROR: No matching distribution found for ..., one solution is to relax the requirements. We pinned the exact version of each library. Open code/requirements.txt (for example with nano code/requirements.txt) and remove the version number for the affected package, so that instead of package==versionnumber only package remains. Dropping the pin avoids compatibility issues during installation, but it might also break things later on (some code requires a specific version, so a little experimenting with versions may be needed).

💎 If there are still problems with your conda setup, you can fall back to a shared environment by using the following line in submit_train.sh and submit_test.sh instead of conda activate dl-gpu:

conda activate $PROJECT/conda/dl-gpu

How to run the code

  • Are you taking the course? If so, please stop your preparations at this point. 🎉 We'll cover everything that follows live!

  • Doing this on your own? For the following steps, you can also watch the recorded videos and follow along. We go into more detail in the recorded sessions, but you will find the most important steps here.

Run training scripts

All jobs on the cluster have to go through a scheduler. To start the training of the neural network, submit the script submit_train.sh from the code directory YOUR_PATH/deep-learning-with-gpu-cores/code/ via the cluster scheduler Slurm:

cd YOUR_PATH/deep-learning-with-gpu-cores/code/
sbatch --reservation=deep-learning-workshop submit_train.sh

An output as well as an error file will be generated in the slurm_files directory.
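
For orientation, here is a rough sketch of what submit_train.sh contains on Grete. Only the partition, the GPU request, module load cuda, nvcc -V, the activation line, and the training script name are taken from this README (compare the SCC section below); all other directives and values are assumptions:

#!/bin/bash
#SBATCH -p grete:shared              # GPU partition on Grete
#SBATCH -G A100:1                    # request one A100 GPU
#SBATCH -t 01:00:00                  # walltime -- value is an assumption
#SBATCH -o slurm_files/%x-%j.out     # stdout file -- exact pattern is an assumption
#SBATCH -e slurm_files/%x-%j.err     # stderr file -- exact pattern is an assumption

module load miniconda3               # assumption: make conda available inside the job
module load cuda
nvcc -V                              # print the CUDA compiler version

source activate dl-gpu               # activate the environment created above
python train.py                      # the training script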

💎 A more general introduction to Slurm can be found in the HLRN documentation and in the official Slurm scheduler documentation. Why we use the Slurm scheduler is explained in Cluster Concepts.

Run inference scripts

To test your models, simply run:

sbatch --reservation=deep-learning-workshop submit_test.sh

By default, the loaded PointNet model is

load_model_path = "./model_test.pt"

You can change the path and test any model from the saved_models directory.

Container for reliable and reproducible software environments

There is a separate README at ./code/container/README.md.

FYI: Where the data lives

Where does the data live that is accessed in the training?

Log into the frontend node glogin9. The data for the course lives in /scratch/projects/workshops/gpu-workshop/. There are two folders:

  • synthetic_trees_full_resolution: the synthetic data in its original format
  • synthetic_trees_ten_sampled: a downsampled version of the synthetic data for the neural network

This data is reachable from all Grete GPU nodes and from the login node glogin9. It is sufficient for running all example workflows of this repository.
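
For example, you can list the two folders directly from the login node:

# run on glogin9: list the course data on the shared scratch file system
ls /scratch/projects/workshops/gpu-workshop/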

Our use case is the pre-training of a PointNet1 model using synthetic lidar data of forests. The data was generated with SynForest, a workflow developed by us in the project ForestCare. Using SynForest you can easily generate lidar data of forests that fits your requirements.

How to use your own data

How can you upload your own data?

When you build your own deep learning workflows on the cluster, you might want to upload data to and download data from the cluster. For this, you can use the tool rsync, for instance:

  • To copy the folder deep-learning-with-gpu-cores from your home directory on the cluster to your current directory on your local PC, run on your local PC:
rsync -rp USERNAME@glogin.hlrn.de:/home/USERNAME/deep-learning-with-gpu-cores .

where USERNAME is your username on Emmy.

  • To copy the file README.md from your local PC to your home folder on the cluster, run on your local PC:
scp -rp README.md USERNAME@glogin.hlrn.de:/home/USERNAME/

💎 If you have not added your SSH key to your ~/.ssh/config file, you might get a permission denied error. In this case, pass your key explicitly: scp accepts -i path_to_your_ssh_key, while rsync takes the key via -e "ssh -i path_to_your_ssh_key".
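
A sketch of both transfers with an explicit key, using the placeholder key path from the login section:

# download the project folder from the cluster, passing the key via rsync's remote-shell option
rsync -rp -e "ssh -i ~/.ssh/id_ed25519_hpc" USERNAME@glogin.hlrn.de:/home/USERNAME/deep-learning-with-gpu-cores .
# upload a single file with scp, passing the key with -i
scp -i ~/.ssh/id_ed25519_hpc README.md USERNAME@glogin.hlrn.de:/home/USERNAME/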

💎 Make sure that the paths are correct. You can always use pwd in the respective folders and copy-and-paste these directories into the command.

How to run the Profiler

⚠️ This section is no longer part of the "Deep Learning with GPU Cores" course, but is now part of the "Performance Analysis of AI and HPC Workloads" course.

You can profile the code to see in which functions the execution spends its time. To do so, uncomment the last line in both Slurm scripts, code/submit_test.sh and code/submit_train.sh, so that instead of train.py you run train_with_logger.py.
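
The end of each submit script then looks roughly like this (the exact layout in the scripts may differ):

# python train.py                    # the regular, unprofiled run, now commented out
python train_with_logger.py          # the run instrumented with the profiler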

Set up the Profiler

Please make sure to install the PyTorch Profiler TensorBoard plugin on your local machine with pip install torch_tb_profiler. It is used to visualise the profiling results.
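
On your local machine, that is:

# install the TensorBoard plugin for the PyTorch Profiler locally (used for visualisation only)
pip install torch_tb_profiler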

Download and visualize profiling results of example run

You can download and visualize the profiling results as follows.

# download example profiling output 
rsync -avvH <user>@glogin9.hlrn.de:/scratch/projects/workshops/gpu-workshop/profiling_example ~/Downloads
cd ~/Downloads/profiling_example/train-nn-gpu_4464217
# open PyTorch Profiler's output with TensorBoard
tensorboard --logdir=./profiler
# open DeepSpeed's output with a text editor
vim slurm-train-nn-gpu-4464217.out

Profiling - Further reading

A typical bottleneck for deep learning applications is data loading. An improved approach to data loading in our use case using DASK is described in our blog article Parallel 3D Point Cloud Data analysis with Dask.

Run on SCC cluster

This section is only needed if you are running the workflow on the SCC cluster instead of the Grete cluster.

The steps to run on the SCC cluster are similar to those for the Grete cluster. However, minor changes have to be made to submit_train.sh in the code directory YOUR_PATH/deep-learning-with-gpu-cores/code/. The following changes are required (a sketch of the resulting script header follows the list):

  1. '#SBATCH -p grete:shared' change to '#SBATCH -p gpu'
  2. '#SBATCH -G A100:1' change to '#SBATCH -G 1'
  3. 'module load cuda' change to 'source $ANACONDA3_ROOT/etc/profile.d/conda.sh'
  4. 'source activate dl-gpu' change to 'conda activate dl-gpu'
  5. Remove 'nvcc -V'
  6. Make sure to submit the submit_train.sh script while the conda environment is deactivated on the SCC cluster.
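
Applied to the sketch in the training section above, the adjusted script would look roughly like this; only the replacements listed above come from the course material, everything else is an assumption:

#!/bin/bash
#SBATCH -p gpu                                  # SCC GPU partition instead of grete:shared
#SBATCH -G 1                                    # request one GPU, without the A100 constraint

source $ANACONDA3_ROOT/etc/profile.d/conda.sh   # replaces "module load cuda"
conda activate dl-gpu                           # replaces "source activate dl-gpu"
python train.py                                 # unchanged; the "nvcc -V" line is removed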

Contact and Citation

If you have any questions or if you encounter any issues, please contact us: hauke.kirchner@gwdg.de.

@software{deeplearningwithgpucores2023,
  author = {Sommer, Dorothea and Kirchner, Hauke and Meisel, Tino},
  month = {04},
  title = {{Deep Learning with GPU Cores}},
  url = {https://gitlab-ce.gwdg.de/hpc-team-public/deep-learning-with-gpu-cores/},
  version = {1.0},
  year = {2023}
}
  1. Qi et al. (2017) PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation (https://github.com/charlesq34/pointnet)