This vignette describes how to use a high-performance computing (HPC) cluster for large-scale processing of airborne lidar scanning data. The HPC cluster used here is provided by the GWDG1, so this description is tailored to the specific requirements of that system. However, the general strategies should also work on other HPC systems that use the SLURM workload manager2.

1 Motivation

1.1 Lidar data and individual tree segmentation

“Airborne Laser Scanning (ALS) is an active remote sensing technology based on lidar sensors mounted on airplanes or helicopters. A lidar sensor emits and records laser pulses, which provide the distance between an object and the sensor. Combined with the lidar sensor’s position (Global Navigation Satellite Systems (GNSS)) and orientation (Inertial Measurement Unit (IMU)), a georeferenced, three-dimensional model of the surveyed landscape can be captured (Lindberg and Holmgren 2017). Depending on the objects within the landscape, each laser pulse can result in several returns.

The main variables are northing, easting, and height for each recorded point (figure 1.1). Other recorded variables are intensity, return number, and number of returns. Intensity represents the return magnitude of all returns of a pulse (The American Society for Photogrammetry & Remote Sensing 2011). The return order of each return of a pulse is encoded in the return number field. Number of returns is the total number of returns per laser pulse. These variables can be used to characterize objects in the landscape. In the context of forestry, ALS data has great potential due to its ability to directly measure structural information, such as tree height (Wulder et al. 2008). Additionally, it was shown that the intensity can contribute to species classification (Shi et al. 2018).” (Kirchner 2021)
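These point attributes can be inspected directly in R. The following sketch reads one of the example files shipped with the lidR package (the file name refers to lidR’s extdata folder; any LAS/LAZ file works the same way):

library(lidR)

# read an example file shipped with lidR
las <- readLAS(system.file("extdata", "MixedConifer.laz", package = "lidR"))

# lists X, Y, Z plus per-return attributes such as Intensity,
# ReturnNumber, and NumberOfReturns
names(las@data)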

Due to these characteristics of lidar data, it is possible to identify single trees on a large scale. “Individual tree detection (ITD) is the process of spatially locating trees and extracting height information. Individual tree segmentation (ITS) is the process of individually delineating detected trees.” (Roussel, Goodbody, and Tompalski 2021)
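In lidR, the two steps map onto two functions. The following minimal sketch illustrates the distinction on the example file from above (the 5 m window size and the li2012 segmentation algorithm are arbitrary choices for illustration, not the workflow used later in this vignette):

library(lidR)

las <- readLAS(system.file("extdata", "MixedConifer.laz", package = "lidR"))

# ITD: spatially locate trees with a local maximum filter (5 m window)
ttops <- locate_trees(las, lmf(ws = 5))

# ITS: individually delineate trees directly in the point cloud
las <- segment_trees(las, li2012())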


Figure 1.1: Airborne Laser Scanning (ALS) data colored by height. (Source: Kirchner (2021))

1.2 High-performance computing

For optimal utilization of the HPC resources, it is critical that the workflow runs in parallel. Since all tiles of an airborne lidar data set can be processed independently of each other, they can be processed as embarrassingly parallel jobs3 (figure 1.2). Parallelization can therefore be achieved with job arrays4.


Figure 1.2: Job management on HPC. Because array jobs have relatively small resource requirements, the scheduler can find adequate computing slots more easily. On the left side, a schematic visualization of the job management on an HPC cluster with two nodes of three cores each is shown over time. Rectangles represent individual jobs; the more resources a job requests, the larger its rectangle. The LAScatalog with the processing status for each tile is shown on the right side.

2 Getting started with the HPC system

First, you will have to get access to our HPC system and connect to it via SSH.

3 Setting up R

Attention: This workflow will not work on the GWDG’s RStudio Server5.

3.1 Spack

Unlike on your personal computer, software is not immediately available; it needs to be loaded before you can use it. On our system, this is done with the Spack package manager6.

For this tutorial, you need to execute the following bash commands, which make sure that all the required software is available. First, we load R7, gdal8, and fftw9. After that, we install a geos10 version that is newer than the one currently provided globally on our HPC system.

# show available software
module avail

# show loaded software
module list
# load the required modules
module load r/4.1.1
module load gdal/3.4.1
module load fftw

# check the globally provided geos version
geos-config --version
module load spack-user
source $SPACK_USER_ROOT/share/spack/setup-env.sh
# just needed once: spack install geos@3.9.1
spack load geos@3.9.1

3.2 Installation of R packages in custom directory

You need to clone the provided repository and start R for this tutorial.

mkdir -p ~/R/v411
mkdir ~/projects
cd ~/projects
git clone https://gitlab-ce.gwdg.de/hauke.gronenberg/workshop-forests-in-hpc.git
R

After that, you need to install the lidR package11 and the R package from the cloned repository into a personal directory.

lib_path <- "~/R/v411"
.libPaths(lib_path)

install.packages("lidR", lib = lib_path)
# choose CRAN mirror Goettingen if asked
library(lidR, lib.loc = lib_path)

install.packages("~/projects/workshop-forests-in-hpc/forestHPC", repos = NULL, lib = lib_path)

Now everything is set up, and we can prepare our analysis.

3.3 Example data

For demonstration purposes, we use example data generated from data provided by the lidR package. This yields a tiled data set similar to the one shown above.

.libPaths("~/R/v411")
library(forestHPC)

data_dir <- "~/tmp/lidR_arrayjob_example/data/"

dir.create(data_dir, recursive = TRUE)
ctg <- forestHPC:::get_example_LAScatalog(path = data_dir)
#lidR::plot(ctg)

4 Running Jobs with SLURM

Attention: Each analysis needs an adapted version of the following code. Please make sure to adjust the scripts to your own needs.

To use our HPC cluster, you need to submit a job to our batch system12. The login nodes are shared and are only meant to be used for pre- and post-processing and for interacting with the batch system. Using them for computationally heavy tasks impedes the work of other users.

As our example data can be processed independently (see section Motivation), we will use array jobs to process it. Array jobs are used to execute the individual tree crown clustering workflow (?forestHPC::do_watershed) for each tile individually. The first step is to define custom parameters and requirements for the array jobs in a script. Here we define the job name (-J) and a file used to save the standard output (-o). The latter is essential for debugging and troubleshooting. Also critical are the time limit for each job (-t) and the required memory per CPU (--mem-per-cpu). All details can be found in our documentation13, in man sbatch, or at https://slurm.schedmd.com/sbatch.html.

After the part with SLURM-specific code, the required software is loaded. On our HPC, this is done with the command module load14.

Lines starting with echo are used to print SLURM settings to the standard output.

Finally, the function call is executed by Rscript. The command time prints the resource usage to the standard output (see man time for further information). The passed parameter $SLURM_ARRAY_TASK_ID holds the array index of the current task. This index is used to iterate over the tiles of a LAScatalog object.
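Instead of passing the index on the command line, the R script could equally read it from the environment, since SLURM exports SLURM_ARRAY_TASK_ID to every array task. A minimal sketch of this alternative:

# alternative to commandArgs(): read the array index directly
# from the environment variable set by SLURM
id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
stopifnot(!is.na(id))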

The presented approach of processing a LAScatalog with job arrays is a workaround. Originally, LAScatalog objects were designed for parallel processing with the future package15. Unfortunately, job arrays are not yet supported16.
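For comparison, within a single node the engine’s native future-based parallelism would look like the following sketch (the worker count is arbitrary and would have to match the resources requested from SLURM):

library(lidR)
library(future)

# lidR's catalog engine picks up the future plan and processes
# the chunks of a LAScatalog in parallel
plan(multisession, workers = 4L)

ctg <- readLAScatalog("~/tmp/lidR_arrayjob_example/data/")
# any subsequent catalog processing now runs four chunks at a time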

The jobs are started by calling the script with sbatch -a <i>-<j> ~/projects/workshop-forests-in-hpc/forestHPC/inst/script/lidR_arrayjob_example.sh. The parameters <i> and <j> must be set to integers and define the range of array indices. For example, sbatch -a 5-10 ~/projects/workshop-forests-in-hpc/forestHPC/inst/script/lidR_arrayjob_example.sh will lead to the processing of tiles 5, 6, 7, 8, 9, and 10 of the LAScatalog.
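To choose a sensible upper bound for the index range, you can check how many tiles the catalog contains (a sketch, assuming the example catalog from section 3.3):

# the number of files in the catalog is the maximum useful array index
ctg <- lidR::readLAScatalog("~/tmp/lidR_arrayjob_example/data/")
length(ctg$filename)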

#!/bin/bash

#SBATCH -J lidR_arrayjob_example
#SBATCH -o /usr/users/%u/%x-%A-%a.log
#SBATCH -t 0-00:10:00

#SBATCH --mem-per-cpu=2G

# Load modules here
module load spack-user
source $SPACK_USER_ROOT/share/spack/setup-env.sh
spack load geos@3.9.1
module load fftw
module load r/4.1.1
module load gdal/3.4.1

echo "SLURM_JOB_NODELIST"=$SLURM_JOB_NODELIST
echo "CPU: $SLURM_CPUS_ON_NODE"
echo "SLURM_NNODES"=$SLURM_NNODES
echo "Array ID" $SLURM_ARRAY_TASK_ID

# Execute your app
time Rscript --vanilla ~/projects/workshop-forests-in-hpc/forestHPC/inst/script/lidR_arrayjob_example.R $SLURM_ARRAY_TASK_ID

# execute: sbatch -a 1-12 ~/projects/workshop-forests-in-hpc/forestHPC/inst/script/lidR_arrayjob_example.sh

The script discussed above will execute the following R script n times, where n is the number of requested jobs. For example, sbatch -a 5-10 ~/projects/workshop-forests-in-hpc/forestHPC/inst/script/lidR_arrayjob_example.sh will execute the R script six times. The last line of the array job script runs an R script that executes the ITD workflow (see code chunk below). First, project-specific variables are defined (e.g., the output directory out_path). Then the script parses the input parameter, which is the index of the current job. After the LAScatalog is read, this index is used to select the tile of interest for the specific job. After that, the buffer and the naming scheme for output files are defined. Finally, the ITD workflow is started.

After the job submission, information about the jobs can be seen with the command squeue -u $USER17. Furthermore, the standard output is written to the file specified with #SBATCH -o.

5 Example analysis

The following R script is used to perform the individual tree detection workflow (?forestHPC::do_watershed). This script is executed in the last line of the SLURM script shown above (lidR_arrayjob_example.R).

After loading the R packages used for this analysis and defining the global variables, the array job ID is parsed into the variable id. After that, all las files are read as a LAScatalog18. Then the array job ID (id) is used to mark the tile that this specific job will process.

.libPaths("~/R/v411")
library(forestHPC)
library(lidR)

# define global variables
proj_dir <- "~/tmp/lidR_arrayjob_example/"
out_path  <- paste0(proj_dir, "out/")
data_path <- paste0(proj_dir, "data/")

# create output directory if not existing
if(!dir.exists(out_path)){dir.create(out_path, recursive = TRUE)}

# read parameter values
args <- commandArgs(trailingOnly = TRUE)
if (length(args) == 0) {
  stop("no parameters", call. = FALSE)
} else {
  id <- as.integer(args[1])
}

# read LAScatalog and limit the analysis to the tile of interest
files <- list.files(path = data_path, pattern = "\\.las$", full.names = TRUE)
ctg <- lidR::readLAScatalog(files)
ctg$processed <- FALSE
ctg$processed[id] <- TRUE

# define workflow variables
lidR::opt_chunk_buffer(ctg) <- 5
lidR::opt_select(ctg) <- "xyz"
lidR::opt_output_files(ctg) <- paste0(out_path, "{ORIGINALFILENAME}_watershed")

# watershed workflow ###########################################################
forestHPC::do_watershed(ctg)

The following image shows the status of the LAScatalog in array job five after all settings are applied but before the watershed workflow is executed. The settings applied to the LAScatalog are the buffer (here 5 meters) and the processing status of the tiles. In the image below, the buffer area is indicated by a dotted line, and a red border highlights the active tile.
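Such a view can also be produced interactively with lidR’s catalog plot, which can display the chunk pattern including the buffer (a sketch; the chunk_pattern argument is taken from plot.LAScatalog and may differ between lidR versions):

# display the chunk pattern (tiles plus 5 m buffer)
# instead of the plain acquisition pattern
lidR::opt_chunk_buffer(ctg) <- 5
lidR::plot(ctg, chunk_pattern = TRUE)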

5.1 Evaluation of Resource Usage on HPC

Resource usage can be analyzed with the command sacct. In the following, an example output is written to a file for later analysis.

# !!! replace <job_id>
sacct --format=JobID,JobIDRaw,JobName,User,Group,Partition,MaxRSS,MaxPages,AveCPU,MaxDiskWrite,MaxDiskRead,MaxVMSize,NTasks,AllocCPUS,Submit,Start,Elapsed,End,State,ExitCode,ReqMem,Timelimit -s BF,CA,CD,F,NF,PR,TO -P -a -j <job_id> > ~/tmp/lidR_arrayjob_example/out/<job_id>_sacct.txt
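Because of the -P flag, sacct writes pipe-delimited output, which can be read back into R for evaluations like figure 5.1 (a sketch; the file path contains the <job_id> placeholder from the command above):

# read the pipe-delimited sacct output
usage <- read.table("~/tmp/lidR_arrayjob_example/out/<job_id>_sacct.txt",
                    sep = "|", header = TRUE)

# e.g., peak memory and wall time per array task
usage[, c("JobID", "MaxRSS", "Elapsed", "State")]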

Figure 5.1: Resource usage on HPC per job.

5.2 Transfer data

After the jobs have finished on the HPC cluster, the results need further processing. Here we use rsync to get a local copy of the results. For further information on how to transfer data from and to the HPC cluster, have a look at our documentation: https://docs.gwdg.de/doku.php?id=en:services:application_services:high_performance_computing:transfer_data

rsync -avvH {USERID}@transfer-scc.gwdg.de:~/tmp/lidR_arrayjob_example ~/Downloads

5.3 Plotting segmented trees

# read one of the processed tiles
las <- lidR::readLAS("~/Downloads/lidR_arrayjob_example/out/Forest_5_watershed.las")
rgl::plot3d(las@data$X, las@data$Y,las@data$Z,
      xlab = "x", ylab = "y", zlab = "z", 
      type = "p", 
      col = as.numeric(las@data$crown_id),
      size = 2,
      main = "", sub = "", ann = FALSE, axes = FALSE)

5.4 Extra: lidR’s LAScatalog processing engine19

#' Apply the watershed workflow
#'
#' @details
#' The function dispatches on the class of \code{las}.
#'
#' @template param-las
#'
#' @return If the input is a \code{LAS} object, return a LAS object.
#'         If the input is a \code{LAScatalog} las files are written to disk.
#'
#' @examples
#' # Attention: The example las file already contains a column "treeID".
#' #            Here we will save the tree crown IDs in "crown_id".
#'
#' # LAScatalog ################################################################
#'
#' library(lidR)
#' library(forestHPC)
#'
#' tmp_dir <- tempdir()
#' ctg <- forestHPC:::get_example_LAScatalog(path = tmp_dir)
#'
#' list.files(tmp_dir)
#'
#' # define LAScatalog options
#' lidR::opt_chunk_buffer(ctg) <- 5
#' lidR::opt_output_files(ctg) <- paste0(tmp_dir, "/{ORIGINALFILENAME}_watershed")
#' lidR::opt_select(ctg) <- "xyz"
#'
#'
#' ctg <- do_watershed(ctg)
#'
#' list.files(tmp_dir)
#'
#' las <- lidR::readLAS(ctg$filename[3])
#'
#' plot(las, color = "crown_id")
#'
#' @export
do_watershed <- function(las, buffer = NULL)
{
  UseMethod("do_watershed", las)
}


#' @export
do_watershed.LAS <- function(las, buffer = NULL)
{
  # Using raster because focal does not exist in stars
  chm <- lidR::rasterize_canopy(las, 0.5, lidR::p2r(0.3), pkg = "raster")

  # smooth the canopy height model with a 3x3 mean filter
  ker <- matrix(1, 3, 3)
  chm <- raster::focal(chm, w = ker, fun = mean, na.rm = TRUE)

  # locate tree tops and segment the crowns
  ttops <- lidR::locate_trees(chm, lidR::lmf(4, 2))
  las   <- lidR::segment_trees(las, lidR::dalponte2016(chm, ttops), attribute = "crown_id")

  return(las)
}


do_watershed.LAScluster <- function(las)
{
  # read the chunk (tile plus buffer); may be empty at the edges
  x <- lidR::readLAS(las)
  if (lidR::is.empty(x)) return(NULL)

  x <- do_watershed(x, buffer = NULL)
  return(x)
}

#' @export
do_watershed.LAScatalog <- function(las)
{

  options <- list(automerge = TRUE)

  output <- lidR::catalog_apply(las,
                                do_watershed.LAScluster,
                                .options = options)

  return(output)
}

5.5 Issues

If R prints the startup warning “Setting LC_CTYPE failed, using "C"”, quit R (q()) and try:

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8

If you see the message “code for methods in class […] was not checked for suspicious field assignments (recommended package ‘codetools’ not available?)”, install codetools:

install.packages("codetools", lib = lib_path)

References

Kirchner, H. 2021. “Large-scale tree detection and species classification from LiDAR: Combining mean shift clustering with random forest and deep learning.” Master’s thesis, Eberswalde University for Sustainable Development; Helmholtz Centre for Environmental Research (UFZ); unpublished.
Lindberg, E., and J. Holmgren. 2017. “Individual Tree Crown Methods for 3D Data from Remote Sensing.” Current Forestry Reports 3 (1): 19–31. https://doi.org/10.1007/s40725-017-0051-6.
Roussel, J.-R., D. Auty, N. Coops, P. Tompalski, T. Goodbody, A. S. Meador, J.-F. Bourdon, F. de Boissieu, and A. Achim. 2020. “lidR: An R package for analysis of Airborne Laser Scanning (ALS) data.” Remote Sensing of Environment 251: 112061. https://doi.org/10.1016/j.rse.2020.112061.
Roussel, J.-R., T. Goodbody, and P. Tompalski. 2021. “The lidR package.” Available online: https://jean-romain.github.io/lidRbook/index.html (accessed 12.01.2021).
Shi, Y., T. Wang, A. K. Skidmore, and M. Heurich. 2018. “Important LiDAR Metrics for Discriminating Forest Tree Species in Central Europe.” ISPRS Journal of Photogrammetry and Remote Sensing 137: 163–74. https://doi.org/10.1016/j.isprsjprs.2018.02.002.
The American Society for Photogrammetry & Remote Sensing. 2011. “LAS Specification – Version 1.4.” Available online: https://www.asprs.org/wp-content/uploads/2010/12/LAS_1_4_r13.pdf (accessed 17.06.2021).
Wulder, M., C. Bater, N. Coops, T. Hilker, and J. White. 2008. “The role of LiDAR in sustainable forest management.” The Forestry Chronicle 84 (6): 807–26. https://doi.org/10.5558/tfc84807-6.

  1. Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen↩︎

  2. https://slurm.schedmd.com/↩︎

  3. https://en.wikipedia.org/wiki/Embarrassingly_parallel↩︎

  4. https://slurm.schedmd.com/job_array.html↩︎

  5. https://rstudio.gwdg.de/↩︎

  6. https://spack.readthedocs.io/en/latest/↩︎

  7. https://www.r-project.org/↩︎

  8. https://gdal.org/↩︎

  9. https://www.fftw.org/↩︎

  10. https://libgeos.org/↩︎

  11. https://github.com/r-lidar/lidR↩︎

  12. https://s.gwdg.de/K1X2Yl↩︎

  13. https://docs.gwdg.de/doku.php?id=en:services:application_services:high_performance_computing:running_jobs_slurm↩︎

  14. The command module avail shows all available modules.↩︎

  15. https://cran.r-project.org/web/packages/lidR/vignettes/lidR-LAScatalog-engine.html↩︎

  16. https://github.com/HenrikBengtsson/future.batchtools/issues/23↩︎

  17. Note: There is no need to replace $USER with your user name.↩︎

  18. https://cran.r-project.org/web/packages/lidR/vignettes/lidR-LAScatalog-class.html↩︎

  19. https://r-lidar.github.io/lidRbook/engine2.html↩︎