GPU data
Hi, Rosemarie.
GPU data is finally available. Below is the data that is currently collected and written to the JSON (we can also think about other GPU information that we might need). A node may have multiple GPUs, so each GPU gets its own dedicated set of values, identified by a bus value that is unique among the GPUs on that node.
The data in JSON looks as follows:
job: ...
nodes: [
node_name: "YY",
...
cpu_usage: XX,
...
gpus: [
bus: string, - BUS ID of the GPU
power_limit: float, - power limit in Watts
mem: integer, - total memory in MB
mem_max: integer, - used memory in MB. HWM (high water mark)
temp_max: integer, - GPU temperature in C. HWM
power_max: float, - HWM of power used in Watts
usage_max: float, - HWM of GPU utilization
usage_avg: float, - mean of GPU utilization
cpu_usage_max: float, - HWM of CPU usage of processes using GPU in percentages
cpu_mem_rss_max: integer, - HWM of RSS memory of processes using GPU in Bytes
cpu_proc_total: integer, - total number of unique processes which used GPU
dynamic: { - same structure as for nodes->dynamic. delta contains time diff and data contains points.
"seq_mem_avg": integer - time series of GPU memory usage in MB
"seq_usage_avg": float - time series of GPU utilization in percentages (max 100)
"seq_temp_avg": integer - time series of GPU temperature in Celsius
"seq_power_avg": float - time series of GPU power consumption in Watts
"seq_cpu_usage_sum": float - the sum of CPU usages of all processes on GPU in period delta in percentages
"seq_cpu_mem_rss_sum": integer - the sum of RSS memory consumed by all CPU processes on GPU in period delta in Bytes
"seq_cpu_proc_count": integer - the number of CPU processes using GPU in period delta (total amount per interval)
}
]
]
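To make the layout above easier to discuss, here is a rough Python sketch of how a consumer (for example the pdf-report) might walk the per-GPU values. It is only an illustration under assumptions: the helper names `gpu_summaries` and `series_points` are made up, the sample values are invented, and I assume each entry in `dynamic` is an object with a numeric `delta` and a `data` list, which should be checked against nodes->dynamic.

```python
def gpu_summaries(report):
    """Yield one summary dict per GPU, identified by node name and bus ID."""
    for node in report.get("nodes", []):
        for gpu in node.get("gpus", []):
            mem_total = gpu.get("mem")    # total memory in MB
            mem_hwm = gpu.get("mem_max")  # used memory HWM in MB
            # Fraction of GPU memory used at the high water mark, if both values exist.
            mem_frac = mem_hwm / mem_total if mem_total and mem_hwm is not None else None
            yield {
                "node": node.get("node_name"),
                "bus": gpu.get("bus"),
                "mem_hwm_fraction": mem_frac,
                "usage_avg": gpu.get("usage_avg"),
            }


def series_points(series):
    """Turn an assumed {"delta": ..., "data": [...]} series into (offset, value) pairs."""
    delta = series["delta"]
    return [(i * delta, value) for i, value in enumerate(series["data"])]


# Invented sample that follows the field names listed above (not real job data).
sample = {
    "nodes": [
        {
            "node_name": "YY",
            "gpus": [
                {
                    "bus": "00000000:3B:00.0",
                    "mem": 16000,
                    "mem_max": 4000,
                    "usage_avg": 42.5,
                    "dynamic": {
                        "seq_usage_avg": {"delta": 30, "data": [10.0, 75.0]},
                    },
                }
            ],
        }
    ]
}

for summary in gpu_summaries(sample):
    print(summary)

first_gpu = sample["nodes"][0]["gpus"][0]
print(series_points(first_gpu["dynamic"]["seq_usage_avg"]))
```

For the real test file, the same functions could be applied to the result of `json.load` on 2365880.pdf.json, provided the assumption about `dynamic` holds.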
Not all metrics need to be represented in the pdf-report; which ones to include is still open for discussion.
Test data can be found in 2365880.pdf.json
Edited by Azat Khuziyakhmetov