Builder collectd-60-solaris10-sparc Build #21
Results:
Build successful
SourceStamp:
Project | collectd/collectd |
Repository | https://github.com/collectd/collectd |
Branch | collectd-6.0 |
Revision | 662a189bc410c46f3ae13bb593b1de51f598c0c4 |
Got Revision | 662a189bc410c46f3ae13bb593b1de51f598c0c4 |
Changes | 1 change |
BuildSlave:
unstable10s
Reason:
The AnyBranchScheduler scheduler named 'schedule-collectd-60' triggered this build
Steps and Logfiles:
Build Properties:
Name | Value | Source |
---|---|---|
branch | collectd-6.0 | Build |
builddir | /export/home/buildbot-unstable10s/slave/collectd-60-solaris10-sparc | slave |
buildername | collectd-60-solaris10-sparc | Builder |
buildnumber | 21 | Build |
ciflags | --disable-aggregation --disable-check_uptime --disable-csv --disable-java --disable-lua --disable-match_empty_counter --disable-match_hashed --disable-match_regex --disable-match_timediff --disable-match_value --disable-network --disable-perl --disable-postgresql --disable-target_notification --disable-target_replace --disable-target_scale --disable-target_set --disable-target_v5upgrade --disable-threshold --disable-write_graphite --disable-write_kafka --disable-write_mongodb --disable-write_pro .. [property value too long] | SetPropertyFromCommand Step |
codebase | | Build |
got_revision | 662a189bc410c46f3ae13bb593b1de51f598c0c4 | Git |
project | collectd/collectd | Build |
repository | https://github.com/collectd/collectd | Build |
revision | 662a189bc410c46f3ae13bb593b1de51f598c0c4 | Build |
scheduler | schedule-collectd-60 | Scheduler |
slavename | unstable10s | BuildSlave |
workdir | /export/home/buildbot-unstable10s/slave/collectd-60-solaris10-sparc | slave (deprecated) |
Forced Build Properties:
Name | Label | Value |
---|---|---|
Responsible Users:
- Eero Tamminen <eero.t.tamminen@intel.com>
Timing:
Start | Tue Jun 7 19:56:19 2022 |
End | Tue Jun 7 20:15:41 2022 |
Elapsed | 19 mins, 21 secs |
All Changes:
Change #155025
Category | None |
Changed by | Eero Tamminen <eero.t.tamminen@intel.com> |
Changed at | Tue 07 Jun 2022 19:55:14 |
Repository | https://github.com/collectd/collectd |
Project | collectd/collectd |
Branch | collectd-6.0 |
Revision | 662a189bc410c46f3ae13bb593b1de51f598c0c4 |
Comments:
[collectd 6] Add 'gpu_sysman' plugin for (Intel) GPU metrics (#3968)
* Add 'gpu_sysman' plugin for (Intel) GPU metrics
  Metrics data is provided by OneAPI Level Zero Sysman API.
* Add unit-testing for 'gpu_sysman' plugin
  See comment at start of src/gpu_sysman_test.c for details.
* Integrate 'gpu_sysman' plugin and its unit-testing to collectd build
* Add 'gpu_sysman' plugin configuration and documentation
* gpu_sysman: use sizeof(*var) rather than sizeof(vartype) in var=calloc(...)
  Except for gpu_subarray_alloc(), all allocs are done with calloc(). This way correctness of all of them is easy to check just by grepping for calloc (especially now that clang-format does not wrap those lines any more), and reviewing gpu_subarray_alloc().
* gpu_sysman: minimal v6 API support + add units to metric names
  Prometheus & OpenMetrics require metric names to be suffixed by the metric unit, and ratios (0-1) to be used instead of percentages (0-100).
* gpu_sysman: update test code for minimal v6 API support + new metric names
  There's now also support for multiple metrics per family although they are not used yet. "sstrncpy" is not needed any more.
* gpu_sysman: split metric properties from their names to separate labels
  Following labels are used:
  - sub_dev: subdevice ID (unsigned integer)
  - location: e.g. "gpu" / "memory"
  - type: e.g. "request" / "actual"
  - direction: "read" / "write"
  Additionally:
  * Two location label values were fixed
  * GPU engine indices are now per engine type (instead of single index being used for all types)
  * All metric family and label names have been changed to use underscores instead of dashes to separate words, as required by Prometheus i.e. collectd does not need to convert them any more: https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
* gpu_sysman: update test code to handle metrics split with labels
  NOTE: providing NULL as label value to delete it is NOT supported. Test code will assert on labels with NULL values.
* gpu_sysman: remove "GPU-" prefix from name and add it as "pci_bdf" label
  Also rename GPU struct "name" member to more explicit "pci_bdf". This allowed simplifying the code slightly. Sysman API supports nowadays also other devices than GPUs, so prefix is removed to simplify code and to be more future-proof: https://spec.oneapi.io/level-zero/latest/core/api.html#_CPPv416ze_device_type_t
  (Plugin will still query only GPU devices from Sysman though.)
* gpu_sysman: fix test code for "pci_bdf" added to metrics family
  - do not add "pci_bdf" to metric name for matching
  - fix for adding metric labels to family copies of them
* gpu_sysman: improvements to reported metrics
  * Fix memory "type" label overwrite
  * Replace "free" memory metric with "memory_usage_ratio" one, and rename "memory_bytes" to "memory_used_bytes" metric
  * Split metric value aggregate function name to a separate "function" label
  * Have metric family declares always in same place in code
  * Avoid both setting metric labels, and reporting empty metrics, when higher internal sampling rate is used or there are L0 errors
* gpu_sysman: update tests for sysman plugin changes
  * Add "memory_usage_ratio" checks
  * Update validation for metrics that can be sampled at higher rate i.e. have now the new aggregate function label
  * With empty metrics avoided, dispatch mock-up can assert on them
  * With extra L0 calls being skipped when not needed, number of calls can differ between query rounds:
    - refactor multi-sampling test to handle count changes
    - change error handling checks to be done in single-sampled mode
  * Debug output is needed to debug triggered multisample asserts, so do that when assert would have been triggered, then abort
* gpu_sysman: add help information for all metric families
  And document why const-qual cast is safe, and why GCC does not warn about other assignments to .name & .help members.
* gpu_sysman: option to disable utilization metrics for single engines
  More powerful GPUs can have a large number of engines of given type, but user may be interested only in the higher level engine groups utilization. "DisableEngineSingle" option allows skipping individual engine metrics.
* gpu_sysman: option for specifying metrics output type
  This can be used to specify whether output metrics values will be raw, derived or both. This commit adds support just for the configuration option itself, adding / changing metrics to use it happens in next commit.
* gpu_sysman: optional raw metrics output for already supported metrics
  This adds new counter type metrics for:
  * memory bandwidth
  * frequency throttle time
  * engine execution time (activity)
  * energy usage
  Because collectd internally handles counters as integers, all units cannot be ones recommended by Prometheus, but microseconds and microjoules reported by Sysman.
* gpu_sysman: skip metrics with div-by-zero or time wrap around issues
  Zero time intervals or max bandwidth would cause div-by-zero issues and (very rare) time wrap around would cause bogus metric value. Skip all of them.
* gpu_sysman: fix test code -Wpedantic + -Wcast-qual warnings
* gpu_sysman: add 'sub_dev' and 'type' labels only when needed
  Empty label equals to a missing one, and Prometheus queries can check for non-existence of a label, so let's just skip empty / unneeded ones. Main difference to earlier is that LevelZero error categories that provide non-zero values only for uncorrectable type (according to spec), are now without a type label. Correctable i.e. zero metrics for those categories were skipped already earlier.
* Add "dev_file" label support
  And contrib/format.sh include re-order.
  "dev_file" support is behind a define (enabled by default) because it needs functions that are only part of POSIX, not C99. Intel Kubernetes GPU plugin uses primary GPU node device file names (card0, card1...) as its GPU identifiers. This new label helps in mapping Kubernetes custom metrics to them.
* Move test defines from Sysman plugin to its test code
  And document with what GCC warning options the code is tested / passes.
* Change strcpy() in Sysman plugin to sstrncpy()
  While for plugin that change does not really help (as target buffer is always larger than source), for test code it is useful. And it shuts up less capable static checking tools than GCC. As test code cannot use existing collectd functionality for this (test code needs modified versions of some collectd functions, and all collectd code does not pass GCC warnings I use), sstrncpy() is copied to test code. For test code there's also a fix to size given for snprintf(), and removal of redundant string termination for modified plugin_log() copy (vsnprintf() already terminates string).
* Pass clang-format check for gpu_sysman_test.c comments
* Add scalloc() wrapper similar to smalloc() to common utils
  scalloc() wraps calloc() with exit on alloc failure, similarly to what smalloc() does for malloc().
* Replace Sysman plugin alloc+assert calls with smalloc/scalloc
  If asserts were disabled, allocation failures would result in collectd memory errors => replace alloc+assert in the plugin with collectd smalloc/scalloc wrappers that exit after logging allocation error. Downsides are that this does not invoke debugger (which could be in a different control group with plenty of memory), nor tell where / what allocation failed, like enabled assert would, so test code variants of the wrappers still do asserts.
* Pass clang-format check for gpu_sysman_test.c
Changed files
- Makefile.am
- README
- configure.ac
- src/collectd.conf.in
- src/collectd.conf.pod
- src/gpu_sysman.c
- src/gpu_sysman_test.c
- src/utils/common/common.c
- src/utils/common/common.h
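Note on the scalloc() wrapper: the commit message describes scalloc() as wrapping calloc() with exit on allocation failure, and says the test code variants still assert instead. The sketch below illustrates both behaviours together with the `sizeof(*var)` allocation style the message recommends. The names `sketch_scalloc`, `sketch_scalloc_test`, `sample_t` and `alloc_samples` are hypothetical, and this is not the code from src/utils/common/common.c.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical plugin-side variant: log the failure and exit.
 * Callers are assumed to pass non-zero sizes, because calloc() may
 * legitimately return NULL for a zero-sized request. */
static void *sketch_scalloc(size_t nmemb, size_t size) {
  void *p = calloc(nmemb, size);
  if (p == NULL) {
    fprintf(stderr, "Not enough memory.\n");
    exit(EXIT_FAILURE);
  }
  return p;
}

/* Hypothetical test-side variant: assert, so that a debugger stops at
 * the failing allocation. */
static void *sketch_scalloc_test(size_t nmemb, size_t size) {
  void *p = calloc(nmemb, size);
  assert(p != NULL);
  return p;
}

/* Illustrative type only. */
typedef struct {
  double value;
  unsigned count;
} sample_t;

/* sizeof(*samples) follows the variable's type automatically, so the
 * call stays correct if the type of `samples` changes later. */
static sample_t *alloc_samples(size_t n) {
  sample_t *samples = sketch_scalloc(n, sizeof(*samples));
  return samples;
}
```

Because neither wrapper returns NULL, callers need no check after the call, which is what lets the plugin drop its alloc+assert pairs.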