Buildbot: collectd-60-solaris10-sparc Build #21

Results:

Build successful

SourceStamp:

Project	collectd/collectd
Repository	https://github.com/collectd/collectd
Branch	collectd-6.0
Revision	662a189bc410c46f3ae13bb593b1de51f598c0c4
Got Revision	662a189bc410c46f3ae13bb593b1de51f598c0c4
Changes	1 change

BuildSlave:

unstable10s

Reason:

The AnyBranchScheduler scheduler named 'schedule-collectd-60' triggered this build

Steps and Logfiles:

git update ( 11 secs )
1. stdio
setproperty property 'ciflags' set ( 0 secs )
1. stdio
2. property changes
shell '/opt/csw/bin/bash ./build.sh' ( 5 mins, 13 secs )
1. stdio
shell_1 './configure --prefix=/opt/csw ...' ( 3 mins, 8 secs )
1. stdio
2. config.log
shell_2 'gmake -k ...' ( 9 mins, 20 secs )
1. stdio
shell_3 'gmake check' ( 1 mins, 27 secs )
1. stdio
2. test-suite.log

Build Properties:

Name	Value	Source
branch	collectd-6.0	Build
builddir	/export/home/buildbot-unstable10s/slave/collectd-60-solaris10-sparc	slave
buildername	collectd-60-solaris10-sparc	Builder
buildnumber	21	Build
ciflags	--disable-aggregation --disable-check_uptime --disable-csv --disable-java --disable-lua --disable-match_empty_counter --disable-match_hashed --disable-match_regex --disable-match_timediff --disable-match_value --disable-network --disable-perl --disable-postgresql --disable-target_notification --disable-target_replace --disable-target_scale --disable-target_set --disable-target_v5upgrade --disable-threshold --disable-write_graphite --disable-write_kafka --disable-write_mongodb --disable-write_pro .. [property value too long]	SetPropertyFromCommand Step
codebase		Build
got_revision	662a189bc410c46f3ae13bb593b1de51f598c0c4	Git
project	collectd/collectd	Build
repository	https://github.com/collectd/collectd	Build
revision	662a189bc410c46f3ae13bb593b1de51f598c0c4	Build
scheduler	schedule-collectd-60	Scheduler
slavename	unstable10s	BuildSlave
workdir	/export/home/buildbot-unstable10s/slave/collectd-60-solaris10-sparc	slave (deprecated)

Forced Build Properties:

Name	Label	Value

Responsible Users:

Eero Tamminen
eero.t.tamminenohnoyoudont@intel.com

Timing:

Start	Tue Jun 7 19:56:19 2022
End	Tue Jun 7 20:15:41 2022
Elapsed	19 mins, 21 secs

All Changes:

Change #155025

Category	None
Changed by	Eero Tamminen <eero.t.tamminenohnoyoudont@intel.com>
Changed at	Tue 07 Jun 2022 19:55:14
Repository	https://github.com/collectd/collectd
Project	collectd/collectd
Branch	collectd-6.0
Revision	662a189bc410c46f3ae13bb593b1de51f598c0c4

Comments

[collectd 6] Add 'gpu_sysman' plugin for (Intel) GPU metrics (#3968)

* Add 'gpu_sysman' plugin for (Intel) GPU metrics

Metrics data is provided by OneAPI Level Zero Sysman API.

* Add unit-testing for 'gpu_sysman' plugin

See comment at start of src/gpu_sysman_test.c for details.

* Integrate 'gpu_sysman' plugin and its unit-testing to collectd build

* Add 'gpu_sysman' plugin configuration and documentation

* gpu_sysman: use sizeof(*var) rather than sizeof(vartype) in var=calloc(...)

Except for gpu_subarray_alloc(), all allocs are done with calloc().
This way correctness of all of them is easy to check just by grepping
for calloc (especially now that clang-format does not wrap those lines
any more), and reviewing gpu_subarray_alloc().
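
As an illustration of the sizeof(*var) idiom described above, here is a minimal sketch. The struct and function names are hypothetical and are not taken from the plugin source; the point is only that the element size follows the pointer's type, so the call stays correct if that type changes later.

```c
#include <stdlib.h>

/* Hypothetical struct for illustration only; not from gpu_sysman.c. */
typedef struct {
  unsigned id;
  double value;
} sample_t;

/* Allocate `count` zero-initialized samples. The element size is derived
 * from the pointee (sizeof(*samples)), not from a repeated type name. */
static sample_t *samples_alloc(size_t count) {
  sample_t *samples = calloc(count, sizeof(*samples));
  return samples; /* NULL on allocation failure; the caller must check */
}
```

With every allocation written this way, grepping for calloc is enough to review them all at a glance.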

* gpu_sysman: minimal v6 API support + add units to metric names

Prometheus & OpenMetrics require metric names to be suffixed by the
metric unit, and ratios (0-1) to be used instead of percentages
(0-100).

* gpu_sysman: update test code for minimal v6 API support + new metric names

There's now also support for multiple metrics per family although they
are not used yet. "sstrncpy" is not needed any more.

* gpu_sysman: split metric properties from their names to separate labels

Following labels are used:
- sub_dev: subdevice ID (unsigned integer)
- location: e.g. "gpu" / "memory"
- type: e.g. "request" / "actual"
- direction: "read" / "write"

Additionally:

* Two location label values were fixed

* GPU engine indices are now per engine type
  (instead of single index being used for all types)

* All metric family and label names have been changed to use
  underscores instead of dashes to separate words, as required by
  Prometheus i.e. collectd does not need to convert them any more:
  https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels

* gpu_sysman: update test code to handle metrics split with labels

NOTE: providing NULL as label value to delete it is NOT supported.
Test code will assert on labels with NULL values.

* gpu_sysman: remove "GPU-" prefix from name and add it as "pci_bdf" label

Also rename GPU struct "name" member to more explicit "pci_bdf".

This allowed simplifying the code slightly.

Sysman API nowadays also supports devices other than GPUs, so the prefix
is removed to simplify code and to be more future-proof:
https://spec.oneapi.io/level-zero/latest/core/api.html#_CPPv416ze_device_type_t

(Plugin will still query only GPU devices from Sysman though.)

* gpu_sysman: fix test code for "pci_bdf" added to metrics family

- do not add "pci_bdf" to metric name for matching
- fix for adding metric labels to family copies of them

* gpu_sysman: improvements to reported metrics

* Fix memory "type" label overwrite

* Replace "free" memory metric with "memory_usage_ratio" one,
  and rename "memory_bytes" to "memory_used_bytes" metric

* Split metric value aggregate function name to a separate
  "function" label

* Keep metric family declarations always in the same place in the code

* Avoid both setting metric labels, and reporting empty metrics,
  when higher internal sampling rate is used or there are L0
  errors

* gpu_sysman: update tests for sysman plugin changes

* Add "memory_usage_ratio" checks

* Update validation for metrics that can be sampled at higher
  rate i.e. have now the new aggregate function label

* With empty metrics avoided, dispatch mock-up can assert on them

* With extra L0 calls being skipped when not needed, number of calls
  can differ between query rounds:
  - refactor multi-sampling test to handle count changes
  - change error handling checks to be done in single-sampled mode

* Debug output is needed to debug triggered multisample asserts,
  so do that when assert would have been triggered, then abort

* gpu_sysman: add help information for all metric families

And document why const-qual cast is safe, and why GCC does
not warn about other assignments to .name & .help members.

* gpu_sysman: option to disable utilization metrics for single engines

More powerful GPUs can have a large number of engines of a given type,
but the user may be interested only in the utilization of the
higher-level engine groups.

"DisableEngineSingle" option allows skipping individual engine metrics.

* gpu_sysman: option for specifying metrics output type

This can be used to specify whether output metric values will be
raw, derived or both.

This commit adds support just for the configuration option itself;
adding / changing metrics to use it happens in the next commit.

* gpu_sysman: optional raw metrics output for already supported metrics

This adds new counter type metrics for:
* memory bandwidth
* frequency throttle time
* engine execution time (activity)
* energy usage

Because collectd internally handles counters as integers, not all units
can be the ones recommended by Prometheus; some remain the microseconds
and microjoules reported by Sysman.

* gpu_sysman: skip metrics with div-by-zero or time wrap around issues

Zero time intervals or max bandwidth would cause div-by-zero issues
and (very rare) time wrap around would cause bogus metric value.
Skip all of them.
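
The skip logic described above could look roughly like the following sketch. The function name, parameters and counter layout are assumptions made for illustration; the plugin's actual checks may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: derive a 0-1 ratio from two counter samples.
 * Returns false when the metric should be skipped instead of reported. */
static bool busy_ratio(uint64_t prev_active, uint64_t cur_active,
                       uint64_t prev_ts, uint64_t cur_ts, double *ratio) {
  if (cur_ts <= prev_ts)
    return false; /* zero interval (div-by-zero) or timestamp wrap-around */
  if (cur_active < prev_active)
    return false; /* counter wrap-around would give a bogus value */
  *ratio = (double)(cur_active - prev_active) / (double)(cur_ts - prev_ts);
  return true;
}
```

Returning a flag rather than a sentinel value keeps bogus numbers out of the reported metrics entirely.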

* gpu_sysman: fix test code -Wpedantic + -Wcast-qual warnings

* gpu_sysman: add 'sub_dev' and 'type' labels only when needed

Empty label equals to a missing one, and Prometheus queries can check
for non-existence of a label, so let's just skip empty / unneeded ones.

Main difference to earlier is that LevelZero error categories that
provide non-zero values only for uncorrectable type (according to
spec), are now without a type label. Correctable i.e. zero metrics for
those categories were skipped already earlier.

* Add "dev_file" label support

And contrib/format.sh include re-order.

"dev_file" support is behind a define (enabled by default) because it
needs functions that are only part of POSIX, not C99.

Intel Kubernetes GPU plugin uses primary GPU node device file names
(card0, card1...) as its GPU identifiers.  This new label helps in
mapping Kubernetes custom metrics to them.

* Move test defines from Sysman plugin to its test code

And document with what GCC warning options the code is tested / passes.

* Change strcpy() in Sysman plugin to sstrncpy()

While for the plugin that change does not really help (as the target
buffer is always larger than the source), for test code it is useful.
And it shuts up static checking tools less capable than GCC.

As test code cannot use existing collectd functionality for this (test
code needs modified versions of some collectd functions, and not all
collectd code passes the GCC warnings I use), sstrncpy() is copied
to test code.

For test code there's also a fix to size given for snprintf(), and
removal of redundant string termination for modified plugin_log() copy
(vsnprintf() already terminates string).
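
For readers unfamiliar with the helper, a bounded copy in the spirit of sstrncpy() can be sketched as below. This is an assumption about its general shape, not a copy of the collectd implementation, and the `_sketch` name is hypothetical.

```c
#include <string.h>

/* Sketch of a bounded string copy that always NUL-terminates the target.
 * `n` is the full size of `dest` in bytes. */
static char *sstrncpy_sketch(char *dest, const char *src, size_t n) {
  if (n == 0)
    return dest; /* no room even for the terminator */
  strncpy(dest, src, n);
  dest[n - 1] = '\0'; /* strncpy() does not terminate on truncation */
  return dest;
}
```

Unlike strcpy(), the copy can never write past the end of the target buffer, which is what static checkers look for.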

* Pass clang-format check for gpu_sysman_test.c comments

* Add scalloc() wrapper similar to smalloc() to common utils

scalloc() wraps calloc() with exit on alloc failure,
similarly to what smalloc() does for malloc().
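
A wrapper of this kind might be sketched as follows. The name, the error message and the use of stderr are assumptions for illustration; the real wrapper in src/utils/common/common.c may log and exit differently.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch of a calloc() wrapper that never returns NULL to the caller:
 * on allocation failure it reports the error and exits. */
static void *scalloc_sketch(size_t nmemb, size_t size) {
  void *p = calloc(nmemb, size);
  if (p == NULL) {
    /* Assumption: plain stderr output stands in for collectd logging.
     * Note that calloc() may also return NULL for a zero-sized request. */
    fprintf(stderr, "Not enough memory.\n");
    exit(EXIT_FAILURE);
  }
  return p;
}
```

Callers can then use the returned memory without a NULL check of their own.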

* Replace Sysman plugin alloc+assert calls with smalloc/scalloc

If asserts were disabled, allocation failures would result in collectd
memory errors => replace alloc+assert in the plugin with collectd
smalloc/scalloc wrappers that exit after logging the allocation error.

Downsides are that this does not invoke debugger (which could be in a
different control group with plenty of memory), nor tell where / what
allocation failed, like enabled assert would, so test code variants of
the wrappers still do asserts.

* Pass clang-format check for gpu_sysman_test.c

Changed files

Makefile.am
README
configure.ac
src/collectd.conf.in
src/collectd.conf.pod
src/gpu_sysman.c
src/gpu_sysman_test.c
src/utils/common/common.c
src/utils/common/common.h
