Builder collectd-60-solaris10-sparc Build #21

Results:

Build successful

SourceStamp:

Project collectd/collectd
Repository https://github.com/collectd/collectd
Branch collectd-6.0
Revision 662a189bc410c46f3ae13bb593b1de51f598c0c4
Got Revision 662a189bc410c46f3ae13bb593b1de51f598c0c4
Changes 1 change

BuildSlave:

unstable10s

Reason:

The AnyBranchScheduler scheduler named 'schedule-collectd-60' triggered this build

Steps and Logfiles:

  1. git update ( 11 secs )
    1. stdio
  2. setproperty property 'ciflags' set ( 0 secs )
    1. stdio
    2. property changes
  3. shell '/opt/csw/bin/bash ./build.sh' ( 5 mins, 13 secs )
    1. stdio
  4. shell_1 './configure --prefix=/opt/csw ...' ( 3 mins, 8 secs )
    1. stdio
    2. config.log
  5. shell_2 'gmake -k ...' ( 9 mins, 20 secs )
    1. stdio
  6. shell_3 'gmake check' ( 1 mins, 27 secs )
    1. stdio
    2. test-suite.log

Build Properties:

Name Value Source
branch collectd-6.0 Build
builddir /export/home/buildbot-unstable10s/slave/collectd-60-solaris10-sparc slave
buildername collectd-60-solaris10-sparc Builder
buildnumber 21 Build
ciflags --disable-aggregation --disable-check_uptime --disable-csv --disable-java --disable-lua --disable-match_empty_counter --disable-match_hashed --disable-match_regex --disable-match_timediff --disable-match_value --disable-network --disable-perl --disable-postgresql --disable-target_notification --disable-target_replace --disable-target_scale --disable-target_set --disable-target_v5upgrade --disable-threshold --disable-write_graphite --disable-write_kafka --disable-write_mongodb --disable-write_pro .. [property value too long] SetPropertyFromCommand Step
codebase Build
got_revision 662a189bc410c46f3ae13bb593b1de51f598c0c4 Git
project collectd/collectd Build
repository https://github.com/collectd/collectd Build
revision 662a189bc410c46f3ae13bb593b1de51f598c0c4 Build
scheduler schedule-collectd-60 Scheduler
slavename unstable10s BuildSlave
workdir /export/home/buildbot-unstable10s/slave/collectd-60-solaris10-sparc slave (deprecated)

Forced Build Properties:

Name Label Value

Responsible Users:

  1. Eero Tamminen

Timing:

Start Tue Jun 7 19:56:19 2022
End Tue Jun 7 20:15:41 2022
Elapsed 19 mins, 21 secs

All Changes:

  1. Change #155025

    Category None
    Changed by Eero Tamminen <eero.t.tamminenohnoyoudont@intel.com>
    Changed at Tue 07 Jun 2022 19:55:14
    Repository https://github.com/collectd/collectd
    Project collectd/collectd
    Branch collectd-6.0
    Revision 662a189bc410c46f3ae13bb593b1de51f598c0c4

    Comments

    [collectd 6] Add 'gpu_sysman' plugin for (Intel) GPU metrics (#3968)
    
    * Add 'gpu_sysman' plugin for (Intel) GPU metrics
    
    Metrics data is provided by OneAPI Level Zero Sysman API.
    
    * Add unit-testing for 'gpu_sysman' plugin
    
    See comment at start of src/gpu_sysman_test.c for details.
    
    * Integrate 'gpu_sysman' plugin and its unit-testing to collectd build
    
    * Add 'gpu_sysman' plugin configuration and documentation
    
    * gpu_sysman: use sizeof(*var) rather than sizeof(vartype) in var=calloc(...)
    
    Except for gpu_subarray_alloc(), all allocs are done with calloc().
    This way correctness of all of them is easy to check just by grepping
    for calloc (especially now that clang-format does not wrap those lines
    any more), and reviewing gpu_subarray_alloc().
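
    The sizeof(*var) idiom described above can be illustrated with a minimal
    sketch. The struct and function names here are hypothetical and are not
    taken from src/gpu_sysman.c; the point is only that the element size is
    derived from the pointer being assigned, so the call stays correct if the
    pointer's type changes later.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-device record, for illustration only. */
typedef struct {
  uint64_t energy_uj;
  uint32_t engine_count;
} gpu_sample_t;

/* Allocates `count` zero-initialized samples, or returns NULL on failure.
 * sizeof(*samples) follows the declared type of `samples`, so there is no
 * separately spelled type name that could fall out of sync with it. */
gpu_sample_t *gpu_samples_alloc(size_t count) {
  gpu_sample_t *samples = calloc(count, sizeof(*samples));
  return samples;
}
```

    With every allocation written this way, grepping for calloc is enough to
    review them: each line is correct if the variable on the left matches the
    one inside sizeof().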
    
    * gpu_sysman: minimal v6 API support + add units to metric names
    
    Prometheus & OpenMetrics require metric names to be suffixed by the
    metric unit, and ratios (0-1) to be used instead of percentages
    (0-100).
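
    As a small illustration of the percentage-to-ratio change, the sketch
    below converts a 0-100 value into the 0-1 range. The function name is
    hypothetical and is not part of the plugin; a metric reported this way
    would carry a "_ratio" suffix in its name.

```c
/* Converts a percentage in the range 0-100 into a ratio in the range 0-1.
 * Hypothetical helper, for illustration only. */
double percent_to_ratio(double percent) {
  return percent / 100.0;
}
```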
    
    * gpu_sysman: update test code for minimal v6 API support + new metric names
    
    There's now also support for multiple metrics per family although they
    are not used yet. "sstrncpy" is not needed any more.
    
    * gpu_sysman: split metric properties from their names to separate labels
    
    Following labels are used:
    - sub_dev: subdevice ID (unsigned integer)
    - location: e.g. "gpu" / "memory"
    - type: e.g. "request" / "actual"
    - direction: "read" / "write"
    
    Additionally:
    
    * Two location label values were fixed
    
    * GPU engine indices are now per engine type
      (instead of single index being used for all types)
    
    * All metric family and label names have been changed to use
      underscores instead of dashes to separate words, as required by
      Prometheus i.e. collectd does not need to convert them any more:
      https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
    
    * gpu_sysman: update test code to handle metrics split with labels
    
    NOTE: providing NULL as label value to delete it is NOT supported.
    Test code will assert on labels with NULL values.
    
    * gpu_sysman: remove "GPU-" prefix from name and add it as "pci_bdf" label
    
    Also rename GPU struct "name" member to more explicit "pci_bdf".
    
    This allowed simplifying the code slightly.
    
    Sysman API nowadays also supports devices other than GPUs, so the prefix
    is removed to simplify code and to be more future-proof:
    https://spec.oneapi.io/level-zero/latest/core/api.html#_CPPv416ze_device_type_t
    
    (Plugin will still query only GPU devices from Sysman though.)
    
    * gpu_sysman: fix test code for "pci_bdf" added to metrics family
    
    - do not add "pci_bdf" to metric name for matching
    - fix for adding metric labels to family copies of them
    
    * gpu_sysman: improvements to reported metrics
    
    * Fix memory "type" label overwrite
    
    * Replace "free" memory metric with "memory_usage_ratio" one,
      and rename "memory_bytes" to "memory_used_bytes" metric
    
    * Split metric value aggregate function name to a separate
      "function" label
    
    * Have metric family declarations always in the same place in code
    
    * Avoid both setting metric labels, and reporting empty metrics,
      when higher internal sampling rate is used or there are L0
      errors
    
    * gpu_sysman: update tests for sysman plugin changes
    
    * Add "memory_usage_ratio" checks
    
    * Update validation for metrics that can be sampled at higher
      rate i.e. have now the new aggregate function label
    
    * With empty metrics avoided, dispatch mock-up can assert on them
    
    * With extra L0 calls being skipped when not needed, number of calls
      can differ between query rounds:
      - refactor multi-sampling test to handle count changes
      - change error handling checks to be done in single-sampled mode
    
    * Debug output is needed to debug triggered multisample asserts,
      so do that when assert would have been triggered, then abort
    
    * gpu_sysman: add help information for all metric families
    
    And document why const-qual cast is safe, and why GCC does
    not warn about other assignments to .name & .help members.
    
    * gpu_sysman: option to disable utilization metrics for single engines
    
    More powerful GPUs can have a large number of engines of given type,
    but the user may be interested only in the utilization of the
    higher-level engine groups.
    
    "DisableEngineSingle" option allows skipping individual engine metrics.
    
    * gpu_sysman: option for specifying metrics output type
    
    This can be used to specify whether output metrics values will be
    raw, derived or both.
    
    This commit adds support just for the configuration option itself,
    adding / changing metrics to use it happens in next commit.
    
    * gpu_sysman: optional raw metrics output for already supported metrics
    
    This adds new counter type metrics for:
    * memory bandwidth
    * frequency throttle time
    * engine execution time (activity)
    * energy usage
    
    Because collectd internally handles counters as integers, not all units
    can be the ones recommended by Prometheus; these counters keep the
    microseconds and microjoules reported by Sysman.
    
    * gpu_sysman: skip metrics with div-by-zero or time wrap around issues
    
    Zero time intervals or max bandwidth would cause div-by-zero issues
    and (very rare) time wrap around would cause bogus metric value.
    Skip all of them.
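
    The skip logic described above might look roughly like the following
    sketch. It is not the plugin's code: the function name, parameters and
    the choice of a utilization ratio as the derived value are assumptions
    made for illustration. The caller is expected to drop the metric whenever
    the function returns false.

```c
#include <stdbool.h>
#include <stdint.h>

/* Derives a 0-1 utilization ratio from two samples of a monotonically
 * increasing activity counter and their timestamps (same time unit).
 * Returns false, leaving *ratio untouched, when the value would be bogus:
 *  - the interval is zero (division by zero), or
 *  - a timestamp or counter went backwards (wrap around). */
bool utilization_ratio(uint64_t active_prev, uint64_t active_now,
                       uint64_t time_prev, uint64_t time_now, double *ratio) {
  if (time_now <= time_prev)
    return false; /* zero interval or timestamp wrap around */
  if (active_now < active_prev)
    return false; /* activity counter wrap around */
  *ratio = (double)(active_now - active_prev) / (double)(time_now - time_prev);
  return true;
}
```

    Returning a flag instead of a sentinel value keeps "no metric this round"
    distinct from a legitimate zero reading.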
    
    * gpu_sysman: fix test code -Wpedantic + -Wcast-qual warnings
    
    * gpu_sysman: add 'sub_dev' and 'type' labels only when needed
    
    Empty label equals to a missing one, and Prometheus queries can check
    for non-existence of a label, so let's just skip empty / unneeded ones.
    
    Main difference to earlier is that LevelZero error categories that
    provide non-zero values only for uncorrectable type (according to
    spec), are now without a type label. Correctable i.e. zero metrics for
    those categories were skipped already earlier.
    
    * Add "dev_file" label support
    
    And contrib/format.sh include re-order.
    
    "dev_file" support is behind a define (enabled by default) because it
    needs functions that are only part of POSIX, not C99.
    
    Intel Kubernetes GPU plugin uses primary GPU node device file names
    (card0, card1...) as its GPU identifiers.  This new label helps in
    mapping Kubernetes custom metrics to them.
    
    * Move test defines from Sysman plugin to its test code
    
    And document with what GCC warning options the code is tested / passes.
    
    * Change strcpy() in Sysman plugin to sstrncpy()
    
    While for plugin that change does not really help (as target buffer is
    always larger than source), for test code it is useful. And it shuts
    up less capable static checking tools than GCC.
    
    As test code cannot use existing collectd functionality for this (test
    code needs modified versions of some collectd functions, and all
    collectd code does not pass GCC warnings I use), sstrncpy() is copied
    to test code.
    
    For test code there's also a fix to size given for snprintf(), and
    removal of redundant string termination for modified plugin_log() copy
    (vsnprintf() already terminates string).
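
    For readers unfamiliar with the pattern, a bounded string copy that
    always terminates its destination can be sketched as below. The name
    bounded_copy() is hypothetical, and this is not claimed to be identical
    to collectd's sstrncpy(); it only shows the general technique of copying
    at most n - 1 bytes and terminating explicitly.

```c
#include <string.h>

/* Copies at most n - 1 bytes of `src` into `dest` and always terminates
 * `dest`, provided n > 0. Over-long input is silently truncated.
 * Hypothetical helper, for illustration only. */
char *bounded_copy(char *dest, const char *src, size_t n) {
  if (n == 0)
    return dest; /* no room even for the terminator */
  strncpy(dest, src, n - 1);
  dest[n - 1] = '\0';
  return dest;
}
```

    Unlike plain strcpy(), the copy cannot write past the destination buffer,
    which is what simpler static checkers look for.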
    
    * Pass clang-format check for gpu_sysman_test.c comments
    
    * Add scalloc() wrapper similar to smalloc() to common utils
    
    scalloc() wraps calloc() with exit on alloc failure,
    similarly to what smalloc() does for malloc().
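
    A wrapper of this kind could be sketched as follows. The name
    checked_calloc(), the message text and the exit status are assumptions
    for illustration; the actual scalloc() in src/utils/common/common.c logs
    through collectd's own facilities and may differ in detail.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an scalloc()-style wrapper: returns zeroed
 * memory or terminates the process, so callers need no NULL check.
 * Note: calloc() may return NULL for zero-sized requests on some
 * platforms; this sketch treats that as a failure too. */
void *checked_calloc(size_t nmemb, size_t size) {
  void *p = calloc(nmemb, size);
  if (p == NULL) {
    fprintf(stderr, "Not enough memory.\n");
    exit(EXIT_FAILURE);
  }
  return p;
}
```

    Because the check does not depend on assert(), it stays active in builds
    compiled with NDEBUG.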
    
    * Replace Sysman plugin alloc+assert calls with smalloc/scalloc
    
    If asserts were disabled, allocation failures would result in collectd
    memory errors => replace alloc+assert in the plugin with collectd
    smalloc/scalloc wrappers that exit after logging the allocation error.
    
    Downsides are that this does not invoke debugger (which could be in a
    different control group with plenty of memory), nor tell where / what
    allocation failed, like enabled assert would, so test code variants of
    the wrappers still do asserts.
    
    * Pass clang-format check for gpu_sysman_test.c

    Changed files

    • Makefile.am
    • README
    • configure.ac
    • src/collectd.conf.in
    • src/collectd.conf.pod
    • src/gpu_sysman.c
    • src/gpu_sysman_test.c
    • src/utils/common/common.c
    • src/utils/common/common.h