Conversation

@sarah-witt
Contributor

What does this PR do?

Adds a README for IBM Spectrum LSF, as well as documentation for metrics. Also specifies which metrics are monitored by which parameters.

Motivation

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged
@codecov

codecov bot commented Dec 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.07%. Comparing base (6d8718a) to head (fc08a7b).
⚠️ Report is 5 commits behind head on master.

@sarah-witt sarah-witt marked this pull request as ready for review December 5, 2025 21:04
@sarah-witt sarah-witt requested review from a team as code owners December 5, 2025 21:04
@sarah-witt sarah-witt changed the title Update readme and metrics documentation [AI-5939] Update readme and metrics documentation Dec 5, 2025
@iadjivon
Contributor

iadjivon commented Dec 5, 2025

Hi Sarah, thanks for this PR! I've added this to our board for editorial review: DOCS-12849

@iadjivon iadjivon added the `editorial review` label (Waiting on a more in-depth review from a docs team editor) Dec 5, 2025
@github-actions

github-actions bot commented Dec 5, 2025

⚠️ Recommendation: Add qa/skip-qa label

This PR does not modify any files shipped with the agent.

To help streamline the release process, please consider adding the qa/skip-qa label if these changes do not require QA testing.

Contributor

@evazorro evazorro left a comment


Thanks for updating the README! I added some style/wording suggestions and then a couple bigger formatting notes that will be easier for you to change locally.


Add the `dd-agent` user as an LSF [administrator][10].

The integration runs commands such as `lsid`, `bhosts`, and `lsclusters`. In order to run these commands, the Agent needs them in its `PATH`. This is typically done by running `source $LSF_HOME/conf/profile.lsf`. However, the Datadog Agent uses upstart or systemd to orchestrate the datadog-agent service. Environment variables may need to be added to the service configuration files at the default locations of:

Suggested change:
- The integration runs commands such as `lsid`, `bhosts`, and `lsclusters`. In order to run these commands, the Agent needs them in its `PATH`. This is typically done by running `source $LSF_HOME/conf/profile.lsf`. However, the Datadog Agent uses upstart or systemd to orchestrate the datadog-agent service. Environment variables may need to be added to the service configuration files at the default locations of:
+ The integration runs commands such as `lsid`, `bhosts`, and `lsclusters`. In order to run these commands, the Agent needs them in its `PATH`. This is typically done by running `source $LSF_HOME/conf/profile.lsf`. However, the Datadog Agent uses upstart or systemd to orchestrate the `datadog-agent` service. You may need to add environment variables to the service configuration files at the default locations of:

To get the enviornment variables necessary for the agent service, locate the `<LSF_TOP_DIR>/conf/profile.lsf` file and run the following command:

`env -i bash -c "source <LSF_TOP_DIR>/conf/profile.lsf; env"`

Even though this is just one line, I would put it in a standalone code block (three backticks) rather than inline code formatting (one backtick). It'll be easier for customers to copy/paste the command in a standalone code block.


`env -i bash -c "source <LSF_TOP_DIR>/conf/profile.lsf; env"`

This will output a list of environment variables necessary to run the IBM Spectrum LSF commands.

Suggested change:
- This will output a list of environment variables necessary to run the IBM Spectrum LSF commands.
+ Running this command outputs a list of environment variables necessary to run the IBM Spectrum LSF commands.
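To see the pattern in action without an LSF installation, here is a small runnable sketch; `/tmp/profile.demo` and the variables it exports are stand-ins for the real `<LSF_TOP_DIR>/conf/profile.lsf` and its contents:

```shell
# Stand-in for <LSF_TOP_DIR>/conf/profile.lsf, used here only so the
# pattern is runnable; point the real command at your profile.lsf.
cat > /tmp/profile.demo <<'EOF'
export LSF_ENVDIR=/tmp/lsf/conf
export LSF_SERVERDIR=/tmp/lsf/etc
EOF

# Start from an empty environment, source the profile, and print the
# variables it sets -- these are the values the Agent service needs.
env -i bash -c "source /tmp/profile.demo; env" | grep '^LSF_'
```

`env -i` clears the inherited environment first, so the output contains only what the profile script itself exports (plus a few variables `bash` always sets).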
- Upstart: `/etc/init/datadog-agent.conf`
- Systemd: `/lib/systemd/system/datadog-agent.service`
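A minimal sketch of wiring those variables into the systemd service via a drop-in file; the drop-in path follows standard systemd conventions, and the variable values shown are hypothetical placeholders for the output of the `env -i` command:

```ini
# /etc/systemd/system/datadog-agent.service.d/lsf-env.conf
# Hypothetical values -- substitute the variables printed for your installation.
[Service]
Environment="LSF_ENVDIR=/opt/lsf/conf"
Environment="LSF_SERVERDIR=/opt/lsf/10.1/linux3.10-glibc2.17-x86_64/etc"
```

After editing, run `sudo systemctl daemon-reload && sudo systemctl restart datadog-agent` so the service picks up the new environment. (For upstart, the equivalent is `env KEY=value` stanzas in `/etc/init/datadog-agent.conf`.)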

To get the enviornment variables necessary for the agent service, locate the `<LSF_TOP_DIR>/conf/profile.lsf` file and run the following command:

Suggested change:
- To get the enviornment variables necessary for the agent service, locate the `<LSF_TOP_DIR>/conf/profile.lsf` file and run the following command:
+ To get the environment variables necessary for the Agent service, locate the `<LSF_TOP_DIR>/conf/profile.lsf` file and run the following command:

## Troubleshooting

Use the `datadog-agent check` command to view the metrics the integration is collection, as well as debug logs from the check:

Suggested change:
- Use the `datadog-agent check` command to view the metrics the integration is collection, as well as debug logs from the check:
+ Use the `datadog-agent check` command to view the metrics the integration is collecting, as well as debug logs from the check:
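For reference, a hedged example invocation: the check name is assumed to match the integration (`ibm_spectrum_lsf`), running as the `dd-agent` user mirrors the service's environment, and the `--log-level debug` flag is assumed from the Agent CLI (drop it if your Agent version doesn't support it):

```
# One-off run of the check; prints collected metrics and, with the
# log level raised, debug output from the check itself.
sudo -u dd-agent datadog-agent check ibm_spectrum_lsf --log-level debug
```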

1. Edit the `ibm_spectrum_lsf.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your `ibm_spectrum_lsf` performance data. See the [sample ibm_spectrum_lsf.d/conf.yaml][4] for all available configuration options.

The IBM Spectrum LSF integration will run a series of management commands to collect data. To control what commands are run and what metrics are emitted, use the `metric_sources` configuration option. By default, data from the following commands are collected: `lsclusters`, `lshosts`, `bhosts`, `lsload`, `bqueues`, `bslots`, `bjobs`, but you can enable more optional metrics or opt-out of collecting any set of metrics.

Suggested change:
- The IBM Spectrum LSF integration will run a series of management commands to collect data. To control what commands are run and what metrics are emitted, use the `metric_sources` configuration option. By default, data from the following commands are collected: `lsclusters`, `lshosts`, `bhosts`, `lsload`, `bqueues`, `bslots`, `bjobs`, but you can enable more optional metrics or opt-out of collecting any set of metrics.
+ The IBM Spectrum LSF integration runs a series of management commands to collect data. To control which commands are run and which metrics are emitted, use the `metric_sources` configuration option. By default, data from the following commands are collected, but you can enable more optional metrics or opt out of collecting any set of metrics: `lsclusters`, `lshosts`, `bhosts`, `lsload`, `bqueues`, `bslots`, `bjobs`.


For example, if you would like to measure only GPU specific metrics, your metric sources will look like:

Suggested change:
- For example, if you would like to measure only GPU specific metrics, your metric sources will look like:
+ For example, if you want to only measure GPU-specific metrics, your `metric_sources` will look like:
```
metric_sources:
  - bhosts_gpu
```

The `badmin_perfmon` metric source collects fata from the `badmin perfmon view -json` command. This collects [overall statistics][12] about the cluster. To collect these metrics, performance collection must be enabled on your server using the `badmin perfmon start <COLLECTION_INTERVAL>` command. By default, the integration will run this command automatically (and stop collection once the agent is turned off). However, you can turn off this behavior by setting `badmin_perfmon_auto: false`.

Suggested change:
- The `badmin_perfmon` metric source collects fata from the `badmin perfmon view -json` command. This collects [overall statistics][12] about the cluster. To collect these metrics, performance collection must be enabled on your server using the `badmin perfmon start <COLLECTION_INTERVAL>` command. By default, the integration will run this command automatically (and stop collection once the agent is turned off). However, you can turn off this behavior by setting `badmin_perfmon_auto: false`.
+ The `badmin_perfmon` metric source collects data from the `badmin perfmon view -json` command. This collects [overall statistics][12] about the cluster. To collect these metrics, performance collection must be enabled on your server using the `badmin perfmon start <COLLECTION_INTERVAL>` command. By default, the integration runs this command automatically (and stops collection once the Agent is turned off). However, you can turn off this behavior by setting `badmin_perfmon_auto: false`.


Since collecting these metrics can add extra load on your server, we recommend setting a higher collection interval for these metrics, or at least 60. The exact depends on the load and size of your cluster. View IBM Spectrum LSF's [recommendations][13] for managing high query load.

Suggested change:
- Since collecting these metrics can add extra load on your server, we recommend setting a higher collection interval for these metrics, or at least 60. The exact depends on the load and size of your cluster. View IBM Spectrum LSF's [recommendations][13] for managing high query load.
+ Since collecting these metrics can add extra load on your server, we recommend setting a higher collection interval for these metrics, or at least 60. The exact interval depends on the load and size of your cluster. View IBM Spectrum LSF's [recommendations][13] for managing high query load.

Exact interval? Exact number?



Similarly, the `bhist` command collects information about completed jobs, which can be query intensive so we recommend monitoring this command with the `min_collection_interval` set to 60.

Suggested change:
- Similarly, the `bhist` command collects information about completed jobs, which can be query intensive so we recommend monitoring this command with the `min_collection_interval` set to 60.
+ Similarly, the `bhist` command collects information about completed jobs, which can be query-intensive, so we recommend monitoring this command with the `min_collection_interval` set to 60.
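Pulling the recommendations above into one place, a sketch of what an `ibm_spectrum_lsf.d/conf.yaml` instance might look like; the `instances:` layout is the standard Agent convention, and listing `bhist` as a metric source is an assumption based on the sources named in this README:

```yaml
instances:
  - metric_sources:
      - badmin_perfmon
      - bhist
    # Let the operator manage `badmin perfmon start`/stop manually.
    badmin_perfmon_auto: false
    # Query-intensive sources: collect at most once every 60 seconds.
    min_collection_interval: 60
```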