[AI-5939] Update readme and metrics documentation #22069
base: master
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Hi Sarah, thanks for this PR, I've added this to our board for editorial review: DOCS-12849
This PR does not modify any files shipped with the agent. To help streamline the release process, please consider adding the `qa/skip-qa` label.
evazorro left a comment
Thanks for updating the README! I added some style/wording suggestions and then a couple bigger formatting notes that will be easier for you to change locally.
> Add the `dd-agent` user as an LSF [administrator][10].
>
> The integration runs commands such as `lsid`, `bhosts`, and `lsclusters`. In order to run these commands, the Agent needs them in its `PATH`. This is typically done by running `source $LSF_HOME/conf/profile.lsf`. However, the Datadog Agent uses upstart or systemd to orchestrate the datadog-agent service. Environment variables may need to be added to the service configuration files at the default locations of:

Suggested change:

> The integration runs commands such as `lsid`, `bhosts`, and `lsclusters`. In order to run these commands, the Agent needs them in its `PATH`. This is typically done by running `source $LSF_HOME/conf/profile.lsf`. However, the Datadog Agent uses upstart or systemd to orchestrate the `datadog-agent` service. You may need to add environment variables to the service configuration files at the default locations of:
> To get the enviornment variables necessary for the agent service, locate the `<LSF_TOP_DIR>/conf/profile.lsf` file and run the following command:
>
> `env -i bash -c "source <LSF_TOP_DIR>/conf/profile.lsf; env"`
Even though this is just one line, I would put it in a standalone code block (three backticks) rather than inline code formatting (one backtick). It'll be easier for customers to copy/paste the command in a standalone code block.
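For instance, the same command in a standalone fenced block:

```shell
# Start from an empty environment, source the LSF profile, and print the
# resulting environment variables
env -i bash -c "source <LSF_TOP_DIR>/conf/profile.lsf; env"
```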
> This will output a list of environment variables necessary to run the IBM Spectrum LSF commands.

Suggested change:

> Running this command outputs a list of environment variables necessary to run the IBM Spectrum LSF commands.
> - Upstart: `/etc/init/datadog-agent.conf`
> - Systemd: `/lib/systemd/system/datadog-agent.service`
>
> To get the enviornment variables necessary for the agent service, locate the `<LSF_TOP_DIR>/conf/profile.lsf` file and run the following command:

Suggested change:

> To get the environment variables necessary for the Agent service, locate the `<LSF_TOP_DIR>/conf/profile.lsf` file and run the following command:
> ## Troubleshooting
>
> Use the `datadog-agent check` command to view the metrics the integration is collection, as well as debug logs from the check:

Suggested change:

> Use the `datadog-agent check` command to view the metrics the integration is collecting, as well as debug logs from the check:
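As a concrete example (assuming the check name matches the `ibm_spectrum_lsf.d/` configuration directory), this could look like:

```shell
# Run the check once and print the collected metrics and debug output
sudo datadog-agent check ibm_spectrum_lsf
```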
> 1. Edit the `ibm_spectrum_lsf.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your `ibm_spectrum_lsf` performance data. See the [sample ibm_spectrum_lsf.d/conf.yaml][4] for all available configuration options.
>
> The IBM Spectrum LSF integration will run a series of management commands to collect data. To control what commands are run and what metrics are emitted, use the `metric_sources` configuration option. By default, data from the following commands are collected: `lsclusters`, `lshosts`, `bhosts`, `lsload`, `bqueues`, `bslots`, `bjobs`, but you can enable more optional metrics or opt-out of collecting any set of metrics.

Suggested change:

> The IBM Spectrum LSF integration runs a series of management commands to collect data. To control which commands are run and which metrics are emitted, use the `metric_sources` configuration option. By default, data from the following commands are collected, but you can enable more optional metrics or opt out of collecting any set of metrics: `lsclusters`, `lshosts`, `bhosts`, `lsload`, `bqueues`, `bslots`, `bjobs`.
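For illustration, a minimal sketch of `ibm_spectrum_lsf.d/conf.yaml` that spells out the default selection explicitly (assuming `metric_sources` is an instance-level list of command names, as the GPU example below suggests):

```yaml
instances:
  - metric_sources:
      # Default sources; remove entries to opt out, or add optional ones
      - lsclusters
      - lshosts
      - bhosts
      - lsload
      - bqueues
      - bslots
      - bjobs
```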
> For example, if you would like to measure only GPU specific metrics, your metric sources will look like:

Suggested change:

> For example, if you want to only measure GPU-specific metrics, your `metric_sources` will look like:
>   - bhosts_gpu
> ```
> The `badmin_perfmon` metric source collects fata from the `badmin perfmon view -json` command. This collects [overall statistics][12] about the cluster. To collect these metrics, performance collection must be enabled on your server using the `badmin perfmon start <COLLECTION_INTERVAL>` command. By default, the integration will run this command automatically (and stop collection once the agent is turned off). However, you can turn off this behavior by setting `badmin_perfmon_auto: false`.

Suggested change:

> The `badmin_perfmon` metric source collects data from the `badmin perfmon view -json` command. This collects [overall statistics][12] about the cluster. To collect these metrics, performance collection must be enabled on your server using the `badmin perfmon start <COLLECTION_INTERVAL>` command. By default, the integration runs this command automatically (and stops collection once the Agent is turned off). However, you can turn off this behavior by setting `badmin_perfmon_auto: false`.
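For example, a sketch of turning off the automatic `badmin perfmon start` behavior (assuming `badmin_perfmon_auto` is an instance-level option alongside `metric_sources`):

```yaml
instances:
  - metric_sources:
      - badmin_perfmon
    # Assumes performance collection was already started manually with
    # `badmin perfmon start <COLLECTION_INTERVAL>` on the server
    badmin_perfmon_auto: false
```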
> Since collecting these metrics can add extra load on your server, we recommend setting a higher collection interval for these metrics, or at least 60. The exact depends on the load and size of your cluster. View IBM Spectrum LSF's [recommendations][13] for managing high query load.

Suggested change:

> Since collecting these metrics can add extra load on your server, we recommend setting a higher collection interval for these metrics, or at least 60. The exact interval depends on the load and size of your cluster. View IBM Spectrum LSF's [recommendations][13] for managing high query load.
Exact interval? Exact number?
> Similarly, the `bhist` command collects information about completed jobs, which can be query intensive so we recommend monitoring this command with the `min_collection_interval` set to 60.

Suggested change:

> Similarly, the `bhist` command collects information about completed jobs, which can be query-intensive, so we recommend monitoring this command with the `min_collection_interval` set to 60.
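For example, a sketch of a dedicated instance for `bhist` with the recommended interval (assuming `min_collection_interval` is set per instance, as in other Agent checks):

```yaml
instances:
  - metric_sources:
      - bhist
    # Run this query-intensive source at most once every 60 seconds
    min_collection_interval: 60
```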
What does this PR do?
Adds a README for IBM Spectrum LSF, as well as documentation for metrics. Also specifies which metrics are monitored by which parameters.
Motivation
Review checklist (to be filled by reviewers)
- Add the `qa/skip-qa` label if the PR doesn't need to be tested during QA.
- Add the `backport/<branch-name>` label to the PR and it will automatically open a backport PR once this one is merged.