Merged
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -24,6 +24,7 @@ All notable changes to this project will be documented in this file.
- The built-in Prometheus servlet is now enabled and metrics are exposed under the `/prom` path of all UI services ([#695]).
- Add several properties to `hdfs-site.xml` and `core-site.xml` that improve general performance and reliability ([#696]).
- Add RBAC rule to helm template for automatic cluster domain detection ([#699]).
- Add `prometheus.io/path|port|scheme` annotations to metrics service ([#721]).

### Changed

@@ -48,6 +49,9 @@ All notable changes to this project will be documented in this file.
- The CLI argument `--kubernetes-node-name` or env variable `KUBERNETES_NODE_NAME` needs to be set. The helm-chart takes care of this.
- The operator helm-chart now grants RBAC `patch` permissions on `events.k8s.io/events`,
so events can be aggregated (e.g. "error happened 10 times over the last 5 minutes") ([#700]).
- BREAKING: Renamed headless rolegroup service from `<stacklet>-<role>-<rolegroup>` to `<stacklet>-<role>-<rolegroup>-metrics` ([#721]).
- The `prometheus.io/scrape` label was moved to the metrics service
- The headless service now exposes only the product/data ports, while the metrics service exposes only the metrics ports

### Fixed

@@ -76,6 +80,7 @@ All notable changes to this project will be documented in this file.
[#697]: https://github.com/stackabletech/hdfs-operator/pull/697
[#699]: https://github.com/stackabletech/hdfs-operator/pull/699
[#700]: https://github.com/stackabletech/hdfs-operator/pull/700
[#721]: https://github.com/stackabletech/hdfs-operator/pull/721

## [25.3.0] - 2025-03-21

20 changes: 14 additions & 6 deletions docs/modules/hdfs/pages/usage-guide/monitoring.adoc
@@ -1,17 +1,25 @@
= Monitoring
:description: The HDFS cluster can be monitored with Prometheus from inside or outside the K8S cluster.
:description: The HDFS cluster is automatically configured to export Prometheus metrics.

The cluster can be monitored with Prometheus from inside or outside the K8S cluster.

All services (with the exception of the Zookeeper daemon on the node names) run with the JMX exporter agent enabled and expose metrics on the `metrics` port.
This port is available from the container level up to the NodePort services.
The managed HDFS stacklets are automatically configured to export Prometheus metrics.
See xref:operators:monitoring.adoc[] for more details.

[IMPORTANT]
====
Starting with Stackable Data Platform 25.7, the built-in Prometheus metrics are also available at the `/prom` endpoint of all the UI services.
Starting with Stackable Data Platform 25.7, the built-in Prometheus metrics are available at the `/prom` endpoint of all the UI services.
The JMX exporter metrics are now deprecated and will be removed in a future release.
====

The metrics endpoints are also used as liveness probes by Kubernetes.
In the case of the NameNode service, this endpoint is reachable via the `metrics` service:
[source,shell]
----
http://<hdfs-stacklet>-namenode-<rolegroup-name>-metrics:9870/prom
----

See xref:operators:monitoring.adoc[] for more details.

== Authentication when using TLS

HDFS exposes metrics through the same port as its web UI. Hence, when HDFS is configured with TLS, the metrics are also secured by TLS,
and clients scraping the metrics endpoint need to authenticate against it. This can be accomplished, for example, by using mTLS
between Kubernetes Pods with the xref:home:secret-operator:index.adoc[Secret Operator].
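As a sketch, such a scrape could look like the following `curl` invocation against the NameNode metrics service. The certificate paths are illustrative assumptions and depend on how the secret volumes are mounted; they are not fixed paths guaranteed by the operator:

[source,shell]
----
# Hypothetical mTLS scrape of the NameNode metrics endpoint.
# The certificate paths below are assumptions for illustration;
# the actual mount paths depend on the secret volume configuration.
curl \
  --cacert /stackable/tls/ca.crt \
  --cert /stackable/tls/tls.crt \
  --key /stackable/tls/tls.key \
  "https://<hdfs-stacklet>-namenode-<rolegroup>-metrics:9871/prom"
----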
54 changes: 30 additions & 24 deletions rust/operator-binary/src/container.rs
@@ -48,6 +48,7 @@ use stackable_operator::{
CustomContainerLogConfig,
},
},
role_utils::RoleGroupRef,
utils::{COMMON_BASH_TRAP_FUNCTIONS, cluster_info::KubernetesClusterInfo},
};
use strum::{Display, EnumDiscriminants, IntoStaticStr};
@@ -216,24 +217,25 @@ impl ContainerConfig {
hdfs: &v1alpha1::HdfsCluster,
cluster_info: &KubernetesClusterInfo,
role: &HdfsNodeRole,
role_group: &str,
rolegroup_ref: &RoleGroupRef<v1alpha1::HdfsCluster>,
resolved_product_image: &ResolvedProductImage,
merged_config: &AnyNodeConfig,
env_overrides: Option<&BTreeMap<String, String>>,
zk_config_map_name: &str,
object_name: &str,
namenode_podrefs: &[HdfsPodRef],
labels: &Labels,
) -> Result<(), Error> {
// HDFS main container
let main_container_config = Self::from(*role);
pb.add_volumes(main_container_config.volumes(merged_config, object_name, labels)?)
let object_name = rolegroup_ref.object_name();

pb.add_volumes(main_container_config.volumes(merged_config, &object_name, labels)?)
.context(AddVolumeSnafu)?;
pb.add_container(main_container_config.main_container(
hdfs,
cluster_info,
role,
role_group,
rolegroup_ref,
resolved_product_image,
zk_config_map_name,
env_overrides,
@@ -277,6 +279,8 @@ impl ContainerConfig {
)
.with_pod_scope()
.with_node_scope()
// To scrape metrics behind TLS endpoint (without FQDN)
.with_service_scope(rolegroup_ref.rolegroup_metrics_service_name())
.with_format(SecretFormat::TlsPkcs12)
.with_tls_pkcs12_password(TLS_STORE_PASSWORD)
.with_auto_tls_cert_lifetime(
@@ -319,15 +323,15 @@ impl ContainerConfig {
let zkfc_container_config = Self::try_from(NameNodeContainer::Zkfc.to_string())?;
pb.add_volumes(zkfc_container_config.volumes(
merged_config,
object_name,
&object_name,
labels,
)?)
.context(AddVolumeSnafu)?;
pb.add_container(zkfc_container_config.main_container(
hdfs,
cluster_info,
role,
role_group,
rolegroup_ref,
resolved_product_image,
zk_config_map_name,
env_overrides,
@@ -340,15 +344,15 @@
Self::try_from(NameNodeContainer::FormatNameNodes.to_string())?;
pb.add_volumes(format_namenodes_container_config.volumes(
merged_config,
object_name,
&object_name,
labels,
)?)
.context(AddVolumeSnafu)?;
pb.add_init_container(format_namenodes_container_config.init_container(
hdfs,
cluster_info,
role,
role_group,
&rolegroup_ref.role_group,
resolved_product_image,
zk_config_map_name,
env_overrides,
@@ -362,15 +366,15 @@
Self::try_from(NameNodeContainer::FormatZooKeeper.to_string())?;
pb.add_volumes(format_zookeeper_container_config.volumes(
merged_config,
object_name,
&object_name,
labels,
)?)
.context(AddVolumeSnafu)?;
pb.add_init_container(format_zookeeper_container_config.init_container(
hdfs,
cluster_info,
role,
role_group,
&rolegroup_ref.role_group,
resolved_product_image,
zk_config_map_name,
env_overrides,
@@ -385,15 +389,15 @@
Self::try_from(DataNodeContainer::WaitForNameNodes.to_string())?;
pb.add_volumes(wait_for_namenodes_container_config.volumes(
merged_config,
object_name,
&object_name,
labels,
)?)
.context(AddVolumeSnafu)?;
pb.add_init_container(wait_for_namenodes_container_config.init_container(
hdfs,
cluster_info,
role,
role_group,
&rolegroup_ref.role_group,
resolved_product_image,
zk_config_map_name,
env_overrides,
@@ -462,7 +466,7 @@ impl ContainerConfig {
hdfs: &v1alpha1::HdfsCluster,
cluster_info: &KubernetesClusterInfo,
role: &HdfsNodeRole,
role_group: &str,
rolegroup_ref: &RoleGroupRef<v1alpha1::HdfsCluster>,
resolved_product_image: &ResolvedProductImage,
zookeeper_config_map_name: &str,
env_overrides: Option<&BTreeMap<String, String>>,
@@ -481,7 +485,7 @@
.args(self.args(hdfs, cluster_info, role, merged_config, &[])?)
.add_env_vars(self.env(
hdfs,
role_group,
&rolegroup_ref.role_group,
zookeeper_config_map_name,
env_overrides,
resources.as_ref(),
@@ -1249,16 +1253,18 @@ wait_for_termination $!
/// Container ports for the main containers namenode, datanode and journalnode.
fn container_ports(&self, hdfs: &v1alpha1::HdfsCluster) -> Vec<ContainerPort> {
match self {
ContainerConfig::Hdfs { role, .. } => hdfs
.ports(role)
.into_iter()
.map(|(name, value)| ContainerPort {
name: Some(name),
container_port: i32::from(value),
protocol: Some("TCP".to_string()),
..ContainerPort::default()
})
.collect(),
ContainerConfig::Hdfs { role, .. } => {
// data ports
hdfs.hdfs_main_container_ports(role)
.into_iter()
.map(|(name, value)| ContainerPort {
name: Some(name),
container_port: i32::from(value),
protocol: Some("TCP".to_string()),
..ContainerPort::default()
})
.collect()
}
_ => {
vec![]
}
7 changes: 7 additions & 0 deletions rust/operator-binary/src/crd/constants.rs
@@ -20,21 +20,28 @@ pub const SERVICE_PORT_NAME_HTTP: &str = "http";
pub const SERVICE_PORT_NAME_HTTPS: &str = "https";
pub const SERVICE_PORT_NAME_DATA: &str = "data";
pub const SERVICE_PORT_NAME_METRICS: &str = "metrics";
pub const SERVICE_PORT_NAME_JMX_METRICS: &str = "jmx-metrics";

pub const DEFAULT_LISTENER_CLASS: &str = "cluster-internal";

pub const DEFAULT_NAME_NODE_METRICS_PORT: u16 = 8183;
pub const DEFAULT_NAME_NODE_NATIVE_METRICS_HTTP_PORT: u16 = 9870;
pub const DEFAULT_NAME_NODE_NATIVE_METRICS_HTTPS_PORT: u16 = 9871;
pub const DEFAULT_NAME_NODE_HTTP_PORT: u16 = 9870;
pub const DEFAULT_NAME_NODE_HTTPS_PORT: u16 = 9871;
pub const DEFAULT_NAME_NODE_RPC_PORT: u16 = 8020;

pub const DEFAULT_DATA_NODE_METRICS_PORT: u16 = 8082;
pub const DEFAULT_DATA_NODE_NATIVE_METRICS_HTTP_PORT: u16 = 9864;
pub const DEFAULT_DATA_NODE_NATIVE_METRICS_HTTPS_PORT: u16 = 9865;
pub const DEFAULT_DATA_NODE_HTTP_PORT: u16 = 9864;
pub const DEFAULT_DATA_NODE_HTTPS_PORT: u16 = 9865;
pub const DEFAULT_DATA_NODE_DATA_PORT: u16 = 9866;
pub const DEFAULT_DATA_NODE_IPC_PORT: u16 = 9867;

pub const DEFAULT_JOURNAL_NODE_METRICS_PORT: u16 = 8081;
pub const DEFAULT_JOURNAL_NODE_NATIVE_METRICS_HTTP_PORT: u16 = 8480;
pub const DEFAULT_JOURNAL_NODE_NATIVE_METRICS_HTTPS_PORT: u16 = 8481;
pub const DEFAULT_JOURNAL_NODE_HTTP_PORT: u16 = 8480;
pub const DEFAULT_JOURNAL_NODE_HTTPS_PORT: u16 = 8481;
pub const DEFAULT_JOURNAL_NODE_RPC_PORT: u16 = 8485;