
Conversation

@yug-rajani
Contributor

@yug-rajani yug-rajani commented Jan 28, 2022

What does this PR do?

  • Generated the skeleton of the Hadoop integration package.
  • Added 3 data streams (Application Metrics, Expanded Cluster Metrics, and Jolokia Metrics).
  • Added data collection logic for all 3 data streams.
  • Added the ingest pipeline for all the data streams.
  • Mapped fields according to the ECS schema and added Fields metadata in the appropriate yml files.
  • Added dashboards and visualizations.
  • Added pipeline tests for the applicable data streams.
  • Added system test cases for all the data streams.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • If I'm introducing a new feature, I have modified the Kibana version constraint in my package's manifest.yml file to point to the latest Elastic stack release (e.g. ^7.13.0).

How to test this PR locally

  • Clone the integrations repo.
  • Install elastic-package locally.
  • Start the Elastic stack using elastic-package.
  • Move to the integrations/packages/hadoop directory.
  • Run the following command to run the tests.

elastic-package test
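
As a quick reference, the end-to-end flow looks roughly like this (a sketch; it assumes elastic-package is already on your PATH and uses a fresh clone of the integrations repo):

git clone https://github.com/elastic/integrations.git
cd integrations
elastic-package stack up -d
eval "$(elastic-package stack shellinit)"
cd packages/hadoop
elastic-package test -v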

Screenshots

(Screenshots attached: the Integrations UI listing and the Hadoop-Integration dashboards.)

@elasticmachine

elasticmachine commented Jan 28, 2022

💔 Build Failed


Build stats

  • Start Time: 2022-03-31T08:05:28.692+0000

  • Duration: 16 min 47 sec

Steps errors (2)


Test integration: hadoop
  • Took 2 min 52 sec. View more details here
  • Description: eval "$(../../build/elastic-package stack shellinit)" ../../build/elastic-package test -v --report-format xUnit --report-output file --test-coverage
Google Storage Download
  • Took 0 min 0 sec. View more details here

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.
@yug-rajani yug-rajani self-assigned this Feb 1, 2022
@yug-rajani yug-rajani requested a review from mtojek February 1, 2022 05:31
@yug-rajani yug-rajani added the Integration:hadoop, New Integration, enhancement, and Team:Integrations labels Feb 1, 2022
@elasticmachine

Pinging @elastic/integrations (Team:Integrations)

@yug-rajani yug-rajani linked an issue Feb 2, 2022 that may be closed by this pull request
Contributor

@mtojek mtojek left a comment


A few nit-picks to clarify; otherwise, it looks fine to me.

- monitoring
release: beta
conditions:
  kibana.version: ^7.16.0 || ^8.0.0
Contributor

That's probably something we'd like to adjust? cc @akshay-saraswat, we target 8.0, right?
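
For illustration, the adjusted constraint in the package manifest.yml could look roughly like this if the integration targets the 8.0 stack (a sketch; the exact constraint is up to the team):

conditions:
  kibana.version: "^8.0.0"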

@@ -0,0 +1,100 @@
FROM centos:7
Contributor

When we have to build a custom Docker image, we try to use the Ubuntu base image, so that's most likely something to be adjusted.

Also, a standard question: is there an official Docker image for Hadoop that we can reuse here? Did you check?

Contributor Author

Sure, updated the system tests to use the Ubuntu base image. Please verify.

Yes, I did check for official Docker images for Hadoop, but there is no such image that we can reuse here. That is why we decided to go with building a custom Docker image.

Contributor

What was wrong with that image: https://hub.docker.com/r/apache/hadoop/?

Contributor Author

We tried using that Docker image but ran into this issue:
Error response from daemon: manifest for apache/hadoop:latest not found: manifest unknown: manifest unknown
So, we decided to go with building a custom Docker image.

In addition, after updating the base image to Ubuntu, the system tests for the jolokia_metrics data stream are failing in CI. The reason seems to be that the ports are unavailable. The tests pass in our local environment. Can you please guide us on which port we should use?
Thanks!

Contributor

The latest tag isn't present. Please take a look at the available tags for apache/hadoop.

I can download apache/hadoop:2 and apache/hadoop:3.

Contributor Author

We tried the apache/hadoop:2 and apache/hadoop:3 images. They download fine, but when we run a container it exits immediately; docker ps -a shows the container as exited, and the logs for the exited container are empty. Do you have any other suggestions?
(Screenshot of the docker ps -a output attached.)

Contributor

You should jump into the image and check what's inside the starter.sh. For example:

docker run -it apache/hadoop:3 sh
sh-4.2$ cat /opt/starter.sh

Then you can look for online help for it and find a post on flokkr, which contains a ready docker-compose setup.

For example, to start a data node (which will fail in a standalone setup):

docker run -it apache/hadoop:3 hdfs datanode 
Contributor Author
@yug-rajani yug-rajani Feb 22, 2022

Thank you for the guidance. Let me try these things out.

Contributor Author

These are the two errors we are facing with the system tests when using the official Docker image for Hadoop:

localhost: ssh: connect to host localhost port 22: Network is unreachable 
ERROR: Cannot set priority of resourcemanager process 138 
Contributor

Why do you need the SSH configuration there?

@yug-rajani
Contributor Author

/test

@@ -0,0 +1,23 @@
#!/bin/bash
Contributor

What's wrong with the original entrypoint? Did you try to replace (via mount) the hadoop-env.sh file?
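
For illustration, replacing the file via a bind mount in docker-compose could look roughly like this (a sketch; the container path /opt/hadoop/etc/hadoop/hadoop-env.sh is an assumption based on the /opt/hadoop install prefix used elsewhere in this thread):

hadoop:
  image: apache/hadoop:3
  volumes:
    # Override the image's hadoop-env.sh with a local copy, read-only
    - ./hadoop-env.sh:/opt/hadoop/etc/hadoop/hadoop-env.sh:ro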

sudo /opt/hadoop/bin/hdfs namenode -format
sudo /opt/hadoop/sbin/start-dfs.sh
export PDSH_RCMD_TYPE=ssh
sudo /opt/hadoop/sbin/start-yarn.sh
Contributor

Doesn't yarn start in the original entrypoint?

Contributor
@mtojek mtojek left a comment

@yug-elastic I tried to use the apache/hadoop:3 image and didn't face any serious problems. My setup:

config:

CORE-SITE.XML_fs.default.name=hdfs://namenode:9000
CORE-SITE.XML_fs.defaultFS=hdfs://namenode:9000
HDFS-SITE.XML_dfs.namenode.rpc-address=namenode:9000
HDFS-SITE.XML_dfs.replication=1
LOG4J.PROPERTIES_log4j.rootLogger=INFO, stdout
LOG4J.PROPERTIES_log4j.appender.stdout=org.apache.log4j.ConsoleAppender
LOG4J.PROPERTIES_log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
LOG4J.PROPERTIES_log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
MAPRED-SITE.XML_mapreduce.framework.name=yarn
MAPRED-SITE.XML_yarn.app.mapreduce.am.env=HADOOP_MAPRED_HOME=/opt/hadoop
MAPRED-SITE.XML_mapreduce.map.env=HADOOP_MAPRED_HOME=/opt/hadoop
MAPRED-SITE.XML_mapreduce.reduce.env=HADOOP_MAPRED_HOME=/opt/hadoop
YARN-SITE.XML_yarn.resourcemanager.hostname=resourcemanager
YARN-SITE.XML_yarn.nodemanager.pmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.delete.debug-delay-sec=600
YARN-SITE.XML_yarn.nodemanager.vmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.aux-services=mapreduce_shuffle
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-applications=10000
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-am-resource-percent=0.1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.queues=default
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.user-limit-factor=1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.maximum-capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.state=RUNNING
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_submit_applications=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_administer_queue=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.node-locality-delay=40
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings=
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings-override.enable=false

docker-compose.yml:

version: "3" services: namenode: image: apache/hadoop:${HADOOP_VERSION} hostname: namenode command: ["hdfs", "namenode"] ports: - 50070:50070 - 9870:9870 env_file: - config environment: ENSURE_NAMENODE_DIR: "/tmp/hadoop-hadoop/dfs/name" datanode: image: apache/hadoop:${HADOOP_VERSION} command: ["hdfs", "datanode"] ports: - 9864:9864 links: - namenode env_file: - config resourcemanager: image: apache/hadoop:${HADOOP_VERSION} hostname: resourcemanager command: ["yarn", "resourcemanager"] ports: - 8088:8088 env_file: - config volumes: - ./testdata:/opt/testdata nodemanager: image: apache/hadoop:${HADOOP_VERSION} command: ["yarn","nodemanager"] ports: - 8042:8042 links: - resourcemanager - namenode env_file: - config 

I haven't written any custom Dockerfile. I guess this YAML can be improved with healthchecks. Anyway, JMX is already exposed.
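
For illustration, a healthcheck could be added per service along these lines (a sketch only; it assumes curl is available inside the apache/hadoop image, and the port should match the service, e.g. 9870 for the namenode web UI):

    healthcheck:
      # Consider the service healthy once its web endpoint responds
      test: ["CMD", "curl", "-f", "http://localhost:9870"]
      interval: 10s
      timeout: 5s
      retries: 30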

@yug-rajani yug-rajani requested a review from a team as a code owner March 4, 2022 08:39
hadoop:
  build:
    context: ./Dockerfiles
    dockerfile: Dockerfile-namenode
Contributor

@yug-elastic Why do we need a custom Dockerfile? Is there something wrong with the original entry point? Correct me if I'm wrong, but JMX was already exposed there?

Contributor Author

You're right, but we are using Jolokia, which, as per our understanding, wraps JMX. That is why we have added a custom Dockerfile here, inside which we set up Jolokia and configure Hadoop with it. Does that make sense?

Contributor

Consider the docker-compose I provided you, without a custom Dockerfile.

If you go to http://localhost:9864/jmx, you can fetch all these metrics and don't have to install Jolokia, right? This is because the Metrics2 framework is exposed there.

Please check if that API differs much from Jolokia. We shouldn't force users to install extensions if there is native support.

WDYT?
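
For comparison, a quick way to inspect that native endpoint could be something like the following (a sketch; port 9864 is the datanode port from the compose file above, and the qry filter is just one example MBean pattern):

# Fetch everything the datanode's built-in JMX JSON servlet exposes
curl -s http://localhost:9864/jmx

# Filter to a single MBean pattern via the qry parameter
curl -s "http://localhost:9864/jmx?qry=Hadoop:service=DataNode,name=JvmMetrics"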

# sxSmbIUfc2SGJGCJD4I=
# -----END CERTIFICATE-----
owner:
github: elastic/integrations
Member

Please add an entry for this package in .github/CODEOWNERS.
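
For illustration, the entry could look something like this (a sketch; the team handle is taken from the owner.github field shown above and should match the package manifest):

/packages/hadoop @elastic/integrations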

Contributor Author
@yug-rajani yug-rajani Mar 7, 2022

Sure. Added it.

@yug-rajani
Contributor Author

/test

Contributor
@mtojek mtojek left a comment

It was missed in previous rounds but spotted in the Apache Spark PR, and it also applies here:

We need to split metrics into logical areas: application and cluster are fine, but jolokia_metrics doesn't refer to an area; it refers to a message channel. To be honest, we should also iterate on Cassandra, as it looks poor compared to this integration, but let's leave that for another story.

type: group
release: beta
fields:
  - name: application_metrics
Contributor

I don't understand these metrics here. They look like random data points put in the same bucket. What exactly does memory_seconds mean? Why in the same group do we have time, running containers, vcore (?), and progress?

Contributor Author

memory_seconds = The amount of memory the application has allocated
We have put these fields in the same bucket with reference to the PRD. Please check this link.

@@ -0,0 +1,72 @@
- name: hadoop.metrics
Contributor

This feedback applies to all published metrics - we need descriptions. How will the end user know what each field means?
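
For illustration, a documented field entry could look roughly like this (a sketch only; the field name, type, and wording are placeholders rather than the final mapping):

- name: hadoop.application.memory_seconds
  type: long
  description: Amount of memory the application has allocated, aggregated over its run time (MB-seconds).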

Contributor Author

Makes sense, we'll update them.

@yug-rajani
Contributor Author

It was missed in previous rounds but spotted in the Apache Spark PR, and it also applies here:

We need to split metrics into logical areas: application and cluster are fine, but jolokia_metrics doesn't refer to an area; it refers to a message channel. To be honest, we should also iterate on Cassandra, as it looks poor compared to this integration, but let's leave that for another story.

Okay, we'll need to iterate on this then. As of now, we have the following types of metrics as part of jolokia_metrics:
NameNode, DataNode, Cluster, Node Manager
Let me know what you think of these groups:
[NameNode], [DataNode], [Cluster, Expanded Cluster Metrics], [Node Manager]
The existing 'expanded_cluster_metrics' can be clubbed in, but we'll need to move its data collection approach from httpjson to the http Metricbeat module.
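
For illustration, collection via the http Metricbeat module could look roughly like this (a sketch only; the ResourceManager host, port, and API path are assumptions based on the standard YARN cluster metrics endpoint and the compose file above, not the final data stream configuration):

- module: http
  metricsets: ["json"]
  period: 60s
  hosts: ["http://resourcemanager:8088"]
  # Poll the YARN ResourceManager cluster metrics API and store the JSON under this namespace
  path: "/ws/v1/cluster/metrics"
  namespace: "cluster"
  method: "GET"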

@mtojek
Contributor

mtojek commented Mar 15, 2022

It sounds good to me, but don't hesitate to split it into smaller data streams if there are logical reasons to do this.

The existing 'expanded_cluster_metrics' can be clubbed in, but we'll need to move its data collection approach from httpjson to the http Metricbeat module.

As long as it doesn't require any development in Metricbeat, I'm good with that as well.

@yug-rajani
Contributor Author

As long as it doesn't require any development in Metricbeat, I'm good with that as well.

I don't think it would require any development in Metricbeat. What do you recommend?
Do you think keeping cluster_metrics and expanded_cluster_metrics as two separate data streams would make sense?

@mtojek
Contributor

mtojek commented Mar 15, 2022

Thanks for passing the links. There aren't too many metrics, so I would merge those into a single cluster metrics data stream. Unless it's a complex operation to merge them; in that case, let's keep them separate :)

@mtojek mtojek requested a review from ruflin March 29, 2022 08:42
@ruflin
Contributor

ruflin commented Mar 30, 2022

Based on the previous discussion, it seems some refactoring on the metrics side is still needed. I would recommend the same approach for this PR as discussed in #2811 (comment).

@ruflin
Contributor

ruflin commented Mar 31, 2022

Same question as for the other PRs: which one should we start reviewing as the foundation? Let's put the others in draft.

title: "Hadoop"
version: 0.1.0
license: basic
description: "This Elastic integration collects metrics from hadoop."
Contributor
@akshay-saraswat akshay-saraswat Apr 27, 2022

Suggested description: "Collect metrics from Apache Hadoop with Elastic Agent."
Let's be consistent with other integrations. This description shows up as a summary of the integration tile in the integrations UI. Otherwise, this PR looks good to me.

Contributor Author

Thanks for the approval, @akshay-saraswat!
Makes sense, I have made the description consistent as a part of this PR.
Quick reference: https://github.com/elastic/integrations/pull/2953/files#diff-b6b7daea47c22ecde7384eb883e3f257488dc5ce3bae2bc1ecbdd7097b93df7aR6
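
For reference, the updated manifest.yml line would read along these lines (a sketch based on the suggestion above):

description: "Collect metrics from Apache Hadoop with Elastic Agent."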

@yug-rajani
Contributor Author

Closing this PR as it was split up into multiple PRs as discussed in the comment #2614 (comment). All the parts are now merged and the linked issue (#1543) has been closed.

Thanks a lot @mtojek, @akshay-saraswat and @lalit-satapathy for taking the time to review the PRs and providing valuable feedback!

@yug-rajani yug-rajani closed this May 12, 2022