
Conversation

@yug-rajani
Contributor

@yug-rajani yug-rajani commented Jan 28, 2022

What does this PR do?

  • Generated the skeleton of the Hadoop integration package.
  • Added 3 data streams (Application Metrics, Expanded Cluster Metrics, and Jolokia Metrics).
  • Added data collection logic for all 3 data streams.
  • Added the ingest pipeline for all the data streams.
  • Mapped fields according to the ECS schema and added Fields metadata in the appropriate yml files.
  • Added dashboards and visualizations.
  • Added pipeline tests for the applicable data streams.
  • Added system test cases for all the data streams.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • If I'm introducing a new feature, I have modified the Kibana version constraint in my package's manifest.yml file to point to the latest Elastic stack release (e.g. ^7.13.0).

How to test this PR locally

  • Clone the integrations repo.
  • Install elastic-package locally.
  • Start the Elastic stack using elastic-package.
  • Move to the integrations/packages/hadoop directory.
  • Run the following command to run the tests.

elastic-package test
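
As a quick reference, the end-to-end flow looks roughly like this (a sketch; it assumes elastic-package is already on your PATH and uses a fresh clone of the integrations repo):

git clone https://github.com/elastic/integrations.git
cd integrations
elastic-package stack up -d
eval "$(elastic-package stack shellinit)"
cd packages/hadoop
elastic-package test -v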

Screenshots

(Screenshots attached: the Integrations UI listing and the Hadoop-Integration dashboards.)

@elasticmachine

elasticmachine commented Jan 28, 2022

💔 Build Failed


Build stats

  • Start Time: 2022-03-31T08:05:28.692+0000

  • Duration: 16 min 47 sec

Steps errors (2)


Test integration: hadoop
  • Took 2 min 52 sec. View more details here
  • Description: eval "$(../../build/elastic-package stack shellinit)" ../../build/elastic-package test -v --report-format xUnit --report-output file --test-coverage
Google Storage Download
  • Took 0 min 0 sec. View more details here

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.
@yug-rajani yug-rajani self-assigned this Feb 1, 2022
@yug-rajani yug-rajani requested a review from mtojek February 1, 2022 05:31
@yug-rajani yug-rajani added the Integration:hadoop, New Integration, enhancement, and Team:Integrations labels Feb 1, 2022
@elasticmachine

Pinging @elastic/integrations (Team:Integrations)

@yug-rajani yug-rajani linked an issue Feb 2, 2022 that may be closed by this pull request
Contributor

@mtojek mtojek left a comment


A few nit-picks to clarify; otherwise, it looks fine to me.

- monitoring
release: beta
conditions:
  kibana.version: ^7.16.0 || ^8.0.0
Contributor

That's probably something we'd like to adjust? cc @akshay-saraswat, we target 8.0, right?
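
For illustration, the adjusted constraint in the package manifest.yml could look roughly like this if the integration targets the 8.0 stack (a sketch; the exact constraint is up to the team):

conditions:
  kibana.version: "^8.0.0"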

@@ -0,0 +1,100 @@
FROM centos:7
Contributor

When we have to build a custom Docker image, we try to use the Ubuntu base image, so that's most likely something to be adjusted.

Also, a standard question: is there an official Docker image for Hadoop that we can reuse here? Did you check?

Contributor Author

Sure, updated the system tests to use the Ubuntu base image. Please verify.

Yes, I did check for official Docker images for Hadoop, but there is no such image that we can reuse here. That is why we decided to go with building a custom Docker image.

Contributor

What was wrong with that image: https://hub.docker.com/r/apache/hadoop/?

Contributor Author

We tried using that Docker image but ran into this issue:
Error response from daemon: manifest for apache/hadoop:latest not found: manifest unknown: manifest unknown
So, we decided to go with building a custom Docker image.

In addition, after updating the base image to Ubuntu, the system tests for the jolokia_metrics data stream are failing in CI. The reason seems to be that the ports are unavailable. The tests pass in our local environment. Can you please guide us on which port we should use?
Thanks!

Contributor

The latest tag isn't present. Please take a look at the available tags for apache/hadoop.

I can download apache/hadoop:2 and apache/hadoop:3.

Contributor Author

We tried the apache/hadoop:2 and apache/hadoop:3 images. They download fine, but when we run a container it exits immediately; docker ps -a shows the container as exited, and the logs for the exited container are empty. Do you have any other suggestions?
(Screenshot of the docker ps -a output attached.)

Contributor

You should jump into the image and check what's inside the starter.sh. For example:

docker run -it apache/hadoop:3 sh
sh-4.2$ cat /opt/starter.sh

Then you can look for online help for it and find a post on flokkr, which contains a ready docker-compose setup.

For example, to start a data node (which will fail in a standalone setup):

docker run -it apache/hadoop:3 hdfs datanode 
Contributor Author
@yug-rajani yug-rajani Feb 22, 2022

Thank you for the guidance. Let me try these things out.

Contributor Author

These are the two errors we are facing with the system tests when using the official Docker image for Hadoop:

localhost: ssh: connect to host localhost port 22: Network is unreachable 
ERROR: Cannot set priority of resourcemanager process 138 
Contributor

Why do you need the SSH configuration there?

@yug-rajani
Contributor Author

/test

@@ -0,0 +1,23 @@
#!/bin/bash
Contributor

What's wrong with the original entrypoint? Did you try to replace (via mount) the hadoop-env.sh file?
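
For illustration, replacing the file via a bind mount in docker-compose could look roughly like this (a sketch; the container path /opt/hadoop/etc/hadoop/hadoop-env.sh is an assumption based on the /opt/hadoop install prefix used elsewhere in this thread):

hadoop:
  image: apache/hadoop:3
  volumes:
    # Override the image's hadoop-env.sh with a local copy, read-only
    - ./hadoop-env.sh:/opt/hadoop/etc/hadoop/hadoop-env.sh:ro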

sudo /opt/hadoop/bin/hdfs namenode -format
sudo /opt/hadoop/sbin/start-dfs.sh
export PDSH_RCMD_TYPE=ssh
sudo /opt/hadoop/sbin/start-yarn.sh
Contributor

Doesn't yarn start in the original entrypoint?

Contributor
@mtojek mtojek left a comment

@yug-elastic I tried to use the apache/hadoop:3 image and didn't face any serious problems. My setup:

config:

CORE-SITE.XML_fs.default.name=hdfs://namenode:9000
CORE-SITE.XML_fs.defaultFS=hdfs://namenode:9000
HDFS-SITE.XML_dfs.namenode.rpc-address=namenode:9000
HDFS-SITE.XML_dfs.replication=1
LOG4J.PROPERTIES_log4j.rootLogger=INFO, stdout
LOG4J.PROPERTIES_log4j.appender.stdout=org.apache.log4j.ConsoleAppender
LOG4J.PROPERTIES_log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
LOG4J.PROPERTIES_log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
MAPRED-SITE.XML_mapreduce.framework.name=yarn
MAPRED-SITE.XML_yarn.app.mapreduce.am.env=HADOOP_MAPRED_HOME=/opt/hadoop
MAPRED-SITE.XML_mapreduce.map.env=HADOOP_MAPRED_HOME=/opt/hadoop
MAPRED-SITE.XML_mapreduce.reduce.env=HADOOP_MAPRED_HOME=/opt/hadoop
YARN-SITE.XML_yarn.resourcemanager.hostname=resourcemanager
YARN-SITE.XML_yarn.nodemanager.pmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.delete.debug-delay-sec=600
YARN-SITE.XML_yarn.nodemanager.vmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.aux-services=mapreduce_shuffle
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-applications=10000
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-am-resource-percent=0.1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.queues=default
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.user-limit-factor=1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.maximum-capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.state=RUNNING
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_submit_applications=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_administer_queue=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.node-locality-delay=40
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings=
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings-override.enable=false

docker-compose.yml:

version: "3" services: namenode: image: apache/hadoop:${HADOOP_VERSION} hostname: namenode command: ["hdfs", "namenode"] ports: - 50070:50070 - 9870:9870 env_file: - config environment: ENSURE_NAMENODE_DIR: "/tmp/hadoop-hadoop/dfs/name" datanode: image: apache/hadoop:${HADOOP_VERSION} command: ["hdfs", "datanode"] ports: - 9864:9864 links: - namenode env_file: - config resourcemanager: image: apache/hadoop:${HADOOP_VERSION} hostname: resourcemanager command: ["yarn", "resourcemanager"] ports: - 8088:8088 env_file: - config volumes: - ./testdata:/opt/testdata nodemanager: image: apache/hadoop:${HADOOP_VERSION} command: ["yarn","nodemanager"] ports: - 8042:8042 links: - resourcemanager - namenode env_file: - config 

I haven't written any custom Dockerfile. I guess this YAML can be improved with healthchecks. Anyway, JMX is already exposed.
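
For illustration, a healthcheck could be added per service along these lines (a sketch only; it assumes curl is available inside the apache/hadoop image, and the port should match the service, e.g. 9870 for the namenode web UI):

    healthcheck:
      # Consider the service healthy once its web endpoint responds
      test: ["CMD", "curl", "-f", "http://localhost:9870"]
      interval: 10s
      timeout: 5s
      retries: 30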

@yug-rajani yug-rajani requested a review from a team as a code owner March 4, 2022 08:39
hadoop:
  build:
    context: ./Dockerfiles
    dockerfile: Dockerfile-namenode
Contributor

@yug-elastic Why do we need a custom Dockerfile? Is there something wrong with the original entry point? Correct me if I'm wrong, but JMX was already exposed there?

Contributor Author

You're right, but we are using Jolokia, which, as per our understanding, wraps JMX. That is why we have added a custom Dockerfile here, inside which we set up Jolokia and configure Hadoop with it. Does that make sense?

Contributor

Consider the docker-compose I provided you, without a custom Dockerfile.

If you go to http://localhost:9864/jmx, you can fetch all these metrics and don't have to install Jolokia, right? This is because the Metrics2 framework is exposed there.

Please check if that API differs much from Jolokia. We shouldn't force users to install extensions if there is native support.

WDYT?
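
For comparison, a quick way to inspect that native endpoint could be something like the following (a sketch; port 9864 is the datanode port from the compose file above, and the qry filter is just one example MBean pattern):

# Fetch everything the datanode's built-in JMX JSON servlet exposes
curl -s http://localhost:9864/jmx

# Filter to a single MBean pattern via the qry parameter
curl -s "http://localhost:9864/jmx?qry=Hadoop:service=DataNode,name=JvmMetrics"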

# sxSmbIUfc2SGJGCJD4I=
# -----END CERTIFICATE-----
owner:
github: elastic/integrations
Member

Please add an entry for this package in .github/CODEOWNERS.
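
For illustration, the entry could look something like this (a sketch; the team handle is taken from the owner.github field shown above and should match the package manifest):

/packages/hadoop @elastic/integrations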

Contributor Author
@yug-rajani yug-rajani Mar 7, 2022

Sure. Added it.

@yug-rajani
Contributor Author

/test

Contributor
@mtojek mtojek left a comment

It was missed in previous rounds but spotted in the Apache Spark PR, and it also applies here:

We need to split metrics into logical areas: application and cluster are fine, but jolokia_metrics doesn't refer to an area; it refers to a message channel. To be honest, we should also iterate on Cassandra, as it looks poor compared to this integration, but let's leave that for another story.

type: group
release: beta
fields:
  - name: application_metrics
Contributor

I don't understand these metrics here. They look like random data points put in the same bucket. What exactly does memory_seconds mean? Why in the same group do we have time, running containers, vcore (?), and progress?

Contributor Author

memory_seconds = The amount of memory the application has allocated
We have put these fields in the same bucket with reference to the PRD. Please check this link.

@@ -0,0 +1,72 @@
- name: hadoop.metrics
Contributor

This feedback applies to all published metrics - we need descriptions. How will the end user know what each field means?
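
For illustration, a documented field entry could look roughly like this (a sketch only; the field name, type, and wording are placeholders rather than the final mapping):

- name: hadoop.application.memory_seconds
  type: long
  description: Amount of memory the application has allocated, aggregated over its run time (MB-seconds).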

Contributor Author

Makes sense, we'll update them.

@yug-rajani
Contributor Author

It was missed in previous rounds but spotted in the Apache Spark PR, and it also applies here:

We need to split metrics into logical areas: application and cluster are fine, but jolokia_metrics doesn't refer to an area; it refers to a message channel. To be honest, we should also iterate on Cassandra, as it looks poor compared to this integration, but let's leave that for another story.

Okay, we'll need to iterate on this then. As of now, we have the following types of metrics as part of jolokia_metrics:
NameNode, DataNode, Cluster, Node Manager
Let me know what you think of these groups:
[NameNode], [DataNode], [Cluster, Expanded Cluster Metrics], [Node Manager]
The existing 'expanded_cluster_metrics' can be clubbed in, but we'll need to move its data collection approach from httpjson to the http Metricbeat module.
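
For illustration, collection via the http Metricbeat module could look roughly like this (a sketch only; the ResourceManager host, port, and API path are assumptions based on the standard YARN cluster metrics endpoint and the compose file above, not the final data stream configuration):

- module: http
  metricsets: ["json"]
  period: 60s
  hosts: ["http://resourcemanager:8088"]
  # Poll the YARN ResourceManager cluster metrics API and store the JSON under this namespace
  path: "/ws/v1/cluster/metrics"
  namespace: "cluster"
  method: "GET"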

@mtojek
Contributor

mtojek commented Mar 15, 2022

It sounds good to me, but don't hesitate to split it into smaller data streams if there are logical reasons to do this.

The existing 'expanded_cluster_metrics' can be clubbed in, but we'll need to move its data collection approach from httpjson to the http Metricbeat module.

As long as it doesn't require any development in Metricbeat, I'm good with that as well.

@yug-rajani
Contributor Author

As long as it doesn't require any development in Metricbeat, I'm good with that as well.

I don't think it would require any development in Metricbeat. What do you recommend?
Do you think keeping cluster_metrics and expanded_cluster_metrics as two separate data streams would make sense?

@mtojek
Contributor

mtojek commented Mar 15, 2022

Thanks for passing the links. There aren't too many metrics, so I would merge those into a single cluster metrics data stream. Unless it's a complex operation to merge them; in that case, let's keep them separate :)

@mtojek mtojek requested a review from ruflin March 29, 2022 08:42
@ruflin
Contributor

ruflin commented Mar 30, 2022

Based on the previous discussion, it seems some refactoring on the metrics side is still needed. I would recommend the same approach for this PR as discussed in #2811 (comment).

@ruflin
Contributor

ruflin commented Mar 31, 2022

Same question as for the other PRs: which one should we start reviewing as the foundation? Let's put the others in draft.

title: "Hadoop"
version: 0.1.0
license: basic
description: "This Elastic integration collects metrics from hadoop."
Contributor
@akshay-saraswat akshay-saraswat Apr 27, 2022

Suggested description: "Collect metrics from Apache Hadoop with Elastic Agent."
Let's be consistent with other integrations. This description shows up as a summary of the integration tile in the integrations UI. Otherwise, this PR looks good to me.

Contributor Author

Thanks for the approval, @akshay-saraswat!
Makes sense, I have made the description consistent as a part of this PR.
Quick reference: https://github.com/elastic/integrations/pull/2953/files#diff-b6b7daea47c22ecde7384eb883e3f257488dc5ce3bae2bc1ecbdd7097b93df7aR6
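
For reference, the updated manifest.yml line would read along these lines (a sketch based on the suggestion above):

description: "Collect metrics from Apache Hadoop with Elastic Agent."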

@yug-rajani
Contributor Author

Closing this PR as it was split up into multiple PRs as discussed in the comment #2614 (comment). All the parts are now merged and the linked issue (#1543) has been closed.

Thanks a lot @mtojek, @akshay-saraswat and @lalit-satapathy for taking the time to review the PRs and providing valuable feedback!

@yug-rajani yug-rajani closed this May 12, 2022