Building a data pipeline to ingest data into Hadoop in minutes using StreamSets Data Collector
Guglielmo Iozzia, Big Data Infrastructure Engineer @ IBM Ireland
Data Ingestion for Analytics: a real scenario
In the business area (cloud applications) my team belongs to, there were many questions to be answered, related to:
● Defect analysis
● Outage analysis
● Cyber-Security
“Data is the second most important thing in analytics”
Data Ingestion: multiple sources...
● Legacy systems
● DB2
● Lotus Domino
● MongoDB
● Application logs
● System logs
● New Relic
● Jenkins pipelines
● Testing tools output
● RESTful Services
… and so many tools available to get the data
What are we going to do with all that data?
Issues
● The need to collect data from multiple sources introduces redundancy, which costs additional disk space and increases query times.
● A small team.
● Lack of skills and experience across the team (and the business area in general) in managing Big Data tools.
● Low budget.
Alternatives #1 Panic
Alternatives #2 Cloning team members
Alternatives #3 Find a smart way to simplify the data ingestion process
A single tool needed...
● Design complex data flows with minimal coding and maximum flexibility.
● Provide real-time data flow statistics and metrics for each flow stage.
● Automated error handling and alerting.
● Easy to use by everyone.
● Zero downtime when upgrading the infrastructure, thanks to the logical isolation of each flow stage.
● Open Source
… something like this
StreamSets Data Collector
StreamSets Data Collector
StreamSets Data Collector: supported origins
StreamSets Data Collector: available destinations
StreamSets Data Collector: available processors
● Base64 Field Decoder
● Base64 Field Encoder
● Expression Evaluator
● Field Converter
● JavaScript Evaluator
● JSON Parser
● Jython Evaluator
● Log Parser
● Stream Selector
● XML Parser
...and many others
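The scripting processors make custom transformations possible without developing a new stage. A minimal sketch of a Jython Evaluator script (the Data Collector binds records, output and error into the script's scope; the 'message' field and the outage flag are hypothetical examples):

    # Script body for a Jython Evaluator stage (no imports needed: the
    # stage injects `records`, `output` and `error` into the scope).
    # Assumption: each record carries a 'message' field with a raw log line.
    for record in records:
        try:
            msg = record.value['message']
            # Flag outage-related lines so a Stream Selector can route them.
            record.value['is_outage'] = 'outage' in msg.lower()
            output.write(record)          # pass the record downstream
        except Exception as e:
            error.write(record, str(e))   # route the record to error handling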
StreamSets Data Collector Demo
StreamSets DC: performance and reliability
● Two execution modes available: standalone or cluster
● Implemented in Java, so any performance best practice/recommendation for Java applications applies here
● REST services available for performance monitoring (a sketch follows)
● Rules and alerts (both metric rules and data rules)
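A minimal sketch of querying the monitoring REST services from Python, assuming a local Data Collector on the default port 18630 with the default admin/admin account and a hypothetical pipeline ID 'logs_to_hadoop' (exact endpoint paths can vary by SDC version; they are browsable from the RESTful API link in the UI):

    import requests

    BASE = 'http://localhost:18630/rest/v1'
    AUTH = ('admin', 'admin')  # default credentials: replace in production

    # List the pipelines known to this Data Collector instance.
    pipelines = requests.get(BASE + '/pipelines', auth=AUTH).json()

    # Fetch live metrics (record counts, batch timings, ...) for one pipeline.
    metrics = requests.get(BASE + '/pipeline/logs_to_hadoop/metrics',
                           auth=AUTH).json()
    print(metrics['counters'])  # assumes the metrics payload exposes counters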
StreamSets Data Collector: security
● User accounts can be authenticated against LDAP (a configuration sketch follows)
● Authorization: the Data Collector provides several roles (admin, manager, creator, guest)
● Kerberos authentication can be used to connect to origin and destination systems
● Follow the usual security best practices (iptables, networking, etc.) for Java web applications running on Linux machines
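A configuration sketch for LDAP authentication, based on the SDC documentation of this era (the hostname, DNs and the group-to-role mapping are placeholders; verify the property names against the docs for your version):

    # $SDC_CONF/sdc.properties -- switch the login module from file to LDAP
    http.authentication=form
    http.authentication.login.module=ldap
    # Map LDAP groups to Data Collector roles
    http.authentication.ldap.role.mapping=engineers:admin;analysts:creator

    # $SDC_CONF/ldap-login.conf -- JAAS entry pointing at the LDAP server
    ldap {
      org.eclipse.jetty.jaas.spi.LdapLoginModule required
      contextFactory="com.sun.jndi.ldap.LdapCtxFactory"
      hostname="ldap.example.com"
      port="389"
      bindDn="cn=sdc,ou=services,dc=example,dc=com"
      bindPassword="change-me"
      userBaseDn="ou=people,dc=example,dc=com"
      roleBaseDn="ou=groups,dc=example,dc=com";
    };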
Useful Links
StreamSets Data Collector: https://streamsets.com/product/
Thanks!
My contacts:
LinkedIn: https://ie.linkedin.com/in/giozzia
Blog: http://googlielmo.blogspot.ie/
Twitter: https://twitter.com/guglielmoiozzia