© 2024 Cloudera, Inc. All rights reserved. Building Apache NiFi 2.0 Python Processors Tim Spann Principal Developer Advocate Feb 29, 2024
Tim Spann
Principal Developer Advocate. Princeton Future of Data Meetup. ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
Twitter: @PaasDev // Blog: datainmotion.dev
https://medium.com/@tspann
https://github.com/tspannhw
FLaNK Stack Weekly by Tim Spann
This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java and Open Source friends.
https://bit.ly/32dAJft
https://www.meetup.com/futureofdata-princeton/
Future of Data - NYC + NJ + Philly + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
Apache NiFi has emerged as a robust and flexible platform for designing data integration and flow management solutions. With the release of Apache NiFi 2.0, the community has introduced a host of new features, making it even more powerful and extensible. One exciting enhancement is the ability to create custom processors using Python, which lets you integrate Python scripts directly into your data flow.

In this talk, I will delve into the world of Apache NiFi 2.0 Python processors, exploring the capabilities they offer and demonstrating how to build custom processors to enhance your data processing pipelines. Attendees will gain a deep understanding of the integration points between NiFi and Python, enabling them to leverage the extensive libraries and frameworks available in the Python ecosystem.

– Introduction to Apache NiFi 2.0
– Python Processors Deep Dive
– Build your own custom Python Processor
– Integrating Python Libraries and Frameworks
– Debugging and Troubleshooting

By the end of this talk, participants will have a comprehensive understanding of building and optimizing Apache NiFi 2.0 Python processors, enabling them to integrate Python seamlessly into their data processing workflows. This session is suitable for data engineers, architects, and anyone interested in harnessing the combined power of Apache NiFi and Python for efficient data integration and flow management.

Let’s enhance real-time streaming pipelines with smart Python code, adding code for vector databases and LLMs.
The event is a series of pre-recorded videos, broadcast on our YouTube channel. Your talk can go from 15 minutes (lightning talk), through 30 min (regular talk), up to 60+ min (in-depth/with demo). We are open to any type of session.

Once you are approved as a speaker, we will ask you to record your talk. You can use any available recording tool, although we recommend OBS Studio (open source and free, compatible with macOS, Windows, and Linux). We’ll provide you with a tutorial for your convenience (set everything up in under 15 minutes).

If you’d like to deliver a workshop, you can present it at a conf42 event. Due to the pre-recorded nature of our events, you will need to transform it into a hands-on tutorial.

Please avoid submitting the exact same talk that has already been presented at a past conf42. Instead, we encourage you to provide a continuation of your previous content (2.0 version) or offer a fresh new perspective on the same topic. For example, if you previously made a theoretical overview of a tool, consider giving a hands-on demo this time.

The clarity of your voice is the most important technical aspect of a talk; lots of people will listen to it as a podcast / in the background. We recommend you use the best microphone available.
Generative AI
https://github.com/tspannhw/FLaNK-HuggingFace-DistilBert-SentimentAnalysis
https://github.com/tspannhw/FLaNK-LLM
watsonx.ai
LLM Use Case
Data in Motion on Cloudera Data Platform (CDP): capture, process & distribute any data, anywhere
[Diagram labels: Structured Sources, Applications/APIs, Streams, Unstructured file types, Other enterprise data, Materialized Views, Open Data Lakehouse, Vector DB, AI Model]
Common Families of Generative AI Capability
● Text-to-Image: AI-assisted design, e.g. DALL-E, Midjourney
● Text-to-Speech: interactive voice response
● Text-to-Text: conversational chatbots, e.g. ChatGPT, Falcon, LLaMA
Rapid Innovation in the LLM Space
Too much to cover today, but you should know the common LLMs, frameworks, and tools.

Notable LLMs
● Closed models: GPT3.5, GPT4, Claude2
● Open models: Llama2, Code Llama, Mistral7B, Mixtral8x7B, ++ 100s more… check out the HuggingFace LLM Leaderboard (pretrained, domain fine-tuned, chat models, …)

Popular LLM Frameworks
LangChain is a framework for developing apps powered by LLMs
● Python and JavaScript libraries
● Provides modules for LLM interface, retrieval, & agents
LlamaIndex is a framework designed specifically for RAG apps
● Python and JavaScript libraries
● Provides built-in optimizations / techniques for advanced RAG
When to use one over the other? Use LangChain if you need a general-purpose framework with flexibility and extensibility. Consider LlamaIndex if you’re building a RAG-only app (retrieval/search).

Open Community & Open Models
HuggingFace is an ML community for hosting & collaborating on models, datasets, and ML applications
● Latest open source LLMs are in HuggingFace
● + great learning resources / demos
https://huggingface.co/

Some common Vector DBs
● Open Source vs Self Hosted vs SaaS option
Cloudera Generative AI Stack
[Diagram layers: Applications, Closed-Source Foundation Models, Model Hubs, Open Source Foundation Models, Fine-Tuned Models, Private Vector Store, Managed Vector Store, Cloud Infrastructure, Specialized Hardware]
[Diagram components: Applied Machine Learning Prototypes (AMPs); APIs: OpenAI (GPT-4 Turbo); Amazon Bedrock: Anthropic (Claude 2), Cohere…; Hugging Face; Meta (Llama 2); Milvus, Solr*; Pinecone; Open Data Lakehouse]
[Diagram stages: Data Wrangling, Real-Time Data Ingest & Routing, AI Model Training & Inference / Serving, Data Store & Visualization]
Generative AI: Apache NiFi and Real-Time GenAI
WatsonX.AI Granite LLM, NiFi, Kafka & Flink: Architecture in the context of Travel Advisories
[Diagram labels: Source, Source, DataFlow / NiFi, Kafka topics, Flink SQL w/ SSB, Machine learning, Database, Lakehouse, Data Viz, Monitoring, Alerting]
[Diagram, hybrid cloud data flow. Sources: Live Q&A, Travel Advisories, Weather Reports, Documents, Social Media, Databases, Transactions, Public Data Feeds, S3 / Files, Logs, ATM Data, Live Chat, …]
[Stages: Interact, Collect, Store, Enrich, Report; Distribute, Visualize, Automate, Predict; AI-based enhancements]
[Components: Data Flow, Messaging Broker, SQL Stream Builder, Data Warehouse, Vector Database, LLM, Machine Learning, Data Visualization, Real-time alerting, Aggregations]
[Data labels: Input Sentences, Generated Text, Timestamps, Enrichments]
ReadyFlow Gallery
Leverage pre-built flow templates to quickly customize and deploy new data flows
Cloudera + LLMs
[Diagram labels: Knowledge Repository, Data Storage / Management, Data Preparation, Data Engineering, LLM Fine Tuning Process, Training Framework, LLM Serving, Serving Framework, Streaming, Classification, Real-Time Model Deployment]
[Products: CML, CDE, CDP, CDF, Vector DB]
[Key: CPU Task, GPU Task]
NiFi 2.0.0 Features
● Python Integration
● Parameters
● JDK 21+
● JSON Flow Serialization
● Rules Engine for Development Assistance
● Run Process Group as Stateless
● flow.json.gz
https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals
Python Processors
Basics
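The Basics slides carry no text in this transcript, so the following is only a minimal sketch of what a NiFi 2.0 Python processor can look like, not the code shown in the talk. It assumes the `FlowFileTransform` style of processor from NiFi's `nifiapi` package. The processor name `UppercaseText`, its behavior, and the `uppercase.length` attribute are hypothetical, and the fallback classes in the `except` branch are stand-ins added here so the sketch can be exercised outside NiFi.

```python
try:
    # Available when the file is loaded by NiFi's Python extension framework.
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    # Hypothetical stand-ins (not part of NiFi) for running the sketch locally.
    class FlowFileTransform:
        pass

    class FlowFileTransformResult:
        def __init__(self, relationship, attributes=None, contents=None):
            self.relationship = relationship
            self.attributes = attributes
            self.contents = contents


class UppercaseText(FlowFileTransform):  # hypothetical processor name
    class Java:
        # Tells NiFi which Java-side interface this Python class implements.
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '0.0.1-SNAPSHOT'
        description = 'Upper-cases the FlowFile content (illustrative only).'
        tags = ['example', 'text']

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        # Read the incoming FlowFile content and decode it as UTF-8 text.
        text = flowfile.getContentsAsBytes().decode('utf-8')
        # Route to "success" with new content and one added attribute.
        return FlowFileTransformResult(
            relationship='success',
            contents=text.upper(),
            attributes={'uppercase.length': str(len(text))},
        )
```

Inside NiFi, a file like this would be placed in the Python extensions directory so the framework can discover it; outside NiFi, `transform` can be called with any object that provides `getContentsAsBytes()`.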
Extract Company Names
● Python 3.10+
● HuggingFace, NLP, spaCy, PyTorch
https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
Get Compound GTFS Data
● Python 3.10+
● GTFS to JSON
https://github.com/tspannhw/FLaNK-python-processors/blob/main/GetGTFSCompoundFeed.py
Extract Text from Web VTT
● Python 3.10+
● Web VTT to Text
● Web Video Text Tracks Format Extractor
https://developer.mozilla.org/en-US/docs/Web/API/WebVTT_API
https://github.com/tspannhw/FLaNK-python-processors/blob/main/TranslateWebVTT.py

    WEBVTT

    1
    00:00:06.066 --> 00:00:07.166
    Now let's talk about

    2
    00:00:07.166 --> 00:00:12.033
    data retrieval, views, and materialized views.
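To make the Web VTT to text step concrete, here is a rough standard-library sketch of the extraction. It is not the code from TranslateWebVTT.py; the function name `webvtt_to_text` and the regular expression are assumptions made for illustration. It keeps only the lines that follow a cue timing line and skips the `WEBVTT` header, cue identifiers, and blank lines. Inline cue tags are left untouched.

```python
import re

# Matches a cue timing line such as "00:00:06.066 --> 00:00:07.166".
# The hours field is optional in WebVTT timestamps.
TIMING = re.compile(
    r'^(\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(\d{2,}:)?\d{2}:\d{2}\.\d{3}'
)


def webvtt_to_text(vtt: str) -> str:
    """Return the cue text of a WebVTT document joined into one string."""
    text_lines = []
    in_cue = False  # True while reading the payload lines of a cue
    for raw in vtt.splitlines():
        line = raw.strip()
        if not line:
            in_cue = False  # a blank line ends the current cue
            continue
        if TIMING.match(line):
            in_cue = True  # payload starts on the next line
            continue
        if in_cue:
            text_lines.append(line)
    return ' '.join(text_lines)


# The sample cues from the slide.
SAMPLE = """WEBVTT

1
00:00:06.066 --> 00:00:07.166
Now let's talk about

2
00:00:07.166 --> 00:00:12.033
data retrieval, views, and materialized views.
"""

print(webvtt_to_text(SAMPLE))
# prints: Now let's talk about data retrieval, views, and materialized views.
```

In a processor, a function like this would be called on the decoded FlowFile content and its return value written back as the new content.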
WatsonX SDK To Foundation
● Python 3.10+
● LLM
● WatsonX.AI Foundation Models
● Inference
● Secure
● Official SDK from IBM
https://github.com/tspannhw/FLaNK-python-watsonx-processor
System / Process Monitoring
● Python 3.10+
● psutil
● Swap memory, disk, networks
Generate Synthetic Records w/ Faker
● Python 3.10+
● faker
● Choose as many as you want
● Attribute output
Download a Wiki Page as HTML or WikiFormat (Text)
● Python 3.10+
● Wikipedia-api
● HTML or Text
● Choose your wiki page dynamically
Get GTFS Data
● Python 3.10+
● GTFS from Transit URL
● Alerts, Trip Updates or Vehicle Positions
● Returns JSON
● google.transit and google.protobuf
Other Python Processors
● Updated Pinecone (Vector DB Interface)
● ChunkDocument, ParseDocument
● ConvertCSVtoExcel
● DetectObjectInImage
● PromptChatGPT
● PutChroma, QueryChroma (Vector DB Interface)
DEMO
Thank You

Conf42-Python-Building Apache NiFi 2.0 Python Processors
