Developing High Frequency Indicators Using Real-Time Tick Data on Apache Superset and Druid CBRT Big Data Team Emre Tokel, Kerem Başol, M. Yağmur Şahin Zekeriya Besiroglu / Komtas Bilgi Yonetimi 21 March 2019 Barcelona
Agenda
1. WHO WE ARE: CBRT & Our Team
2. HIGH FREQUENCY INDICATORS: Importance & Goals
3. PROJECT DETAILS: Before, Test Cluster, Phase 1-2-3, Prod Migration
4. CURRENT ARCHITECTURE: Apache Kafka, Spark, Druid & Superset
5. WORK IN PROGRESS: Further analyses
6. FUTURE PLANS
Who We Are 1
Our Solutions Data Management • Data Governance Solutions • Next Generation Analytics • 360 Engagement • Data Security Analytics • Data Warehouse Solutions • Customer Journey Analytics • Advanced Marketing Analytics Solutions • Industry-specific analytic use cases • Online Customer Data Platform • IoT Analytics • Analytic Lab Solution Big Data & AI • Big Data & AI Advisory Services • Big Data & AI Accelerators • Data Lake Foundation • EDW Optimization / Offloading • Big Data Ingestion and Governance • AI Implementation – Chatbot • AI Implementation – Image Recognition Security Analytics • Security Analytic Advisory Services • Integrated Law Enforcement Solutions • Cyber Security Solutions • Fraud Analytics Solutions • Governance, Risk & Compliance Solutions
• +20 years in IT, +18 in DB & DWH • +7 in Big Data • Lead Architect, Big Data/Analytics @KOMTAS • Instructor & Consultant • Big Data Instructor at ITU, MEF, Şehir Uni. • Certified R programmer • Certified Hadoop Administrator
Our Organization § The Central Bank of the Republic of Turkey is primarily responsible for steering the monetary and exchange rate policies in Turkey. o Price stability o Financial stability o Exchange rate regime o The privilege of printing and issuing banknotes o Payment systems
M. Yağmur Şahin • Big Data Engineer
Emre Tokel • Big Data Engineer
Kerem Başol • Big Data Team Leader
High Frequency Indicators 2
Importance and Goals § To observe foreign exchange markets in real-time o Are there any patterns tied to specific time intervals during the day? o Is there anything to observe before/after local working hours? o What does the difference between bid/ask prices tell us? § To be able to detect risks and take necessary policy measures in a timely manner o Developing liquidity and risk indicators based on real-time tick data o Visualizing observations for decision makers in real-time o Finally, discovering possible intraday seasonality § Wouldn’t it be great to be able to correlate with news flow as well?
Project Details 3
Project Timeline: Test Cluster → Phase 1 → Phase 2 → Phase 3 → Prod Migration → Next Phases
Test Cluster § Our first big data studies started on very humble servers o 5 servers with 32 GB RAM each o 3 TB storage § HDP 2.6.0.3 installed o Not the latest version even back then § Technical difficulties o Performance problems o Apache Druid indexing o Apache Superset maturity
Pipeline: TREP API → Apache Kafka → Apache NiFi → MongoDB → Apache Zeppelin & Power BI
Thomson Reuters Enterprise Platform (TREP) § Thomson Reuters provides its subscribers with an enterprise platform through which they can collect market data as it is generated § Each financial instrument on TREP has a unique code called a RIC (Reuters Instrument Code) § The event queue implemented by the platform can be consumed with the provided Java SDK § We developed a Java application that consumes this event queue to collect tick data for the required RICs
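What that application does per update can be sketched as follows (in Python for brevity; the field map and the BID/ASK keys are illustrative stand-ins, since the real SDK delivers its own field identifiers):

```python
import json
from datetime import datetime, timezone

def tick_to_json(ric, fields):
    """Shape one TREP update into the JSON document published downstream.

    `fields` stands in for the field map the SDK callback delivers;
    the "BID"/"ASK" keys here are illustrative, not real field IDs.
    """
    return json.dumps({
        "ric": ric,
        "bid": float(fields["BID"]),
        "ask": float(fields["ASK"]),
        "received_at": datetime.now(timezone.utc).isoformat(),
    })
```

Keeping the document flat like this makes it equally easy to publish to Kafka and to persist in MongoDB later in the flow.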
Apache Kafka § The data flow is very fast and quite dense o We published the messages containing tick data collected by our Java application to a message queue o Twofold analysis: batch and real-time § We decided to use Apache Kafka, residing on our test big data cluster § We created a topic on Apache Kafka for each RIC and published data to the related topics
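A minimal sketch of the per-RIC topic routing (the naming scheme is ours and purely illustrative; the commented lines show how a client such as kafka-python would publish to the derived topic):

```python
def topic_for_ric(ric):
    """Derive a Kafka topic name from a RIC.

    Characters such as '=' that appear in RICs are stripped, since
    Kafka restricts the characters allowed in topic names
    (the "ticks." prefix is an illustrative convention).
    """
    return "ticks." + ric.replace("=", "").lower()

# Publishing a tick would then look like (kafka-python, sketch only):
# producer = KafkaProducer(bootstrap_servers="broker:9092")
# producer.send(topic_for_ric("TRY="), payload.encode("utf-8"))
```

One topic per RIC keeps per-instrument ordering guarantees and lets downstream consumers subscribe only to the instruments they need.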
Apache NiFi § To manage the flow, we decided to use Apache NiFi § We used the ConsumeKafka processor to consume messages from Kafka topics § The NiFi flow was designed to persist the data to MongoDB
Our NiFi Flow
MongoDB § We had already prepared the data in JSON format in our Java application § Since MongoDB is installed on our enterprise systems, we decided to persist this data to MongoDB § Although MongoDB is not part of HDP, it seemed a good choice for our researchers to use in their analyses
Apache Zeppelin § We provided our researchers with access to Apache Zeppelin and a connection to MongoDB via Python § By doing so, we offered an alternative to the tools on local computers and provided a unified interface for financial analysis
Business Intelligence on the Client Side § Our users had to download daily tick data manually from their Thomson Reuters terminals and work in Excel § Users were then able to access tick data using Power BI o We also provided our users with a news timeline along with the tick data
We needed more! § We had to visualize the data in real-time o Analysis on persisted data using MongoDB, PowerBI and Apache Zeppelin was not enough
Pipeline: TREP API → Apache Kafka → Apache Druid → Apache Superset
Apache Druid § We needed a database able to: o Answer ad-hoc (slice/dice) queries over a limited window efficiently o Store historic data and seamlessly integrate current and historic data o Provide native integration with possible real-time visualization frameworks (preferably from the Apache stack) o Provide native integration with Apache Kafka § Apache Druid addressed all of these requirements § The indexing task was handled with Tranquility
Apache Superset § Apache Superset was the obvious alternative for real-time visualization since tick-data was stored on Apache Druid o Native integration with Apache Druid o Freely available on Hortonworks service stack § We prepared real-time dashboards including: o Transaction Count o Bid / Ask Prices o Contributor Distribution o Bid - Ask Spread
We needed more, again! § Reliability issues with Druid § Performance issues § Enterprise integration requirements
Architecture (overview diagram): Data Sources (Internet Data, Enterprise Content, Social Media/Media, Micro Level Data, Commercial Data Vendors) → Ingestion → Big Data Platform → Data Science, with Governance across the platform
Components: TREP API, Apache Kafka, Apache Hive + Druid Integration, Apache Spark, Apache Superset
Apache Hive + Druid Integration § After setting up our production environment (on HDP 3.0.1.0) and starting to feed data, we realized that the data were scattered and we were missing the option to co-utilize these different data sources § We then realized that Apache Hive already provides Kafka & Druid indexing through simple table creation, along with a facility for querying Druid from Hive
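That table creation can be sketched as below; the storage handler class ships with Hive on HDP 3, but the column list and the exact TBLPROPERTIES keys are assumptions to check against your Hive version (shown as a small Python helper that builds the DDL string):

```python
def druid_kafka_table_ddl(table, topic, brokers):
    """Build the Hive DDL for a Druid-backed table fed from a Kafka
    topic via Hive's DruidStorageHandler.

    Column names and TBLPROPERTIES keys are illustrative assumptions;
    verify them against the Hive version in use.
    """
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        "  `__time` TIMESTAMP, `ric` STRING, `bid` DOUBLE, `ask` DOUBLE)\n"
        "STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'\n"
        "TBLPROPERTIES (\n"
        f'  "kafka.bootstrap.servers" = "{brokers}",\n'
        f'  "kafka.topic" = "{topic}",\n'
        '  "druid.kafka.ingestion" = "START")'
    )
```

Once such a table exists, starting the ingestion turns Hive into the single place where both the Druid real-time data and other Hive tables can be queried together.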
Apache Spark § Due to additional calculation requirements from our users, we decided to utilize Apache Spark § With Apache Spark 2.4, we used the Spark Streaming and Spark SQL contexts together in the same application § In our Spark application o Every 5 seconds, a 30-second window is created o On each window, outlier boundaries are calculated o Outlier data points are detected
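The per-window logic can be sketched in plain Python (the deck does not name the outlier rule, so the mean ± k·σ boundaries below are an assumption; in the Spark application the same computation would run on each 30-second window):

```python
from statistics import mean, stdev

def outlier_bounds(prices, k=3.0):
    """Boundaries for one window: mean +/- k sample standard deviations.

    The actual rule used in production is not stated on the slide;
    this is one common choice.
    """
    m, s = mean(prices), stdev(prices)
    return m - k * s, m + k * s

def find_outliers(prices, k=3.0):
    """Return the data points falling outside the window's boundaries."""
    lo, hi = outlier_bounds(prices, k)
    return [p for p in prices if p < lo or p > hi]
```

The flagged points are what gets published to the windowed Kafka topic and surfaced on the outlier dashboard.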
Current Architecture 4
Current Architecture & Progress So Far
TREP Event Queue → consumed by Java Application → published to Kafka Topic (real-time) → Druid Datasource (real-time) → Superset Dashboard (tick data)
Kafka Topic (real-time) → consumed by Spark Application → published to Kafka Topic (windowed) → Druid Datasource (windowed) → Superset Dashboard (outlier)
TREP Data Flow
Windowed Spark Streaming
Tick-Data Dashboard
Outlier Dashboard
Work in Progress 5
Implementing… § Moving average calculation (20-day window) § Volatility Indicator § Average True Range Indicator (moving average of the true range, the largest of): o [ max(t) - min(t) ] o [ | max(t) - close(t-1) | ] o [ | min(t) - close(t-1) | ]
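A plain-Python sketch of these calculations (the window length and the three true-range components come from the slide; the function names are ours):

```python
from statistics import mean

def true_range(high, low, prev_close):
    """True range of one bar: the largest of the three spreads listed
    above (bar range, and the gaps from the previous close)."""
    return max(high - low, abs(high - prev_close), abs(prev_close - low))

def moving_average(values, window=20):
    """Simple trailing moving average (20-day window on the slide);
    the ATR is this average applied to a series of true ranges."""
    return [mean(values[i - window + 1:i + 1])
            for i in range(window - 1, len(values))]
```

For example, applying `moving_average` to `[true_range(h, l, pc) for h, l, pc in bars]` with `window=20` yields the 20-day ATR series.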
Future Plans 6
To-Do List § Matching data subscription § Bringing historical tick data into real-time analysis § Possible use of machine learning for intraday indicators
Thank you! Q & A
