
Loading data into Apache Ignite

  • 1. Loading data into Apache Ignite. Stephen Darlington, 01 May 2019. © 2019 GridGain Systems
  • 2. Apache Ignite In-Memory Computing Platform [architecture diagram: Application Layer (Web, SaaS, Social, Mobile, IoT); platform APIs and components (SQL, Key-Value, Transactions, Messaging, Events, Streaming, Machine and Deep Learning, Compute Grid, Service Grid); In-Memory Data Store; Ignite Persistence; Persistent Layer (RDBMS, NoSQL, Hadoop, Mainframe)]
  • 3. How do I load data? (stock photo, licensed under CC BY-SA)
  • 4. Official answer:
    1. Open your IDE
    2. Create a project
    3. Edit pom.xml to include the Apache Ignite libraries
    4. Create a new class
    5. Code to open and parse the input file
    6. Boilerplate Ignite cluster code
    7. IgniteDataStreamer code
    8. Debug
    9. Edit
    10. Debug
    11. Edit
    12. Debug
    13. Run
    14. Play with the resulting data
  • 5. There must be an easier way? (stock photo, licensed under CC BY-NC-ND)
  • 6. Apache Ignite In-Memory Computing Platform [the same architecture diagram, with the SQL component called out]
  • 7. Using SQL
  • 8. But it gets complicated…
  • 9. Apache Ignite In-Memory Computing Platform [the same architecture diagram, with the Key-Value, SQL and Streaming components called out]
  • 10. Using Python
  • 11. Apache Ignite In-Memory Computing Platform [the same architecture diagram, with the SQL and Streaming components called out]
  • 12. Using Apache Spark
  • 13. Using Apache Spark
  • 14. What did we learn?
    • Many options
      – Python, Spark, SQL
      – Scala
      – Groovy
      – Node.js
    • No one “best” answer
    • REPLs are awesome
      – …and can be used for a lot more than just loading data
  • 15. Resources
    • Apache Ignite documentation
      – https://apacheignite.readme.io/docs
      – https://ignite.apache.org
    • Blog
      – Loading Data into Ignite: https://link.medium.com/66dzsrWw4V
      – Python, part 1: https://link.medium.com/CUjDnzBQcW
      – Python, part 2: https://link.medium.com/3dWH1oDQcW
  • 16. And finally…
    • Get a free ticket to the In-Memory Computing Summit Europe 2019 (June 3-4) by completing this survey:
      – http://bit.ly/IMCSeu2019
    • More information here:
      – https://www.imcsummit.org/2019/eu/
  • 17. Thank you. Stephen Darlington, Senior Consultant, GridGain Systems. @sdarlington

Editor's Notes

  • #2 Inspired by trying to get up to speed with a new, shiny project. Anything data-centric, whether machine learning or SQL, needs data. I work for GridGain, the company that donated Ignite to Apache.
  • #3 Have you heard of Apache Ignite or GridGain? GridGain Systems donated the code that became the Apache Ignite project, which joined the Apache Software Foundation (ASF) in 2014 and became a top-level project, the second fastest to do so. Apache Ignite is now one of the top 5 ASF projects, and has been for two years. It's the most active in-memory computing project right now, used by thousands of companies worldwide. GridGain is the only commercially supported version: it adds integration, security, deployment, management and monitoring to the same core Ignite to help with business-critical applications. We also provide global support and services, and continue to be the biggest contributor to Ignite.
    [1] http://globenewswire.com/news-release/2019/07/09/1534470/0/en/The-Apache-Software-Foundation-Announces-Annual-Report-for-2019-Fiscal-Year.html
    [2] https://blogs.apache.org/foundation/entry/apache-in-2017-by-the
  • #4 You are probably relying on us for some part of your personal or professional life. We have several of the top 20 banks and wealth management companies as customers. If you include FinTech, 48-50 of the world’s largest banks use us indirectly (through Finastra). Some of the leading software companies rely on us for their speed and scale. Microsoft uses us for real-time cloud security detection. Workday used us to get the scale they needed to sell to Walmart, and then to be able to run their software on Amazon, for Amazon. There are some very large retail/e-commerce companies, including PayPal, HomeAway and Expedia. And several innovators across FinTech, adTech, IoT and other areas.
  • #6 Traditional databases don’t scale. Buy bigger and bigger boxes until you run out of money. Traditional compute grids have to copy data across the network, which at modern scale is just impractical. Ignite scales horizontally and sends compute to the data rather than the other way around. In memory for speed. Disk persistence for volume.
  • #7 You fired up a node and you want to play… how do you load data? Oracle has SQL*Loader. Most other legacy databases have something similar. Is there an Ignite equivalent?
  • #8 A simple 14-point process.
  • #9 Okay, I’m being facetious. That approach is good for production. For large volumes of data. For weird and wonderful data formats. But what if you want to do something quickly, preferably without firing up an IDE?
  • #10 Ignite supports ANSI-99 SQL…
  • #11 Kind of like BULK INSERT in SQL Server, or SQL*Loader in Oracle. Good news: it's built in. Bad news: it only works for CSV. Basically zero configuration:
    sqlline -u jdbc:ignite:thin://127.0.0.1
    0: jdbc:ignite:thin://127.0.0.1> COPY FROM 'file.csv' INTO tablename (col1, col2) FORMAT CSV;
  • #12 Which means you end up using horrible command-line tricks to convert data into CSV format. Here we’re using jq to convert from JSON to CSV:
    jq '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv' < file.json > file.csv
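    For reference, a rough Python equivalent of that jq one-liner (a sketch, assuming file.json holds a top-level JSON array of objects; it writes the same union-of-keys header row that the jq program produces):
    import csv
    import json

    # Load the whole JSON array; the union of keys across objects becomes the header.
    with open('file.json') as src:
        rows = json.load(src)
    cols = sorted({key for row in rows for key in row})

    # DictWriter fills in '' for any key a particular row is missing.
    with open('file.csv', 'w', newline='') as dst:
        writer = csv.DictWriter(dst, fieldnames=cols)
        writer.writeheader()
        writer.writerows(rows)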
  • #13 Python
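    A sketch of loading a CSV into Ignite from Python with the pyignite thin client (the cache name, file name and key column here are illustrative; it assumes a local node listening on the default thin-client port 10800):
    import csv

    from pyignite import Client

    # Connect to a local node over the thin-client protocol.
    client = Client()
    client.connect('127.0.0.1', 10800)

    # Key-value style: one cache entry per CSV row, keyed on one column.
    cache = client.get_or_create_cache('bookmarks')

    with open('file.csv', newline='') as src:
        for row in csv.DictReader(src):
            cache.put(row['href'], row)

    client.close()
    If round trips become the bottleneck, cache.put_all() accepts a whole dict of entries in one call.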
  • #15 Spark – kind of cheating 
  • #16 Start pyspark with a bunch of extra libraries so that it also understands Ignite. This is optimized for typing; you could also optimize for less code in memory.
    bin/pyspark --jars $IGNITE_HOME/libs/ignite-spring/*.jar,$IGNITE_HOME/libs/optional/ignite-spark/ignite-*.jar,$IGNITE_HOME/libs/*.jar,$IGNITE_HOME/libs/ignite-indexing/*.jar
  • #17 In one line we read a JSON file. It understands the structure of the file – no further coding. Filters, drop columns, etc. Functional.
    b = spark.read.format('json').load('filename.json')
    b.filter('href is not null') \
      .drop('hash', 'meta') \
      .write.format('ignite') \
      .option('config', 'default-config.xml') \
      .option('table', 'bookmarks') \
      .option('primaryKeyFields', 'href') \
      .mode('overwrite') \
      .save()
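    To check the load worked, the same Ignite data source can read the table back into a DataFrame (a sketch, reusing the table name and config file from the example above):
    # Read the freshly written table back through the Ignite Spark data source
    # and show a few rows as a sanity check.
    df = spark.read.format('ignite') \
        .option('config', 'default-config.xml') \
        .option('table', 'bookmarks') \
        .load()
    df.show(5)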