Databricks for Data Engineering
Build Fast and Reliable Data Pipelines
As companies set their sights on making data-driven decisions or automating business processes with cutting-edge technologies such
as machine learning and artificial intelligence, mastering data engineering is an essential step: it puts in place the infrastructure needed
to operationalize the data pipelines that run analytics against a growing volume of data from multiple sources. The key to success
for a data engineer is having the right technologies and tools for mission-critical data cleansing, transformation, and manipulation,
which make business use cases such as real-time dashboards or fraud detection possible.
Better Data Engineering with Databricks
Founded by the team who created Apache Spark™, Databricks provides a Unified Analytics Platform that accelerates
innovation by unifying data science, engineering, and business. With Databricks, data engineers can securely and reliably
deploy production data pipelines with ease.
The Unified Analytics Platform
Increases Data Science Productivity by 5x
Eliminates Disparate Tools with Optimized Spark
Accelerates & Simplifies Data Prep for Analytics
Removes Devops & Infrastructure Complexity
Open Extensible Platform
+ more
[Figure: The Unified Analytics Platform. Layers shown: DATABRICKS COLLABORATIVE WORKSPACE (Explore Data, Train Models, Serve Models); DATABRICKS RUNTIME (Production Jobs, Optimized IO); DATABRICKS DELTA (Data Reliability, Automated Performance); DATABRICKS SERVERLESS; DATABRICKS ENTERPRISE SECURITY. Data sources shown: IoT / STREAMING DATA, CLOUD STORAGE, DATA WAREHOUSES, HADOOP STORAGE.]
Accelerate Performance with Databricks Runtime, Built on Apache Spark
DATABRICKS IO
Leverages a vertically integrated stack to optimize the I/O layer and processing layer to significantly improve the performance of
Spark in the cloud.
DATABRICKS SERVERLESS
A serverless architecture that democratizes infrastructure through the auto-configuration and scaling of compute resources,
enabling best-in-class performance at dramatically lower costs.
FULLY MANAGED IN THE CLOUD
A cloud-native platform that abstracts the complexities of big data infrastructure, resulting in a highly elastic, reliable and
performant platform to build innovative products.
The Fastest Big Data Platform in the Cloud
5X FASTER THAN VANILLA APACHE SPARK ON AWS
Spark on Databricks: 11,674. Spark on AWS: 53,783.
Runtime total on 104 queries (secs, lower is better).
8X FASTER THAN APACHE PRESTO ON AWS
Spark on Databricks: 35.3. Presto on AWS: 293.
Runtime geomean on 62 queries (secs, lower is better).
3X FASTER THAN ON-PREMISES IMPALA VIA CLOUDERA
Spark on Databricks: 1,149,264. Cloudera Impala: 3,331,440.
Runtime total on 77 Impala queries, normalized by CPU cores (CPU time, lower is better).
Read the blog: databricks.com/cloud-benchmarks
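The headline multipliers can be checked against the chart figures above. The following is a minimal Python sketch, not part of the benchmark itself: the `speedup` helper and the dictionary labels are illustrative names, and the numbers are copied from the chart as printed.

```python
# Benchmark figures as printed in the chart (lower is better).
# Each entry is (baseline system, Spark on Databricks).
benchmarks = {
    "Spark on AWS, total secs on 104 queries": (53_783, 11_674),
    "Presto on AWS, geomean secs on 62 queries": (293, 35.3),
    "Cloudera Impala, normalized CPU time on 77 queries": (3_331_440, 1_149_264),
}


def speedup(baseline, databricks):
    """Return how many times larger the baseline figure is."""
    return baseline / databricks


for name, (baseline, databricks) in benchmarks.items():
    ratio = speedup(baseline, databricks)
    print(f"{name}: {ratio:.1f}x (rounds to {round(ratio)}x)")
```

Rounded to whole numbers, the three ratios match the 5X, 8X, and 3X headlines.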
Streamline Processes from ETL to Production
PRODUCTION WORKFLOWS
A unified platform that streamlines end-to-end workflows from data ingest and ETL, to data exploration and model building, to
productionizing models and data products.
UNIFYING ALL ANALYTICS
Move seamlessly across various types of analytics including batch, ad hoc, machine learning, deep learning, stream processing, and graph.
ROBUST INTEGRATIONS
Plug into a wide variety of AWS tools and data stores with built-in connectors and integrate with other data engineering services to
facilitate CI/CD with comprehensive APIs.
"We were able to reduce data processing time from 48 hours to 45 minutes with Databricks."
– Dennis Vallinga, Business Analyst, Shell
Databricks Enterprise Security
STRONG DATA ENCRYPTION
Benefit from best-in-class data protection at rest and in motion.
INTEGRATED IDENTITY MANAGEMENT
Seamless integration with enterprise identity providers via SAML 2.0 and Active Directory.
ROLE-BASED ACCESS CONTROLS
Fine-grained management access to every component of the enterprise data infrastructure, including files, clusters, code,
application deployments, and dashboards.
MONITOR AND AUDITING
Tap into comprehensive audit logs to monitor and troubleshoot issues.
COMPLIANCE STANDARDS
Databricks has successfully completed SOC 2 Type 2 certification and can offer a HIPAA-compliant solution.
Our Spark Expertise is our Edge
SUPPORT
Unparalleled Apache Spark support by the creators of Apache Spark.
SERVICES
Faster innovations with Databricks and Spark with solution architecting and workload optimization services.
ALWAYS AVAILABLE
Around-the-clock coverage to ensure problems are resolved quickly, with response times as fast as one hour for production tier support.
ENGINEER RESOURCES
Online library of documentation, best practices, user guides, and other technical resources.
Lower TCO
BETTER PERFORMANCE
Performance-tuned clusters allow you to complete jobs in a shorter time, reducing cloud compute costs.
FULLY-MANAGED CLUSTERS
Further reduce costs by avoiding the time-consuming tasks to build, configure, and maintain complex Spark infrastructure.
PAY FOR ONLY WHAT YOU USE
Billing up to the nearest second keeps your costs down.
PRICED FOR DATA ENGINEERING
Lower price point for data engineering production workloads.
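To see why per-second billing (PAY FOR ONLY WHAT YOU USE) can lower costs, here is a rough Python sketch comparing it with billing by the started hour. Everything in it is an assumption for illustration: the hourly rate is hypothetical, the function names are made up, and the rounding rule (up to the next whole second) is one plausible reading of "nearest second", not a statement of actual Databricks pricing.

```python
import math


def cost_per_second_billing(runtime_seconds, hourly_rate):
    """Charge for the runtime rounded up to the next whole second (assumed rule)."""
    return math.ceil(runtime_seconds) * hourly_rate / 3600


def cost_per_hour_billing(runtime_seconds, hourly_rate):
    """Charge for every started hour, shown only for comparison."""
    return math.ceil(runtime_seconds / 3600) * hourly_rate


runtime = 45 * 60      # a 45-minute job, in seconds
hypothetical_rate = 1.00  # hypothetical price per cluster-hour, not a real rate

print(cost_per_second_billing(runtime, hypothetical_rate))
print(cost_per_hour_billing(runtime, hypothetical_rate))
```

Under these assumptions the 45-minute job is charged for three quarters of an hour with per-second billing, versus a full hour when every started hour is billed.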
Data Engineering, Simplified
Databricks’ Unified Analytics Platform removes the complexity of data engineering while accelerating performance of data engineering tasks
from data access to ETL, allowing engineers to build fast and reliable data pipelines more easily to support the business.
Get started with Databricks for data engineering today with a free trial.
© Databricks 2018. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.