InfoQ Homepage Apache Hadoop Content on InfoQ
-
Pinterest Automates Hadoop Cluster Scaling and Migration with Internal Orchestration System
Recently, Pinterest disclosed its internal orchestration framework, called Hadoop Control Center (HCC), to automate the scaling and migration of its large-scale Hadoop clusters. This move addresses the operational complexity and limitations Pinterest previously faced when managing thousands of nodes across dozens of YARN clusters on AWS.
-
From Hadoop to Kubernetes: Pinterest’s Scalable Spark Architecture on AWS EKS
Pinterest revamped its data infrastructure by transitioning from a legacy Hadoop system to the Moka platform, leveraging Kubernetes and Spark on AWS EKS. This strategic shift enhances job isolation, simplifies deployment, and optimizes resource management, leading to reduced costs and improved efficiency.
-
Apache Hudi 1.0 Now Generally Available
The Apache Software Foundation has recently announced the general availability of Apache Hudi 1.0, the transactional data lake platform with support for near real-time analytics. Initially introduced in 2017, Apache Hudi provides an open table format optimized for efficient writes in incremental data pipelines and fast query performance.
-
Uber’s Journey to Modernizing Big Data Infrastructure with Google Cloud Platform
In a recent post on its official engineering blog, Uber, disclosed its strategy to migrate the batch data analytics and machine learning (ML) training stack to Google Cloud Platform (GCP). Uber, runs one of the largest Hadoop installations in the world, managing over an exabyte of data across tens of thousands of servers in each of its two regions
-
LinkedIn Migrates away from Lambda Architecture to Reduce Complexity
Software engineers from LinkedIn recently published how they migrated away from a Lambda architecture. The Lambda architecture implementation caused their solution to have high operational overhead and added complexity, leading to slow product iteration times. As a result, the engineers chose to migrate to a Lambda-less architecture, resulting in significant development velocity improvements.
-
ApacheCon 2019 Keynote: Google Cloud Enhances Big-Data Processing with Kubernetes
At ApacheCon North America, Christopher Crosbie gave a keynote talk title "Yet Another Resource Negotiator for Big Data? How Google Cloud is Enhancing Data Lake Processing with Kubernetes." He highlighted Google's efforts to make Apache big-data software "cloud native" by developing open-source Kubernetes Operators to provide control planes for running Apache software in a Kubernetes cluster.
-
Google Introduces Cloud Storage Connector for Hadoop Big Data Workloads
In a recent blog post, Google announced a new Cloud Storage connector for Hadoop. This new capability allows organizations to substitute their traditional HDFS with Google Cloud Storage. Columnar file formats such as Parquet and ORC may realize increased throughput, and customers will benefit from Cloud Storage directory isolation, lower latency, increased parallelization and intelligent defaults
-
The Evolution of Uber’s 100+ Petabyte Big Data Platform
Uber’s engineering team wrote about how their big data platform evolved from traditional ETL jobs with relational databases to one based on Hadoop and Spark. A scalable ingestion model, standard transfer format and a custom library for incremental updates are the key components of the platform.
-
Cloudera and Hortonworks Merge with Goal to Increase Competition with Cloud Offerings
Earlier this month, Cloudera and Hortonworks announced an all-stock merger at a combined value of around $5.2 billion. Analysts have argued that this merger is aimed at increased competition that both companies are facing from cloud vendors like Amazon, Google and Microsoft. In this article we log reactions from analysts and the industry, and the implications for current customers.
-
Q&A with Microsoft's Arindam Chatterjee Discussing Azure HDInsight 4.0
InfoQ caught up with Arindam Chatterjee, principal group manager at Microsoft, regarding the announcements about HDInsight at Microsoft Ignite.
-
Q&A with Saumitra Buragohain on Hortonworks Data Platform 3.0
InfoQ caught up with Saumitra Buragohain, senior director of Product Management at Hortonworks, regarding Hadoop in general and HDP 3.0 in particular.
-
Apache HBase 1.3 Ships with Multiple Performance Improvements
Apache HBase 1.3.0 was released mid-January 2017 and ships with support for date-based tiered compaction and improvements in multiple areas, like write-ahead log (WAL), and a new RPC scheduler, among others. The release includes almost 1,700 resolved issues in total.
-
Glenn Tamkin on Applying Apache Hadoop to NASA's Big Climate Data
NASA Center for Climate Simulation (NCCS) is using Apache Hadoop for high-performance data analytics. Glenn Tamkin from NASA team, recently spoke at ApacheCon Conference and shared the details of the platform they built for climate data analysis with Hadoop.
-
Pivotal Open Sources Their Big Data Suite
Pivotal has decided to open source core components of their Big Data Suite and has announced the Open Data Platform, an initiative promoting open source and standardization for Big Data.
-
Project Myriad: Mesos and YARN Working Together
An article by Jin Scott - A tale of two clusters: Mesos and YARN – describes hardware silos created by using different resource managers on different hardware clusters, most popular being Mesos and Yarn and introduces Myriad – a solution allowing to run a YARN cluster on Mesos.