Infrastructure Content on InfoQ

Articles

AI, ML & Data Engineering

Disaggregation in Large Language Models: the Next Evolution in AI Infrastructure

Large Language Model (LLM) inference faces a fundamental challenge: the same hardware that excels at processing input prompts struggles with generating responses, and vice versa. Disaggregated serving architectures solve this by separating these distinct computational phases, delivering throughput improvements and better resource utilization while reducing costs.

Anat Heilper
on Sep 29, 2025
Cloud

Ransomware-Resilient Storage: the New Frontline Defense in a High-Stakes Cyber Battle

Cybersecurity has evolved, with ransomware now primarily targeting data storage and backups. To combat this, modern defense strategies focus on making storage systems more resilient. Key tactics include using immutable storage that prevents data from being altered or deleted, employing AI-powered detection, and implementing air-gapping to create isolated, tamper-proof recovery points.

Arjun Mullick
on Aug 25, 2025
Cloud

Zero-Downtime Critical Cloud Infrastructure Upgrades at Scale

Engineers can avoid common pitfalls in large-scale infrastructure upgrades by studying others' experiences. The article provides lessons learned from big firms like eBay and Snowflake, offering solutions for legacy systems, performance validation, and rollback planning. It emphasizes systematic preparation and clear communication to handle challenges and ensure zero-downtime upgrades at scale.

Kiran Bhat
on Aug 18, 2025
Architecture & Design

One Network: Cloud-Agnostic Service and Policy-Oriented Network Architecture

Unifying software infrastructure shortens development time and makes large, distributed systems easier to control through clear policies. In this QCon SF 2024 presentation, Anna Berenberg shared lessons and achievements from building One Network, addressing complex infrastructure layers, open-source integration, and uniform policy enforcement for improved reliability and security.

Anna Berenberg
on Aug 12, 2025
DevOps

Ceph RBD Turns 15: a Story of Open Source Creation

Fifteen years ago, Ceph RBD began as a community-driven idea that grew into essential infrastructure powering today's cloud platforms. This insider story from Yehuda Sadeh-Weinraub reveals how two developers started a distributed storage project that now supports OpenStack and Kubernetes through transparent, collaborative development.

Yehuda Sadeh-Weinraub
on Jul 07, 2025
DevOps

Analyzing Apache Kafka Stretch Clusters: WAN Disruptions, Failure Scenarios, and DR Strategies

This article analyzes the dynamics of Apache Kafka stretch clusters, assessing WAN disruptions and failure scenarios and outlining Disaster Recovery (DR) strategies. It focuses on maintaining high availability and data integrity across multi-region deployments, and on safeguarding vital services against service level agreement violations.

Srikanth Daggumalli, Nishchai Jayanna Manjula
on Jun 20, 2025
Cloud

Designing Resilient Event-Driven Systems at Scale

Learn how to design resilient event-driven systems that scale. Explore key patterns like shuffle sharding and decoupling queues to handle load spikes and failures. Understand common pitfalls like over-relying on retries and neglecting observability for robust, scalable architectures.

Rajesh Kumar Pandey
on May 30, 2025
Development

Binary Size Matters: the Challenges of Fitting Complex Applications in Storage-Constrained Devices

This article explores developing software for microcontrollers in C or C++, where the main constraints are limited volatile memory and the embedded hardware platform on which the software runs. It shows how to adopt languages like C++ while optimizing for binary size under stringent hardware constraints, and how to trade off runtime efficiency against binary size in architecture decisions.

Paulo Martinez
on May 16, 2025
Architecture & Design

Legacy Modernization: Architecting Real-Time Systems around a Mainframe

At its heart, our transformation journey is about breaking dependencies at multiple levels. Many enterprises face similar challenges with legacy systems: tightly coupled architectures that are difficult to scale, change, or maintain. For us at National Grid, the solution came through four complementary paradigms that worked together to enable different forms of decoupling.

Jason Roberts, Sonia Mathew
on Apr 30, 2025
Development

How to Compute without Looking: a Sneak Peek into Secure Multi-Party Computation

This article shows how you can compute a function across multiple parties that do not trust each other without forcing them to share their individual inputs. This technique can be used to split secrets among parties, perform logical operations, or count votes in a way that ensures data privacy is preserved.

Debasish Ray Chawdhuri
on Mar 31, 2025
AI, ML & Data Engineering

Eclipse LMOS: Launching AI Agents across Europe at Breakneck Speed

In this talk, the speakers share some of their company's key lessons from developing customer-facing LLM-powered applications deployed across Europe. They used multi-agent architecture and systems design to create an open-source set of tools, a framework, and a full-fledged platform to accelerate the development of AI agents. This is a summary of a presentation from InfoQ Dev Summit Boston 2024.

Arun Joseph, Patrick Whelan
on Feb 17, 2025
Architecture & Design

Transforming Legacy Healthcare Systems: a Journey to Cloud-Native Architecture

Discover how Livi navigated the complexities of transitioning MJog, a legacy healthcare system, to a cloud-native architecture, sharing valuable insights for successful tech modernization. Our experience illustrates that transitioning from legacy systems to cloud-based microservices is not a one-time project, but an ongoing journey.

Leander Vanderbijl
on Nov 18, 2024
