Kubernetes in the Enterprise
Over a decade in, Kubernetes is the central force in modern application delivery. However, as its adoption has matured, so have its challenges: sprawling toolchains, complex cluster architectures, escalating costs, and the balancing act between developer agility and operational control. Beyond running Kubernetes at scale, organizations must also tackle the cultural and strategic shifts needed to make it work for their teams.

As the industry pushes toward more intelligent and integrated operations, platform engineering and internal developer platforms are helping teams address issues like Kubernetes tool sprawl, while AI continues cementing its usefulness for optimizing cluster management, observability, and release pipelines.

DZone's 2025 Kubernetes in the Enterprise Trend Report examines the realities of building and running Kubernetes in production today. Our research and expert-written articles explore how teams are streamlining workflows, modernizing legacy systems, and using Kubernetes as the foundation for the next wave of intelligent, scalable applications. Whether you're on your first prod cluster or refining a globally distributed platform, this report delivers the data, perspectives, and practical takeaways you need to meet Kubernetes' demands head-on.
Done well, knowledge base integrations enable AI agents to deliver specific, context-rich answers without forcing employees to dig through endless folders. Done poorly, they introduce security gaps and permissioning mistakes that erode trust. The challenge for software developers building these integrations is that no two knowledge bases handle permissions the same way. One might gate content at the space level, another at the page level, and a third at the attachment level. Adding to these challenges, permissions aren't static. They change when people join or leave teams, switch roles, or when content owners update visibility rules. If your integration doesn't mirror these controls accurately and in real time, you risk exposing the wrong data to the wrong person. In building these knowledge base integrations ourselves, we've learned lots of practical tips for how to build secure, maintainable connectors that shorten the time to deployment without cutting corners on data security. 1. Treat Permissions as a First-Class Data Type Too many integration projects prioritize syncing content over permissions. This approach is backwards. Before your AI agent processes a single page, it should understand the permission model of the source system and be able to represent it internally. This means: Mapping every relevant permission scope in the source system (space, folder, page, attachment, comment).Representing permissions in your data model so your AI agent can enforce them before returning a result.Designing for exceptions. For example, if an article is generally public within a department but contains one restricted attachment, your connector should respect that partial restriction. For example, in a Confluence integration, you should check both space-level and page-level rules for each request. If you cache content to speed up retrieval, you must also cache the permissions and invalidate them promptly when they change. 2. Sync Permissions as Often as Content Permissions drift quickly. Someone might be promoted, transferred, or removed from a sensitive project, and the content they previously accessed is suddenly off-limits. Your AI agent should never rely on a stale permission snapshot. A practical approach is to tie permission updates to the same sync cadence as content updates. If you're fetching new or updated articles every five minutes, refresh the associated access control lists (ACLs) on the same schedule. If the source system supports webhooks or event subscriptions for permission changes, use them to trigger targeted re-syncs. 3. Respect the Principle of Least Privilege in Responses Enforcing permissions also shapes what your AI agent returns. For example, say your AI agent receives the query, "What are the latest results from our employee engagement survey?" The underlying knowledge base contains a page with survey results visible only to HR and executives. Even if the query perfectly matches the page's content, the agent should respond with either no result or a message indicating that the content is restricted. This means filtering retrieved documents at query time based on the current user's identity and permissions, not just when content is first synced. Retrieval-augmented generation (RAG) pipelines need this filter stage before passing context to the LLM. 4. Normalize Data Without Flattening Security Every knowledge base stores content differently, whether that's nested pages in Confluence, blocks in Notion, or articles in Zendesk. 
Normalizing these formats makes it easier for your AI agent to handle multiple systems. But normalization should never strip away the original permission structures. For instance, when creating a unified search index, store both the normalized text and the original system's permission metadata. Your query service can then enforce the correct rules regardless of which source system the content came from. 5. Handle Hierarchies and Inheritance Carefully Most systems allow permission inheritance, where you grant access to a top-level space, and then all child pages inherit those rights unless overridden. Your connector must understand and replicate this logic. For example, with an internal help desk AI agent, a "VPN Troubleshooting" article may inherit view rights from its parent "Network Resources" space. But if someone restricts that one article to a smaller group, your integration must override the inherited rule and enforce the more restrictive setting. 6. Test With Realistic, Complex Scenarios Permission bugs often hide in edge cases: Mixed inheritance and explicit restrictionsUsers with multiple, overlapping rolesAttachments with different permissions than their parent page Developers should build a test harness that mirrors these conditions using anonymized or synthetic data. Validate not only that your AI agent can fetch the right content, but that it never exposes restricted data, even when queried indirectly ("What did the survey results say about the marketing team?"). 7. Build for Ongoing Maintenance A secure, reliable knowledge base integration isn't a "set it and forget it" feature. It's an active part of your AI agent's architecture. Once deployed, knowledge base integrations require constant upkeep: API version changes, evolving permission models, and shifts in organizational structure. Assign ownership for monitoring and updating each connector, and automate regression tests for permission enforcement. Document your mapping between source-system roles and internal permission groups so that changes can be made confidently when needed. By giving permissions the same engineering rigor as content retrieval, you protect sensitive data and preserve trust in the system. That trust is what ultimately allows these AI agents to be embedded into the real workflows where they deliver the most value. You may be looking at the steps involved in building knowledge base connectors and wonder why they matter. When implemented well, they can transform workflows: Enterprise AI search: By integrating with a company's wiki, CRM, and file storage, a search agent can answer multi-step queries like, "What's the status of the Acme deal?" pulling from sales notes, internal strategy docs, and shared project plans. Permissions ensure that deal details remain visible only to the account team.IT help desk agent: When connected to a knowledge base, the agent can deliver precise, step-by-step troubleshooting guides to employees. If a VPN setup page is restricted to IT staff, the agent won't surface it to non-IT users.New hire onboarding bot: Integrated with the company wiki and messaging platform, an agent can answer questions about policies, teams, and tools. Each answer is filtered through the same rules that would apply if the employee searched manually. These examples work not because the AI agent "knows everything," but because it knows how to retrieve the right things for the right person at the right time. 
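To make the query-time filtering from point 3 concrete, here is a minimal Java sketch. PermissionStore and RetrievedDocument are hypothetical stand-ins for whatever your connector and search index expose; the point is that the check runs on every request, against the current user's identity, before any retrieved text reaches the LLM.

Java
import java.util.List;

// Hypothetical types: a document retrieved from the unified index, carrying the
// source system's permission metadata, and a store that answers "can this user
// see this document right now?"
record RetrievedDocument(String sourceSystem, String documentId, String text) {}

interface PermissionStore {
    boolean canRead(String userId, String sourceSystem, String documentId);
}

class PermissionAwareRetriever {
    private final PermissionStore permissions;

    PermissionAwareRetriever(PermissionStore permissions) {
        this.permissions = permissions;
    }

    // Filter candidate documents against the current user's permissions at query
    // time, so restricted content never becomes LLM context, even if it matches.
    List<RetrievedDocument> filterForUser(String userId, List<RetrievedDocument> candidates) {
        return candidates.stream()
                .filter(doc -> permissions.canRead(userId, doc.sourceSystem(), doc.documentId()))
                .toList();
    }
}

The same filter can back all three examples above; only the PermissionStore implementation changes per source system.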
As knowledge base products become the standard for AI agents, it's critical to manage integrations in a way that prioritizes data security and trust.
Many outdated or imprecise claims about transaction isolation levels in MongoDB persist. These claims are outdated because they are often based on MongoDB 4.0, the version where multi-document transactions were introduced (the old Jepsen report, for example), and the issues found there have been fixed since then. They are also imprecise because people attempt to map MongoDB's transaction isolation to SQL isolation levels, which is inappropriate, as the SQL standard definitions ignore Multi-Version Concurrency Control (MVCC), utilized by most databases, including MongoDB. Martin Kleppmann has discussed this issue and provided tests to assess transaction isolation and potential anomalies. I will conduct these tests on MongoDB to explain how multi-document transactions work and avoid anomalies. I followed the structure of Martin Kleppmann's tests on PostgreSQL and ported them to MongoDB.

The read isolation level in MongoDB is controlled by the read concern. The "snapshot" read concern is the only one comparable to what other MVCC SQL databases provide: it maps to Snapshot Isolation, commonly (and improperly) called Repeatable Read when the closest SQL standard term is used. As I test on a single-node lab, I use "majority" to show that it does more than Read Committed. The write concern should also be set to "majority" to ensure that at least one node is common between the read and write quorums.

Recap on Isolation Levels in MongoDB

Let me quickly explain the other isolation levels and why they cannot be mapped to the SQL standard:

readConcern: { level: "local" } is sometimes compared to Read Uncommitted because it may show a state that can later be rolled back in case of failure. However, some SQL databases may show the same behavior in some rare conditions (example here) and still call that Read Committed.
readConcern: { level: "majority" } is sometimes compared to Read Committed because it avoids uncommitted reads. However, Read Committed was defined for wait-on-conflict databases to reduce the lock duration in two-phase locking, whereas MongoDB multi-document transactions use fail-on-conflict to avoid waits. Some databases consider that Read Committed can allow reads from multiple states (example here), while others consider that it must be a statement-level snapshot isolation (examples here). In a multi-shard transaction, "majority" may show a result from multiple states; only "snapshot" is consistent with a single point in the timeline.
readConcern: { level: "snapshot" } is the real equivalent of Snapshot Isolation and prevents more anomalies than Read Committed. Some databases even call that "serializable" (example here) because the SQL standard ignores the write-skew anomaly.
readConcern: { level: "linearizable" } is comparable to Serializable, but only for a single document, and is not available for multi-document transactions. This is similar to many SQL databases that do not provide Serializable because it reintroduces the scalability problems of read locks, which MVCC avoids.

Read Committed Basic Requirements (G0, G1a, G1b, G1c)

Here are some tests for anomalies typically prevented in Read Committed. I'll run them with readConcern: { level: "majority" }, but keep in mind that readConcern: { level: "snapshot" } may be better if you want a consistent snapshot across multiple shards.
MongoDB Prevents Write Cycles (G0) With Conflict Error JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); In a two-phase locking database, with wait-on-conflict behavior, the second transaction would wait for the first one to avoid anomalies. However, MongoDB with transactions is fail-on-conflict and raises a retriable error to avoid the anomaly. Each transaction touched only one document, but it was declared explicitly with a session and startTransaction(), to allow multi-document transactions, and this is why we observed the fail-on-conflict behavior to let the application apply its retry logic for complex transactions. If the conflicting update was run as a single-document transaction, equivalent to an auto-commit statement, it would have used a wait-on-conflict behavior. I can test it by immediately running this while the t1 transaction is still active: JavaScript const db = db.getMongo().getDB("test_db"); print(`Elapsed time: ${ ((startTime = new Date()) && db.test.updateOne({ _id: 1 }, { $set: { value: 12 } })) && (new Date() - startTime) } ms`); Elapsed time: 72548 ms I've run the updateOne({ _id: 1 }) without an implicit transaction. It waited for the other transaction to terminate, which happened after a 60-second timeout, and then the update was successful. The first transaction that timed out is aborted: JavaScript session1.commitTransaction(); MongoServerError[NoSuchTransaction]: Transaction with { txnNumber: 2 } has been aborted. The behavior of conflict in transactions differs: wait-on-conflict for implicit single-document transactionsfail-on-conflict for explicit multiple-document transactions immediately, resulting in a transient error, without waiting, to let the application rollback and retry. MongoDB Prevents Aborted Reads (G1a) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 101 } }); T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] session1.abortTransaction(); T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] session2.commitTransaction(); MongoDB prevents reading an aborted transaction by reading only the committed value when Read Concern is 'majority' or 'snapshot.' 
MongoDB Prevents Intermediate Reads (G1b) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 101 } }); T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] The non-committed change from T1 is not visible to T2. JavaScript T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); session1.commitTransaction(); // T1 commits T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] The committed change from T1 is still not visible to T2 because it happened after T2 started. This is different from the majority of Multi-Version Concurrency Control SQL databases. To minimize the performance impact of wait-on-conflict, they reset the read time before each statement in Read Committed, as phantom reads are allowed. They would have displayed the newly committed value with this example. MongoDB never does that; the read time is always the start of the transaction, and no phantom read anomaly happens. However, it doesn't wait to see if the conflict is resolved or must fail with a deadlock, and fails immediately to let the application retry it. MongoDB Prevents Circular Information Flow (G1c) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 22 } }); T1.test.find({ _id: 2 }); [ { _id: 2, value: 20 } ] T2.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] session1.commitTransaction(); session2.commitTransaction(); In both transactions, the uncommitted changes are not visible to others. MongoDB Prevents Observed Transaction Vanishes (OTV) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T3 const session3 = db.getMongo().startSession(); const T3 = session3.getDatabase("test_db"); session3.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T1.test.updateOne({ _id: 2 }, { $set: { value: 19 } }); T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. 
:: Please retry your operation or multi-document transaction. This anomaly is prevented by fail-on-conflict with an explicit transaction. With an implicit single-document transaction, it would have to wait for the conflicting transaction to end. MongoDB Prevents Predicate-Many-Preceders (PMP) With a SQL database, this anomaly would require the Snapshot Isolation level because Read Committed uses different read times per statement. However, I can show it in MongoDB with 'majority' read concern, 'snapshot' being required only to get cross-shard snapshot consistency. JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ value: 30 }).toArray(); [] T2.test.insertOne( { _id: 3, value: 30 } ); session2.commitTransaction(); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [] The newly inserted row is not visible because it was committed by T2 after the start of T1. Martin Kleppmann's tests include some variations with a delete statement and a write predicate: JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateMany({}, { $inc: { value: 10 } }); T2.test.deleteMany({ value: 20 }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction. As it is an explicit transaction, rather than blocking, the delete detects the conflict and raises a retriable exception to prevent the anomaly. Compared to PostgreSQL, which prevents that in Repeatable Read, it saves the waiting time before failure, but requires the application to implement a retry logic. MongoDB Prevents Lost Update (P4) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction. 
As it is an explicit transaction, the update doesn't wait and raises a retriable exception, so that it is impossible to overwrite the other update without waiting for its completion. MongoDB Prevents Read Skew (G-single) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find({ _id: 2 }); [ { _id: 2, value: 20 } ] T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 18 } }); session2.commitTransaction(); T1.test.find({ _id: 2 }); [ { _id: 2, value: 20 } ] In SQL databases with Read Committed isolation, a read skew anomaly could display the value 18. However, MongoDB avoids this issue by reading the same value of 20 consistently throughout the transaction, as it reads data as of the start of the transaction. Martin Kleppmann's tests include a variation with predicate dependency: JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.findOne({ value: { $mod: [5, 0] } }); { _id: 1, value: 10 } T2.test.updateOne({ value: 10 }, { $set: { value: 12 } }); session2.commitTransaction(); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [] The uncommitted value 12 which is a multiple of 3 is not visible to the transaction that started before. Another test includes a variation with a write predicate in a delete statement: JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 18 } }); session2.commitTransaction(); T1.test.deleteMany({ value: 20 }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction. This read skew anomaly is prevented by the fail-on-conflict behavior when writing a document that has uncommitted changes from another transaction. 
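Several of the tests above end in a retriable WriteConflict labeled as a TransientTransactionError. On the application side, the usual way to handle this is the driver's transaction helper, which re-runs the transaction body for you. Below is a minimal sketch with the MongoDB Java sync driver; the connection string is an assumption, and the test_db/test namespace simply mirrors the lab setup above.

Java
import com.mongodb.ReadConcern;
import com.mongodb.TransactionOptions;
import com.mongodb.WriteConcern;
import com.mongodb.client.ClientSession;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public class UpdateWithRetry {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> test =
                    client.getDatabase("test_db").getCollection("test");

            TransactionOptions txnOptions = TransactionOptions.builder()
                    .readConcern(ReadConcern.SNAPSHOT)
                    .writeConcern(WriteConcern.MAJORITY)
                    .build();

            try (ClientSession session = client.startSession()) {
                // withTransaction re-runs the body when the server returns a
                // TransientTransactionError (for example, a WriteConflict),
                // instead of leaving the retry loop to hand-written code.
                session.withTransaction(() -> {
                    test.updateOne(session, eq("_id", 1), set("value", 11));
                    test.updateOne(session, eq("_id", 2), set("value", 19));
                    return null;
                }, txnOptions);
            }
        }
    }
}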
Write Skew (G2-item) Must Be Managed by the Application

JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: { $in: [1, 2] } }) [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] T2.test.find({ _id: { $in: [1, 2] } }) [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 21 } }); session1.commitTransaction(); session2.commitTransaction(); MongoDB doesn't detect the read/write conflict when one transaction has read a value updated by the other, and then writes something that may have depended on this value. The Read Concern doesn't provide the Serializable guarantee. Such isolation requires acquiring range or predicate locks during reads, and doing it prematurely would hinder the performance of a database designed to scale. For the transactions that need to avoid this, the application can transform the read/write conflict into a write/write conflict by updating a field in the document that was read, to be sure that other transactions do not modify it, or by re-checking the value when updating.

Anti-Dependency Cycles (G2) Must Be Managed by the Application

JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } }); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [] T2.test.find({ value: { $mod: [3, 0] } }).toArray(); [] T1.test.insertOne( { _id: 3, value: 30 } ); T2.test.insertOne( { _id: 4, value: 42 } ); session1.commitTransaction(); session2.commitTransaction(); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [ { _id: 3, value: 30 }, { _id: 4, value: 42 } ] The read/write conflict was not detected, and both transactions were able to write, even if they may have depended on a previous read that had been modified by the other transaction. MongoDB does not acquire locks across read and write calls. If you run a multi-document transaction where the writes depend on the reads, the application must explicitly write to the read set in order to detect the write conflict and avoid the anomaly. All those tests were based on https://github.com/ept/hermitage. There's a lot of information about MongoDB transactions in the MongoDB Multi-Document ACID Transactions whitepaper from 2020. While the document model offers simplicity and performance when a single document matches the business transaction, MongoDB supports multi-statement transactions with Snapshot Isolation, similar to many SQL databases using Multi-Version Concurrency Control (MVCC), but favoring fail-on-conflict rather than wait.
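For the write-skew cases above (G2-item and G2), the guard just described, turning a read/write dependency into a write/write conflict by touching the read set, could look like the following, continuing the Java driver sketch from the previous section. The "version" field is purely illustrative.

Java
// Continuing the earlier sketch (same session, test collection, and txnOptions):
// T reads {_id: 2} to decide how to update {_id: 1}. By also writing to the
// document it only read (bumping an illustrative "version" field with
// Updates.inc), any concurrent transaction that modifies {_id: 2} now triggers
// a WriteConflict instead of a silent write skew, and withTransaction retries
// the losing transaction.
session.withTransaction(() -> {
    Document other = test.find(session, eq("_id", 2)).first();
    // ... business decision based on other.get("value") ...
    test.updateOne(session, eq("_id", 2), inc("version", 1)); // touch the read set
    test.updateOne(session, eq("_id", 1), set("value", 11));
    return null;
}, txnOptions);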
Despite outdated myths surrounding NoSQL or based on old versions, its transaction implementation is robust and effectively prevents common transactional anomalies.
As you may have already guessed from the title, the topic for today will be Spring Boot WebSockets. Some time ago, I provided an example of WebSocket chat based on Akka toolkit libraries. However, this chat will have somewhat more features and a quite different design. I will skip some parts so as not to duplicate too much content from the previous article. Here you can find a more in-depth intro to WebSockets. Please note that all the code that's used in this article is also available in the GitHub repository.

Spring Boot WebSocket: Tools Used

Let's start the technical part of this text with a description of the tools that will be used to implement the whole application. As I cannot fully grasp how to build a real WebSocket API with the classic Spring STOMP overlay, I decided to go for Spring WebFlux and make everything reactive.

Spring Boot – No modern Java app based on Spring can exist without Spring Boot; all the autoconfiguration is priceless.
Spring WebFlux – A reactive version of classic Spring; it provides quite a nice and descriptive toolkit for handling both WebSockets and REST. I would dare to say that it is the only way to actually get WebSocket support in Spring.
Mongo – One of the most popular NoSQL databases; I am using it for storing message history.
Spring Reactive Mongo – Spring Boot starter for handling Mongo access in a reactive fashion. Using reactive in one place but not the other is not the best idea, so I decided to make DB access reactive as well.

Let's start the implementation!

Spring Boot WebSocket: Implementation

Dependencies and Config

pom.xml XML <dependencies> <!--Compile--> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-webflux</artifactId> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-data-mongodb-reactive</artifactId> </dependency> </dependencies>

application.properties Properties files spring.data.mongodb.uri=mongodb://chats-admin:admin@localhost:27017/chats

I prefer .properties over .yml; in my honest opinion, YAML is not readable and becomes unmaintainable at a larger scale.

WebSocketConfig Java @Configuration class WebSocketConfig { @Bean ChatStore chatStore(MessagesStore messagesStore) { return new DefaultChatStore(Clock.systemUTC(), messagesStore); } @Bean WebSocketHandler chatsHandler(ChatStore chatStore) { return new ChatsHandler(chatStore); } @Bean SimpleUrlHandlerMapping handlerMapping(WebSocketHandler wsh) { Map<String, WebSocketHandler> paths = Map.of("/chats/{id}", wsh); return new SimpleUrlHandlerMapping(paths, 1); } @Bean WebSocketHandlerAdapter webSocketHandlerAdapter() { return new WebSocketHandlerAdapter(); } }

And surprise, all four beans defined here are very important.

ChatStore – Custom bean for operating on chats; I will go into more detail on this bean in the following steps.
WebSocketHandler – Bean that will hold all the logic related to handling WebSocket sessions.
SimpleUrlHandlerMapping – Responsible for mapping URLs to the correct handler; the full URL for this one will look more or less like ws://localhost:8080/chats/{id}.
WebSocketHandlerAdapter – A kind of capability bean; it adds WebSocket handling support to the Spring WebFlux DispatcherHandler.
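Since the mapping above exposes ws://localhost:8080/chats/{id}, here is a quick programmatic way to poke the endpoint once the application is running, using Spring's ReactorNettyWebSocketClient. It is only a sketch and assumes a chat with ID 1 already exists (created through the REST API described below).

Java
import java.net.URI;
import java.time.Duration;

import org.springframework.web.reactive.socket.WebSocketMessage;
import org.springframework.web.reactive.socket.client.ReactorNettyWebSocketClient;
import reactor.core.publisher.Mono;

public class ChatClientSmokeTest {
    public static void main(String[] args) {
        ReactorNettyWebSocketClient client = new ReactorNettyWebSocketClient();
        client.execute(
                URI.create("ws://localhost:8080/chats/1"), // assumes chat 1 exists
                session -> session
                        // send one message, then print everything we receive
                        .send(Mono.just(session.textMessage("hello from the sketch")))
                        .thenMany(session.receive()
                                .map(WebSocketMessage::getPayloadAsText)
                                .doOnNext(System.out::println))
                        .then())
                .block(Duration.ofSeconds(10));
    }
}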
ChatsHandler

Java class ChatsHandler implements WebSocketHandler { private final Logger log = LoggerFactory.getLogger(ChatsHandler.class); private final ChatStore store; ChatsHandler(ChatStore store) { this.store = store; } @Override public Mono<Void> handle(WebSocketSession session) { String[] split = session.getHandshakeInfo() .getUri() .getPath() .split("/"); String chatIdStr = split[split.length - 1]; int chatId = Integer.parseInt(chatIdStr); ChatMeta chatMeta = store.get(chatId); if (chatMeta == null) { return session.close(CloseStatus.GOING_AWAY); } if (!chatMeta.canAddUser()) { return session.close(CloseStatus.NOT_ACCEPTABLE); } String sessionId = session.getId(); store.addNewUser(chatId, session); log.info("New user {} joined the chat {}", sessionId, chatId); return session .receive() .map(WebSocketMessage::getPayloadAsText) .flatMap(message -> store.addNewMessage(chatId, sessionId, message)) .flatMap(message -> broadcastToSessions(sessionId, message, store.get(chatId).sessions())) .doFinally(sig -> store.removeSession(chatId, session.getId())) .then(); } private Mono<Void> broadcastToSessions(String sessionId, String message, List<WebSocketSession> sessions) { return sessions .stream() .filter(session -> !session.getId().equals(sessionId)) .map(session -> session.send(Mono.just(session.textMessage(message)))) .reduce(Mono.empty(), Mono::then); } }

As I mentioned above, here you can find all the logic related to handling WebSocket sessions. First, we parse the ID of a chat from the URL to get the target chat, and we respond with different close statuses depending on the state of that particular chat. Additionally, I am broadcasting each message to all the sessions related to the particular chat, so that users can actually exchange messages. I have also added a doFinally trigger that will clear closed sessions from the ChatStore, to reduce redundant communication. As the code is reactive as a whole, there are some restrictions I need to follow. I have tried to make it as simple and readable as possible; if you have any ideas on how to improve it, I am open to them.

ChatRouter

Java @Configuration(proxyBeanMethods = false) class ChatRouter { private final ChatStore chatStore; ChatRouter(ChatStore chatStore) { this.chatStore = chatStore; } @Bean RouterFunction<ServerResponse> routes() { return RouterFunctions .route(POST("api/v1/chats/create"), e -> create(false)) .andRoute(POST("api/v1/chats/create-f2f"), e -> create(true)) .andRoute(GET("api/v1/chats/{id}"), this::get) .andRoute(DELETE("api/v1/chats/{id}"), this::delete); } }

WebFlux's approach to defining REST endpoints is quite different from classic Spring. Above, you can see the definition of four endpoints for managing chats. Similar to the Akka implementation, I want a REST API for managing chats and a WebSocket API for handling the chats themselves. I will skip the function implementations as they are pretty trivial; you can see them on GitHub.
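As a rough illustration of the WebFlux functional style, the create function referenced in the routes might look something like the sketch below (it slots into the ChatRouter class above, using org.springframework.web.reactive.function.server types). The real implementation is on GitHub; the response shape and URL here are assumptions.

Java
// A sketch only: creates a chat through ChatStore and returns its WebSocket URL.
private Mono<ServerResponse> create(boolean isF2F) {
    int chatId = chatStore.create(isF2F);
    return ServerResponse.status(HttpStatus.CREATED)
            .contentType(MediaType.APPLICATION_JSON)
            .bodyValue(Map.of(
                    "chatId", chatId,
                    "url", "ws://localhost:8080/chats/" + chatId));
}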
ChatStore

First, the interface:

Java public interface ChatStore { int create(boolean isF2F); void addNewUser(int id, WebSocketSession session); Mono<String> addNewMessage(int id, String userId, String message); void removeSession(int id, String session); ChatMeta get(int id); ChatMeta delete(int id); }

Then the implementation:

Java public class DefaultChatStore implements ChatStore { private final Map<Integer, ChatMeta> chats; private final AtomicInteger idGen; private final MessagesStore messagesStore; private final Clock clock; public DefaultChatStore(Clock clock, MessagesStore store) { this.chats = new ConcurrentHashMap<>(); this.idGen = new AtomicInteger(0); this.clock = clock; this.messagesStore = store; } @Override public int create(boolean isF2F) { int newId = idGen.incrementAndGet(); ChatMeta chatMeta = chats.computeIfAbsent(newId, id -> { if (isF2F) { return ChatMeta.ofId(id); } return ChatMeta.ofIdF2F(id); }); return chatMeta.id; } @Override public void addNewUser(int id, WebSocketSession session) { chats.computeIfPresent(id, (k, v) -> v.addUser(session)); } @Override public void removeSession(int id, String sessionId) { chats.computeIfPresent(id, (k, v) -> v.removeUser(sessionId)); } @Override public Mono<String> addNewMessage(int id, String userId, String message) { ChatMeta meta = chats.getOrDefault(id, null); if (meta != null) { Message messageDoc = new Message(id, userId, meta.offset.getAndIncrement(), clock.instant(), message); return messagesStore.save(messageDoc) .map(Message::getContent); } return Mono.empty(); } // omitted

The base of ChatStore is the ConcurrentHashMap that holds the metadata of all open chats. Most of the methods from the interface are self-explanatory, and there is nothing special behind them.

create – Creates a new chat, with a boolean attribute denoting whether the chat is f2f or group.
addNewUser – Adds a new user to an existing chat.
removeSession – Removes a user's session from an existing chat.
get – Gets the metadata of the chat with a given ID.
delete – Deletes the chat from the ConcurrentHashMap.

The only complex method here is addNewMessage. It increments the message counter within the chat and persists the message content in MongoDB for durability.

MongoDB Message Entity

Java public class Message { @Id private String id; private int chatId; private String owner; private long offset; private Instant timestamp; private String content;

This is the model for the message content stored in the database; there are three important fields here:

chatId – Represents the chat in which a particular message was sent.
owner – The userId of the message sender.
offset – Ordinal number of the message within the chat, used for retrieval ordering.

MessagesStore

Java public interface MessagesStore extends ReactiveMongoRepository<Message, String> {}

Nothing special: a classic Spring repository, but in a reactive fashion, providing a similar set of features as JpaRepository. It is used directly in ChatStore. Additionally, in the main application class, WebsocketsChatApplication, I am activating reactive repositories by using @EnableReactiveMongoRepositories. Without this annotation, the MessagesStore from above would not work. And here we go, we have the whole chat implemented. Let's test it!

Spring Boot WebSocket: Testing

For tests, I'm using Postman and Simple WebSocket Client. I'm creating a new chat using Postman. In the response body, I get a WebSocket URL to the recently created chat. Now it is time to use it and check whether users can communicate with one another. Simple WebSocket Client comes into play here, and I connect to the newly created chat with it.
Here we are, everything is working, and users can communicate with each other. There is one last thing to do: let's spend a moment looking at things that can be done better.

What Can Be Done Better

As what I have just built is the most basic chat app, there are a few (or, in fact, quite a lot of) things that may be done better. Below, I have listed the things I find worthy of improvement:

Authentication and rejoining support – Right now, everything is based on the sessionId. It is not an optimal approach. It would be better to have some authentication in place and actual rejoining based on user data.
Sending attachments – For now, the chat only supports simple text messages. While texting is the basic function of a chat, users enjoy exchanging images and audio files, too.
Tests – There are no tests for now, but why leave it like this? Tests are always a good idea.
Overflow in offset – Currently, it is a simple int. If we were to track the offset for a very long time, it would overflow sooner or later.

Summary

Et voilà! The Spring Boot WebSocket chat is implemented, and the main task is done. You now have some ideas on what to develop in the next steps. Please keep in mind that this chat case is very simple, and it will require lots of changes and development for any type of commercial project. Anyway, I hope that you learned something new while reading this article. Thank you for your time. These other resources might interest you:

Lock-Free Programming in Java
7 API Integration Patterns
You think you know your SDLC like the back of your carpal-tunnel-riddled hand: You've got your gates, your reviews, your carefully orchestrated dance of code commits and deployment pipelines. But here's a plot twist straight out of your auntie's favorite daytime soap: there's an evil twin lurking in your organization (cue the dramatic organ music). It looks identical to your SDLC — same commits, same repos, the same shiny outputs flowing into production. But this fake-goatee-wearing doppelgänger plays by its own rules, ignoring your security governance and standards. Welcome to the shadow SDLC — the one your team built with AI when you weren't looking: It generates code, dependencies, configs, and even tests at machine speed, but without any of your governance, review processes, or security guardrails. Checkmarx’s August Future of Application Security report, based on a survey of 1,500 CISOs, AppSec managers, and developers worldwide, just pulled back the curtain on this digital twin drama: 34% of developers say more than 60% of their code is now AI-generated. Only 18% of organizations have policies governing AI use in development. 26% of developers admit AI tools are being used without permission. It’s not just about insecure code sneaking into production, but rather about losing ownership of the very processes you’ve worked to streamline. Your “evil twin” SDLC comes with: Unknown provenance → You can’t always trace where AI-generated code or dependencies came from. Inconsistent reliability → AI may generate tests or configs that look fine but fail in production. Invisible vulnerabilities → Flaws that never hit a backlog because they bypass reviews entirely. This isn’t a story about AI being “bad”, but about AI moving faster than your controls — and the risk that your SDLC’s evil twin becomes the one in charge. The rest of this article is about how to prevent that. Specifically: How the shadow SDLC forms (and why it’s more than just code)The unique risks it introduces to security, reliability, and governanceWhat you can do today to take back ownership — without slowing down your team How the Evil Twin SDLC Emerges The evil twin isn’t malicious by design — it’s a byproduct of AI’s infiltration into nearly every stage of development: Code creation – AI writes large portions of your codebase at scale. Dependencies – AI pulls in open-source packages without vetting versions or provenance. Testing – AI generates unit tests or approves changes that may lack rigor. Configs and infra – AI auto-generates Kubernetes YAMLs, Dockerfiles, Terraform templates. Remediation – AI suggests fixes that may patch symptoms while leaving root causes. The result is a pipeline that resembles your own — but lacks the data integrity, reliability, and governance you’ve spent years building. Sure, It’s a Problem. But Is It Really That Bad? You love the velocity that AI provides, but this parallel SDLC compounds risk by its very nature. Unlike human-created debt, AI can replicate insecure patterns across dozens of repos in hours. And the stats from the FOA report speak for themselves: 81% of orgs knowingly ship vulnerable code — often to meet deadlines. 33% of developers admit they “hope vulnerabilities won’t be discovered” before release. 98% of organizations experienced at least one breach from vulnerable code in the past year — up from 91% in 2024 and 78% in 2023. The share of orgs reporting 4+ breaches jumped from 16% in 2024 to 27% in 2025. That surge isn’t random. 
It correlates with the explosive rise of AI use in development. As more teams hand over larger portions of code creation to AI without governance, the result is clear: risk is scaling at machine speed, too. Taking Back Control From the Evil Twin You can’t stop AI from reshaping your SDLC. But you can stop it from running rogue. Here’s how: 1. Establish Robust Governance for AI in Development Whitelist approved AI tools with built-in scanning and keep a lightweight approval workflow so devs don’t default to Shadow AI. Enforce provenance standards like SLSA or SBOMs for AI-generated code. Audit usage & tag AI contributions — use CodeQL to detect AI-generated code patterns and require devs to mark AI commits for transparency. This builds reliability and integrity into the audit trail. 2. Strengthen Supply Chain Oversight AI assistants are now pulling in OSS dependencies you didn’t choose — sometimes outdated, sometimes insecure, sometimes flat-out malicious. While your team already uses hygiene tools like Dependabot or Renovate, they’re only table stakes that don’t provide governance. They won’t tell you if AI just pulled in a transitive package with a critical vulnerability, or if your dependency chain is riddled with license risks. That’s why modern SCA is essential in the AI era. It goes beyond auto-bumping versions to: Generate SBOMs for visibility into everything AI adds to your repos. Analyze transitive dependencies several layers deep. Provide exploitable-path analysis so you prioritize what’s actually risky. Auto-updaters are hygiene. SCA is resilience. 3. Measure and Manage Debt Velocity Track debt velocity — measure how fast vulnerabilities are introduced and fixed across repos. Set sprint-based SLAs — if issues linger, AI will replicate them across projects before you’ve logged the ticket. Flag AI-generated commits for extra review to stop insecure patterns from multiplying. Adopt Agentic AI AppSec Assistants — The FOA report highlights that traditional remediation cycles can’t keep pace with machine-speed risk, making autonomous prevention and real-time remediation a necessity, not a luxury. 4. Foster a Culture of Reliable AI Use Train on AI risks like data poisoning and prompt injection. Make secure AI adoption part of the “definition of done.” Align incentives with delivery, not just speed. Create a reliable feedback loop — encourage devs to challenge governance rules that hurt productivity. Collaboration beats resistance. 5. Build Resilience for Legacy Systems Legacy apps are where your evil twin SDLC hides best. With years of accumulated debt and brittle architectures, AI-generated code can slip in undetected. These systems were built when cyber threats were far less sophisticated, lacking modern security features like multi-factor authentication, advanced encryption, and proper access controls. When AI is bolted onto these antiquated platforms, it doesn't just inherit the existing vulnerabilities, but can rapidly propagate insecure patterns across interconnected systems that were never designed to handle AI-generated code. The result is a cascade effect where a single compromised AI interaction can spread through poorly-secured legacy infrastructure faster than your security team can detect it. Here’s what’s often missed: Manual before automatic: Running full automation on legacy repos without a baseline can drown teams in false positives and noise. Start with manual SBOMs on the most critical apps to establish trust and accuracy, then scale automation. 
Triage by risk, not by age: Not every legacy system deserves equal attention. Prioritize repos with heavy AI use, repeated vulnerability patterns, or high business impact. Hybrid skills are mandatory: Devs need to learn how to validate AI-generated changes in legacy contexts, because AI doesn’t “understand” old frameworks. A dependency bump that looks harmless in 2025 might silently break a 2012-era API. Conclusion: Bring the ‘Evil Twin’ Back into the Family The “evil twin” of your SDLC isn’t going away. It’s already here, writing code, pulling dependencies, and shaping workflows. The question is whether you’ll treat it as an uncontrolled shadow pipeline — or bring it under the same governance and accountability as your human-led one. Because in today’s environment, you don’t just own the SDLC you designed. You also own the one AI is building — whether you control it or not. Interested to learn more about SDLC challenges in 2025 and beyond? More stats and insights are available in the Future of Appsec report mentioned above.
GitHub Copilot agent mode had several enhancements in VS Code as part of its July 2025 release, further bolstering its capabilities. The supported LLMs are getting better iteratively; however, both personal experience and academic research remain divided on future capabilities and gaps. I've had my own learnings exploring agent mode for the last few months, ever since it was released, and had the best possible outcomes with Claude Sonnet Models. After 18 years of building enterprise systems — ranging from integrating siloed COTS to making clouds talk, architecting IoT telemetry data ingestions and eCommerce platforms — I've seen plenty of "revolutionary" tools come and go. I've watched us transition from monoliths to microservices, from on-premises to cloud, from waterfall to agile. I've learned Java 1.4, .NET 9, and multiple flavors of JavaScript. Each transition revealed fundamental flaws in how we think about software construction. The integration of generative AI into software engineering is dominated by pattern matching and reasoning by analogy to past solutions. This approach is philosophically and practically flawed. There's active academic research that surfaces this problem, primarily the "Architectures of Error" framework that systematically differentiates the failure modes of human and AI-generated code. At the moment, I'm neither convinced by Copilot's capability nor have I found reasons to hate it. My focus in this article is more on the human-side errors that Agent Mode helps us recognize. Why This Isn't Just Another AI Tool Copilot's Agent Mode isn't just influencing how we build software — it's revealing why our current approaches are fundamentally flawed. The uncomfortable reality: Much of our architectural complexity exists because we've never had effective ways to encode and enforce design intent. We write architectural decision records that few read. We create coding standards that get violated under pressure. We design patterns that work beautifully when implemented correctly but fail catastrophically when they're not. Agent Mode surfaces this gap between architectural intent and implementation reality in ways we haven't experienced before. The Constraint Problem We've Been Avoiding Here's something I've learned from working on dozens of enterprise projects: Most architectural failures aren't technical failures — they're communication failures. We design a beautiful hexagonal architecture, document it thoroughly, and then watch as business pressure, tight deadlines, and human misunderstanding gradually erode it. By year three, what we have bears little resemblance to what we designed. C# // What we designed public class CustomerService : IDomainService<Customer> { // Clean separation, proper dependencies } // What we often end up with after several iterations public class CustomerService { // Direct database calls mixed with business logic // Scattered validation, unclear responsibilities // Works, but violates every architectural principle } Agent Mode forces us to confront this differently. AI can't read between the lines or make intuitive leaps. If our architectural constraints aren't explicit enough for an AI to follow, they probably aren't explicit enough for humans either. The Evolution from Documentation to Constraints In my experience, the most successful architectural approaches have moved progressively toward making correct usage easy and incorrect usage difficult. Early in my career, I relied heavily on documentation and code reviews. 
Later, I discovered the power of types, interfaces, and frameworks that guide developers toward correct implementations. Now, I'm exploring how to encode architectural knowledge directly into development tooling (and Copilot). C# / Evolution 1: Documentation-based (fragile) // "Please ensure all controllers inherit from BaseApiController" // Evolution 2: Framework-based (better) public abstract class BaseApiController : ControllerBase { // Common functionality, but still optional } // Evolution 3: Constraint-based (AI-compatible) public interface IApiEndpoint<TRequest, TResponse> where TRequest : IValidated where TResponse : IResult { // Impossible to create endpoints that bypass validation } The key insight: Each evolution makes architectural intent more explicit and mechanical. Agent Mode simply pushes us further along this path. We can work around most AI problems like the "AI 90/10 problem" arising from hallucinated APIs, non-existent libraries, context-window myopia, systematic pattern propagation, and model drift. LLM responses are probabilistic by nature, but they can be made deterministic by specifying constraints. Practical Implications Working with Agent Mode on real projects has revealed several practical patterns: 1. Requirement Specification Vague prompts produce (architecturally) inconsistent results. This isn't a limitation — it's feedback about the clarity of our thinking at any role, especially around SDLC, including the architect. We struggled with the same problems with the advent of the outsourcing era, too. SaaS inherits this problem through its extensibility and flexibility. Markdown [BAD] Inviting infinite possibilities: "Create a service for managing customers relationship" [GOOD] More effective: "Create a CustomerService implementing IDomainService<Customer> with validation using FluentValidation and error handling via Result<T> pattern" 2. The Composability Test If AI struggles to combine your architectural patterns correctly, human developers probably do too. They excel at pattern matching but fail at: Systematicity: Applying rules consistently across contextsProductivity: Scaling to larger, more complex compositionsSubstitutivity: Swapping components while maintaining correctnessLocalism: Understanding global vs. local scope implications This also helps to identify the architectural complexity. 3. The Constraint Discovery Process Working with AI has helped me identify implicit assumptions in existing architectures that weren't previously explicit. These discoveries often lead to better human-to-human communication as well. The Skills That Remain Valuable Based on my experience so far, certain architectural skills have become more important now: Domain understanding: AI can generate technically correct code, but understanding business context and constraints remains fundamentally human.Pattern recognition: Identifying when existing patterns apply and when new ones are needed becomes crucial for defining AI constraints.System thinking: Understanding emergent behaviors and system-level properties remains beyond current AI capabilities.Trade-off analysis: Evaluating architectural decisions based on business context, team capabilities, and long-term maintainability. What's Actually Changing The shift isn't as dramatic as "AI replacing architects or developers." 
It's more subtle: From implementation to intent: Less time writing boilerplate, more time clarifying what we actually want the system to do.From review to prevention: Instead of catching architectural violations in code review, we encode constraints that prevent them upfront.From documentation to automation: Architectural knowledge becomes executable rather than just descriptive. These changes feel significant to me, but they're evolutionary rather than revolutionary. Challenges I'm Still Working Through The learning curve: Developing fluency with constraint-driven development requires rethinking established habits.Team adoption: Not everyone is comfortable with AI-assisted development yet, and that's understandable.Tool maturity: Current AI tools are impressive but still have limitations around context understanding and complex reasoning.Validation strategies: Traditional testing approaches may not catch all AI-generated issues, so we're developing new validation patterns. A Measured Prediction Based on what I'm seeing, I expect a gradual shift over the next 3–5 years toward: More explicit architectural constraints in codebasesIncreased automation of pattern enforcementEnhanced focus on domain modeling and business rule specificationEvolution of code review practices to emphasize architectural composition over implementation details This won't happen overnight, and it won't replace fundamental architectural thinking. But it will change how we express and enforce architectural decisions. What I'm Experimenting With Currently, I'm exploring: 1. Machine-readable architecture definitions that can guide both AI and human developers. JSON { "architecture": { "layers": ["Api", "Application", "Domain", "Infrastructure"], "dependencies": { "Api": ["Application"], "Application": ["Domain"], "Infrastructure": ["Domain"] }, "patterns": { "cqrs": { "commands": "Application/Commands", "queries": "Application/Queries", "handlers": "required" } } } } 2. Architectural testing frameworks that validate system composition automatically. C# [Test] public void Architecture_Should_Enforce_Layer_Dependencies() { var result = Types.InCurrentDomain() .That().ResideInNamespace("Api") .ShouldNot().HaveDependencyOn("Infrastructure") .GetResult(); Assert.That(result.IsSuccessful, result.FailingTypes); } [Test] public void AI_Generated_Services_Should_Follow_Naming_Conventions() { var services = Types.InCurrentDomain() .That().AreClasses() .And().ImplementInterface(typeof(IDomainService)) .Should().HaveNameEndingWith("Service") .GetResult(); Assert.That(services.IsSuccessful); } 3. Constraint libraries that make common patterns easy to apply correctly, starting with domain primitives. C# ```csharp // Instead of generic controllers, define domain-specific primitives public abstract class DomainApiController<TEntity, TDto> : ControllerBase where TEntity : class, IEntity where TDto : class, IDto { // Constrained template that AI can safely compose } // Service registration primitive public static class ServiceCollectionExtensions { public static IServiceCollection AddDomainService<TService, TImplementation>( this IServiceCollection services) where TService : class where TImplementation : class, TService { // Validated, standard registration pattern return services.AddScoped<TService, TImplementation>(); } } 4. Documentation approaches that work well with AI-assisted development. An example is documenting architecture in the Arc42 template in Markdown, diagrams in Mermaid embedded in Markdown. 
Early results are promising, but there's still much to learn and explore. Looking Forward After 18 years in this field, I've learned to be both optimistic about new possibilities and realistic about the pace of change. VS Code Agent Mode represents an interesting step forward in human-AI collaboration for software development. It's not a silver bullet, but it is a useful tool that can help us build better systems — if we approach it thoughtfully. The architectures that thrive in an AI-assisted world won't necessarily be the most sophisticated ones. They'll be the ones that most clearly encode human insight in ways that both AI and human developers can understand and extend. That's a worthy goal, regardless of the tools we use to achieve it. Final Thoughts The most valuable architectural skill has always been clarity of thought about complex systems. AI tools like Agent Mode don't change this fundamental requirement — they just give us new ways to express and validate that clarity. As we navigate this transition, the architects who succeed will be those who remain focused on the essential questions: What are we trying to build? Why does it matter? How can we make success more likely than failure? The tools continue to evolve, but these questions remain constant. I'm curious about your experiences with AI-assisted development. What patterns are you seeing? What challenges are you facing? The best insights come from comparing experiences across different contexts and domains.
It is not uncommon for back-end software to start up with a configuration file. These are generally YAML or JSON files, which are loaded at startup and used to set up the initial configuration of the system. Values included here may affect business logic or infrastructure. Let us create a new service called DumplingSale (because I love dumplings, or as we call them, momos). This service manages the sales of dumplings. As an example, take a look at the YAML file used to start it. YAML # Production configuration for DumplingSale Java web application # prod.yaml redis: host: redis-prod.example.com port: 6379 password: ${REDIS_PASSWORD} timeout: 3000ms logging: level: com.example.dumplingsale: INFO org.springframework.web: WARN file: name: /var/log/dumpling-sale/application.log max-size: 100MB max-history: 30 # Section we might want to change dynamically dumpling-sale-config: max-orders-per-minute: 100 order-timeout-minutes: 30 enable-analytics: true payment-provider-config: active-provider: "your-payment-provider.com" transaction-timeout-seconds: 10 retry-attempts: 3 default-currency: "USD" api-version: "2024-05-26" enable-3ds-secure: true webhook-verification-enabled: true Let's say we wanted to change the dumpling-sale-config section dynamically. This could involve changing the order timeout to 15 minutes in times of increased sales, or reducing the maximum orders per minute if the kitchen is backed up. In a default static configuration system, the configuration has to be changed by making a code change and redeploying it to our servers, which would likely involve a restart. If code and config are separated, you could keep the code the same, but the servers would still need to be restarted. In a dynamic configuration system, however, we can change the config in one place and have it propagate to all our servers. Configuration Use Cases Allowlists and blocklists: Configs allow you to manage allowlists or blocklists and dynamically update them as your service runs.Performance tuning: You can change the number of threads, timeouts, workers, endpoints, etc., without having to restart your application.Flags: Think of any flags you pass to your application; you could change them dynamically. In this article, we will follow the above DumplingSale example and modify the payment provider and dumpling sale configs dynamically. Types of Config Delivery There are broadly two types of config delivery: push or pull. Push config delivery: The config system pushes the configuration out to all applications. Pull config delivery: The config system waits and responds with the configuration when polled by your application. In this example, we will be using a pull delivery system. Data Structures We will have a parent data structure for the dynamic configuration, with child data structures for each config type you wish to support. I will be using Java to explain the example here, but feel free to use a language of your choice. 
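For reference, the parent container that the child config types will plug into can be sketched as follows; the same AppConfigData class appears again in the full listing later in this article.

Java

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

// Parent structure holding one child object per dynamic config type.
// The no-arg constructor is what allows SnakeYAML to instantiate it while parsing.
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class AppConfigData {
    private DumplingSaleConfig dumplingSaleConfig;
    private PaymentProviderConfig paymentProviderConfig;
}

The child config types it references are defined next.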
Java import lombok.Data; import lombok.Builder; import lombok.NoArgsConstructor; import lombok.AllArgsConstructor; // Lombok annotations; the no-arg constructor is required so SnakeYAML can instantiate these classes @Data @Builder @NoArgsConstructor @AllArgsConstructor public class DumplingSaleConfig { private int maxOrdersPerMinute; private int orderTimeoutMinutes; private boolean enableAnalytics; } @Data @Builder @NoArgsConstructor @AllArgsConstructor public class PaymentProviderConfig { private String activeProvider; private int transactionTimeoutSeconds; private int retryAttempts; private String defaultCurrency; private String apiVersion; private boolean enable3dsSecure; private boolean webhookVerificationEnabled; } Creating the Cache Similarly, you will need to create a cache to fetch these configs. We are using the Guava Cache. Java public class AppConfigManager { private final LoadingCache<String, AppConfigData> appConfigCache; private final Yaml yaml; private final AWSAppConfig awsAppConfig; public AppConfigManager(AWSAppConfig awsAppConfig) { this.yaml = new Yaml(); this.awsAppConfig = awsAppConfig; this.appConfigCache = CacheBuilder.newBuilder() .refreshAfterWrite(5, TimeUnit.MINUTES) .build(new CacheLoader<String, AppConfigData>() { @Override public AppConfigData load(String key) throws Exception { return fetchConfigFromAppConfig(); } }); } Loading Contents for the Cache As you can see above, we created a cache that returns data of the type AppConfigData. However, the cache needs to fetch this data from somewhere as well, right? So, we need to program a data source that loads the dynamic configuration data. Here are your options: Remote file: A remote file pulled by the servers. Could be stored in AWS S3, GCS, or any other object or file storage system you may have access to. Pros: Fast and easy deploymentYour object/file system may offer version history and audit logs.Cons: Not great tracking of versions, comparison of config across versions.A remote database Pros: All databases come with a great set of libraries and tools to integrate easily.Cons: Not great tracking of versions, comparison of config across versions.Depending on the database, unlikely to have auditing.Unless a custom version-based solution is created, no versioning.[Recommended] Cloud config management system: Such as AWS AppConfig, Azure App Configuration, or GCP Firebase Remote Config. Pros: Fast, standardized rollout mechanisms: Can use both push and pull methods with slow/fast rollout across your service.Deploy changes across a variety of targets together, including compute, containers, docs, mobile applications, and serverless applications.Cons: You need to read the rest of this article to know how to use these. Creating Configuration Using AWS AppConfig Let's use AWS AppConfig for this example. All of the cloud solutions are capable services, and we only need one of them to learn how to create this config. The above is a diagram from AWS AppConfig, which describes the various steps you can take to ensure config deployments are safe and stable. 1. Choose config type: AWS allows you to use feature flags or freeform configs. We will choose freeform configs in this example to simplify config creation. 2. Choose a config name: my-config (I am keeping it simple.) 3. Choose config source (note that the keys are camelCase so they match the Java property names and SnakeYAML can map them onto the data classes): YAML dumplingSaleConfig: maxOrdersPerMinute: 100 orderTimeoutMinutes: 30 enableAnalytics: true paymentProviderConfig: activeProvider: "your-payment-provider" transactionTimeoutSeconds: 10 retryAttempts: 3 defaultCurrency: "USD" apiVersion: "2024-05-26" enable3dsSecure: true Simply save this to an application (to keep it simple here: my-application). 
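Before saving and deploying, it can be worth sanity-checking locally that the hosted YAML actually maps onto the data classes, since SnakeYAML matches YAML keys to Java property names. A minimal sketch of such a check, assuming SnakeYAML is on the classpath together with the Lombok-annotated AppConfigData, DumplingSaleConfig, and PaymentProviderConfig classes from this article:

Java

import org.yaml.snakeyaml.Yaml;

public class ConfigMappingCheck {
    public static void main(String[] args) {
        // Same content as the hosted freeform configuration
        String hostedYaml =
            "dumplingSaleConfig:\n" +
            "  maxOrdersPerMinute: 100\n" +
            "  orderTimeoutMinutes: 30\n" +
            "  enableAnalytics: true\n" +
            "paymentProviderConfig:\n" +
            "  activeProvider: your-payment-provider\n" +
            "  transactionTimeoutSeconds: 10\n" +
            "  retryAttempts: 3\n" +
            "  defaultCurrency: USD\n" +
            "  apiVersion: \"2024-05-26\"\n" +
            "  enable3dsSecure: true\n";

        // loadAs maps the YAML keys onto the JavaBean properties of AppConfigData
        AppConfigData parsed = new Yaml().loadAs(hostedYaml, AppConfigData.class);
        System.out.println("Max orders per minute: "
            + parsed.getDumplingSaleConfig().getMaxOrdersPerMinute());
    }
}

If a key and a property name drift apart, the parse fails here rather than later inside the cache loader.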
Now, let's save and deploy: As you can see above, I made the following choices: Environment: I created an environment called prod. Feel free to create as many as you need.Hosted config version: Right now, we will have only one version. In the future, you can choose to change the ‘latest’ version and deploy whichever version you would like.Deployment strategy: This is crucial. For simplicity, I have chosen ‘All at once.’ However, that is not always the best strategy, as you may want to roll out slowly, observe how your service is performing, and roll back if necessary. You can read about other strategies here. Once your deployment is complete, the configuration will be available for your application to fetch. Fetching Configuration Using AWS AppConfig Java private AppConfigData fetchConfigFromAppConfig() throws Exception { // 1. Start a configuration session to get a token StartConfigurationSessionRequest sessionRequest = new StartConfigurationSessionRequest() .withApplicationIdentifier("my-application") .withConfigurationProfileIdentifier("my-config") .withEnvironmentIdentifier("prod") .withRequiredMinimumPollIntervalInSeconds(30); // Recommended to set a minimum poll interval StartConfigurationSessionResult sessionResult = awsAppConfig.startConfigurationSession(sessionRequest); this.configurationToken = sessionResult.getInitialConfigurationToken(); // 2. Get the latest configuration using the token GetLatestConfigurationRequest configRequest = new GetLatestConfigurationRequest() .withConfigurationToken(configurationToken); GetLatestConfigurationResult configResult = awsAppConfig.getLatestConfiguration(configRequest); ByteBuffer configurationContent = configResult.getConfiguration(); if (configurationContent == null) { throw new IOException("No configuration content received from AWS AppConfig."); } // 3. Decode the ByteBuffer into a String String fatYaml = StandardCharsets.UTF_8.decode(configurationContent).toString(); // 4. 
Parse YAML into the AppConfigData class return yaml.loadAs(fatYaml, AppConfigData.class); } Bringing It All Together Java import lombok.AllArgsConstructor; import lombok.Builder; import lombok.Data; import lombok.NoArgsConstructor; import com.google.common.cache.CacheBuilder; import com.google.common.cache.CacheLoader; import com.google.common.cache.LoadingCache; import org.yaml.snakeyaml.Yaml; import com.amazonaws.services.appconfig.AWSAppConfig; import com.amazonaws.services.appconfig.model.GetLatestConfigurationRequest; import com.amazonaws.services.appconfig.model.StartConfigurationSessionRequest; import com.amazonaws.services.appconfig.model.StartConfigurationSessionResult; import com.amazonaws.services.appconfig.model.GetLatestConfigurationResult; import java.nio.ByteBuffer; import java.util.concurrent.TimeUnit; import java.io.IOException; import java.nio.charset.StandardCharsets; @Data @Builder @NoArgsConstructor @AllArgsConstructor public class PaymentProviderConfig { private String activeProvider; private int transactionTimeoutSeconds; private int retryAttempts; private String defaultCurrency; private String apiVersion; private boolean enable3dsSecure; private boolean webhookVerificationEnabled; } @Data @Builder @NoArgsConstructor @AllArgsConstructor public class DumplingSaleConfig { private int maxOrdersPerMinute; private int orderTimeoutMinutes; private boolean enableAnalytics; } @Data @Builder @NoArgsConstructor @AllArgsConstructor public class AppConfigData { private DumplingSaleConfig dumplingSaleConfig; private PaymentProviderConfig paymentProviderConfig; } public class AppConfigManager { private final LoadingCache<String, AppConfigData> appConfigCache; private final Yaml yaml; private final AWSAppConfig awsAppConfig; private String configurationToken; // To store the session token for subsequent fetches public AppConfigManager(AWSAppConfig awsAppConfig) { this.yaml = new Yaml(); this.awsAppConfig = awsAppConfig; this.appConfigCache = CacheBuilder.newBuilder() .refreshAfterWrite(5, TimeUnit.MINUTES) .build(new CacheLoader<String, AppConfigData>() { @Override public AppConfigData load(String key) throws Exception { return fetchConfigFromAppConfig(); } }); } private AppConfigData fetchConfigFromAppConfig() throws Exception { // 1. Start a configuration session to get a token StartConfigurationSessionRequest sessionRequest = new StartConfigurationSessionRequest() .withApplicationIdentifier("my-application") .withConfigurationProfileIdentifier("my-config") .withEnvironmentIdentifier("prod") .withRequiredMinimumPollIntervalInSeconds(30); // Recommended to set a minimum poll interval StartConfigurationSessionResult sessionResult = awsAppConfig.startConfigurationSession(sessionRequest); this.configurationToken = sessionResult.getInitialConfigurationToken(); // 2. Get the latest configuration using the token GetLatestConfigurationRequest configRequest = new GetLatestConfigurationRequest() .withConfigurationToken(configurationToken); GetLatestConfigurationResult configResult = awsAppConfig.getLatestConfiguration(configRequest); ByteBuffer configurationContent = configResult.getConfiguration(); if (configurationContent == null) { throw new IOException("No configuration content received from AWS AppConfig."); } // 3. Decode the ByteBuffer into a String String fatYaml = StandardCharsets.UTF_8.decode(configurationContent).toString(); // 4. 
Parse YAML into the AppConfigData class return yaml.loadAs(fatYaml, AppConfigData.class); } public AppConfigData getAppConfig() { try { return appConfigCache.get("appConfig"); } catch (Exception e) { System.err.println("Error loading app config: " + e.getMessage()); return null; } } } Using the Config Now, let's say you need to check whether the orders-per-minute limit has been breached and make a decision based on it. You can simply use the config manager to fetch the current limit. Java import lombok.AllArgsConstructor; @AllArgsConstructor public class OrderRateLimiter { private final AppConfigManager appConfigManager; public boolean isOrderLimitExceeded(int currentOrdersThisMinute) { AppConfigData appConfig = appConfigManager.getAppConfig(); if (appConfig == null || appConfig.getDumplingSaleConfig() == null) { System.err.println("DumplingSaleConfig not available from AppConfigManager."); return false; } DumplingSaleConfig config = appConfig.getDumplingSaleConfig(); return currentOrdersThisMinute > config.getMaxOrdersPerMinute(); } } As you can see above, you can simply fetch the config you want, without worrying about where it is coming from (the cache or AWS AppConfig), and make decisions based on it. Key Takeaways Using the LoadingCache allows for: Faster retrieval.Thread safety, since the cache handles its own refresh logic, and any number of calls to the cache can be easily handled.Hands-off management of value refresh.Low cost even with very high retrieval rates: as an example, if you have 100 servers running the application, each needing the config 500 times per second, you will still be billed for only 100 * 12 = 1,200 requests per hour, since each server refreshes its cache every 5 minutes, as opposed to 100 * 500 * 3600 = 180 million requests per hour without a cache.Low network utilization, since requests are served locally.Higher availability in case the config service is down, since the last successfully loaded values continue to be served. While using cloud-based config management systems allows for: Easier management of the config lifecycle.Better rollout strategies.Centralized management. Now, you are ready to create your own distributed cloud-based dynamic configurations.
Keeping track of AWS spend is very important, especially since it's so easy to create resources: you might forget to turn off an EC2 instance or container you started, or forget to remove the CDK stack for a specific experiment. Costs can creep up fast if you don't put guardrails in place. Recently, I had to set up budgets across multiple AWS accounts for my team. Along the way, I learned a few gotchas (especially around SNS and KMS policies) that weren't immediately clear to me as I started out writing AWS CDK code. In this post, we'll go through how to: Create AWS Budgets with AWS CDKSend notifications via email and SNSHandle cases like encrypted topics and configuring resource policies If you're setting up AWS Budgets for the first time, I hope this post will save you some trial and error. What Are AWS Budgets? AWS Budgets is part of AWS Billing and Cost Management. It lets you set guardrails for spend and usage limits. You can define a budget around cost, usage, or even commitment plans (like Reserved Instances and Savings Plans) and trigger alerts when you cross a threshold. You can think of Budgets as your planned spend tracker. Budgets are great for: Alerting when costs hit predefined thresholds (e.g., 80% of your budgeted spend)Driving team accountability by tying alerts to product or account ownersEnforcing a cap on monthly spend by triggering an action, such as shutting down compute (EC2), if you go over budget (be careful with this) Keep in mind that budgets and their notifications are not instant. AWS billing data is processed multiple times a day, so you might only trigger your budget alert a couple of hours after you've passed your threshold. This is clearly stated in the AWS documentation as: AWS billing data, which Budgets uses to monitor resources, is updated at least once per day. Keep in mind that budget information and associated alerts are updated and sent according to this data refresh cadence. Defining Budgets With AWS CDK You can create different kinds of budgets, depending on your requirements. Some examples are: Fixed budgets: Set one amount to monitor every budget period.Planned budgets: Set different amounts to monitor each budget period.Auto-adjusting budgets: Set a budget amount to be adjusted automatically based on the spending pattern over a time range that you specify. We'll start with a simple example of how you can create a budget in the CDK. We'll go for a fixed budget of $100. The AWS CDK currently only has Level 1 constructs available for budgets, which means that the classes in the CDK are a one-to-one mapping to the CloudFormation resources. Because of this, you will have to explicitly define all required properties (constructs, IAM policies, resource policies, etc.), which otherwise could be taken care of by a CDK L2 construct. It also means your CDK code will be a bit more verbose. We'll start by using the CfnBudget construct. TypeScript new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', { budget: { budgetType: 'COST', budgetLimit: {amount: 100, unit: 'USD'}, budgetName: 'Monthly Costs Budget', timeUnit: 'MONTHLY' } }); In the above example, we've created a budget with a limit of $100 per month. A budget alone isn't very useful. You'd still have to check the AWS console manually to see what your spend is compared to your budget. What we really want is to get notified when our actual spend reaches our budget or the forecasted spend crosses our threshold, so let's add a notification and a subscriber. 
TypeScript new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', { budget: { budgetType: 'COST', budgetLimit: {amount: 100, unit: 'USD'}, budgetName: 'Monthly Costs Budget', timeUnit: 'MONTHLY' }, notificationsWithSubscribers: [{ notification: { comparisonOperator: 'GREATER_THAN', notificationType: 'FORECASTED', threshold: 100, thresholdType: 'PERCENTAGE' }, subscribers: [{ subscriptionType: 'EMAIL', address: '<your-email-address>' }] }] }); Based on the notification settings, interested parties are notified when the spend is forecasted to exceed 100% of our defined budget limit. You can put a notification on forecasted or actual percentages. When that happens, an email is sent to the designated email address. Subscribers, at the time of writing, can be either email recipients or a Simple Notification Service (SNS) topic. In the above code example, we use email subscribers for which you can add up to 10 recipients. Depending on your team or organization, it might be beneficial to switch to using an SNS topic. The advantage of using an SNS topic over a set of email subscribers is that you can add different kinds of subscribers (email, chat, custom lambda functions) to your SNS topic. With an SNS topic, you have a single place to configure subscribers, and if you change your mind, you can do so in one place instead of updating all budgets. Using an SNS Topic also allows you to push budget notifications to, for instance, a chat client like MS Teams or Slack. In this case, we will make use of SNS in combination with email subscribers. Let’s start by defining an SNS topic with the AWS CDK. TypeScript // Create a topic for email notifications let topic = new Topic(this, 'budget-notifications-topic', { topicName: 'budget-notifications-topic' }); Now, let’s add an email subscriber, as this is the simplest way to receive budget notifications. TypeScript // Add email subscription topic.addSubscription( new EmailSubscription("your-email-address")); This looks pretty straightforward, and you might think you’re done, but there is one important step to take next, which I initially forgot. The AWS budgets service will need to be granted permissions to publish messages to the topic. To be able to do this, we will need to add a resource policy to the topic that allows the budgets service to call the SNS:Publish action for our topic. TypeScript // Add resource policy to allow the budgets service to publish to the SNS topic topic.addToResourcePolicy(new PolicyStatement({ actions:["SNS:Publish"], effect: Effect.ALLOW, principals: [new ServicePrincipal("budgets.amazonaws.com")], resources: [topic.topicArn], conditions: { ArnEquals: { 'aws:SourceArn': `arn:aws:budgets::${Stack.of(this).account}:*`, }, StringEquals: { 'aws:SourceAccount': Stack.of(this).account, }, }, })) Now, let’s assign the SNS topic as a subscriber in our CDK code. TypeScript // Define a fixed budget with SNS as subscriber new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', { budget: { budgetType: 'COST', budgetLimit: {amount: 100, unit: 'USD'}, budgetName: 'Monthly Costs Budget', timeUnit: 'MONTHLY' }, notificationsWithSubscribers: [{ notification: { comparisonOperator: 'GREATER_THAN', notificationType: 'FORECASTED', threshold: 100, thresholdType: 'PERCENTAGE' }, subscribers: [{ subscriptionType: 'SNS', address: topic.topicArn }] }] }); Working With Encrypted Topics If you have an SNS topic with encryption enabled (via KMS), you will need to make sure that the corresponding service has access to the KMS key. 
If you don’t, you will not get any messages, and as far as I could tell, you will see no errors (at least I could find none in CloudTrail). I actually wasted a couple of hours trying to figure this part out. I should have read the documentation, as it is explicitly stated to do so. I guess I should start with the docs instead of diving right into the AWS CDK code. TypeScript // Create KMS key used for encryption let key = new Key(this,'sns-kms-key', { alias: 'sns-kms-key', enabled: true, description: 'Key used for SNS topic encryption' }); // Create topic and assign the KMS key let topic = new Topic(this, 'budget-notifications-topic', { topicName: 'budget-notifications-topic', masterKey: key }); Now, let’s add the resource policy to the key and try to trim down the permissions as much as possible. TypeScript // Allow access from budgets service key.addToResourcePolicy(new PolicyStatement({ effect: Effect.ALLOW, actions: ["kms:GenerateDataKey*","kms:Decrypt"], principals: [new ServicePrincipal("budgets.amazonaws.com")], resources: ["*"], conditions: { StringEquals: { 'aws:SourceAccount': Stack.of(this).account, }, ArnLike: { "aws:SourceArn": "arn:aws:budgets::" + Stack.of(this).account +":*" } } })); Putting It All Together If you’ve configured everything correctly and deployed your stack to your target account, you should be good to go. Once you cross your threshold, you should be notified by email that your budget is exceeding one of your thresholds (depending on the threshold set). Summary In this post, we explored how to create AWS Budgets with AWS CDK and send notifications through email or SNS. Along the way, we covered some important topics like: Budgets alone aren’t useful until you add notifications.SNS topics need a resource policy so the Budgets service can publish.Encrypted topics require KMS permissions for the Budgets service. With these pieces in place, you’ll have a setup that alerts your team when costs exceed thresholds via email, chat, or custom integrations. A fully working CDK application with the code mentioned in this blog post can be found in the following GitHub repo.
Building on what we started in an earlier article, here we're going to learn how to extend our platform and create a platform abstraction for provisioning an AWS EKS cluster. EKS is AWS's managed Kubernetes offering. Quick Refresher Crossplane is a Kubernetes CRD-based add-on that abstracts cloud implementations and lets us manage infrastructure as code. Prerequisites Set up Kubernetes (for example, the cluster bundled with Docker Desktop).Follow the Crossplane installation based on the previous article.Follow the provider configuration based on the previous article.Apply all the network YAMLs from the previous article (including the updated network composition discussed later). This will create the necessary network resources for the EKS cluster. Some Plumbing When creating an EKS cluster, AWS needs to: Spin up the control plane (managed by AWS)Attach security groups Configure networking (ENIs, etc.)Access the VPC and subnetsManage API endpointsInteract with other AWS services (e.g., CloudWatch for logging, Route53) To do this securely, AWS requires an IAM role that it can assume. We create that role here and reference it during cluster creation; details are provided below. Without this role, you'll get errors like "access denied" when creating the cluster. Steps to Create the AWS IAM Role Log in to the AWS Console and go to the IAM console.In the left sidebar, click Roles.Click Create Role.Choose AWS service as the trusted entity type.Select the EKS use case, and choose EKS Cluster.Attach the following policies: AmazonEKSClusterPolicyAmazonEKSServicePolicyAmazonEC2FullAccessAmazonEKSWorkerNodePolicyAmazonEC2ContainerRegistryReadOnlyAmazonEKS_CNI_PolicyProvide the name eks-crossplane-cluster and optionally add tags. Since we'll also create NodeGroups, which require additional permissions, for simplicity I'm granting the Crossplane user (created in the previous article) permission to PassRole the Crossplane cluster role. This permission allows the user to tell AWS services (EKS) to assume the Crossplane cluster role on its behalf. Basically, this user can say, "Hey, EKS service, create a node group and use this role when doing it." To accomplish this, add the following inline policy to the Crossplane user: JSON { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": "iam:PassRole", "Resource": "arn:aws:iam::914797696655:role/eks-crossplane-cluster" } ] } Note: Typically, to follow the principle of least privilege, you should separate roles with distinct policies: Control plane role with EKS admin permissionsNode role with permissions for node group creation. In the previous article, I had created only one subnet in the network composition, but the EKS control plane requires at least two AZs, with one subnet per AZ. You should modify the network composition from the previous article to add another subnet. To do so, just add the following to the network composition YAML, and don't forget to apply the composition and claim to re-create the network. 
YAML - name: subnet-b base: apiVersion: ec2.aws.upbound.io/v1beta1 kind: Subnet spec: forProvider: cidrBlock: 10.0.2.0/24 availabilityZone: us-east-1b mapPublicIpOnLaunch: true region: us-east-1 providerConfigRef: name: default patches: - fromFieldPath: status.vpcId toFieldPath: spec.forProvider.vpcId type: FromCompositeFieldPath - fromFieldPath: spec.claimRef.name toFieldPath: spec.forProvider.tags.Name type: FromCompositeFieldPath transforms: - type: string string: fmt: "%s-subnet-b" - fromFieldPath: status.atProvider.id toFieldPath: status.subnetIds[1] type: ToCompositeFieldPath We will also need a provider to support EKS resource creation, to create the necessary provider, save the following content into .yaml file. YAML apiVersion: pkg.crossplane.io/v1 kind: Provider metadata: name: provider-aws spec: package: xpkg.upbound.io/crossplane-contrib/provider-aws:v0.54.2 controllerConfigRef: name: default And apply using: YAML kubectl apply -f <your-file-name>.yaml Crossplane Composite Resource Definition (XRD) Below, we’re going to build a Composite Resource Definition for the EKS cluster. Before diving in, one thing to note: If you’ve already created the network resources using the previous article, you may have noticed that the network composition includes a field that places the subnet ID into the composition resource’s status, specifically under status.subnetIds[0]. This value comes from the cloud's Subnet resource and is needed by other XCluster compositions. By placing it in the status field, the network composition makes it possible for other Crossplane compositions to reference and use it. Similar to what we did for network creation in the previous article, we’re going to create a Crossplane XRD, a Crossplane Composition, and finally a Claim that will result in the creation of an EKS cluster. At the end, I’ve included a table that serves as an analogy to help illustrate the relationship between the Composite Resource Definition (XRD), Composite Resource (XR), Composition, and Claim. To create an EKS XRD, save the following content into .yaml file: YAML apiVersion: apiextensions.crossplane.io/v1 kind: CompositeResourceDefinition metadata: name: xclusters.aws.platformref.crossplane.io spec: group: aws.platformref.crossplane.io names: kind: XCluster plural: xclusters claimNames: kind: Cluster plural: clusters versions: - name: v1alpha1 served: true referenceable: true schema: openAPIV3Schema: type: object required: - spec properties: spec: type: object required: - parameters properties: parameters: type: object required: - region - roleArn - networkRef properties: region: type: string description: AWS region to deploy the EKS cluster in. roleArn: type: string description: IAM role ARN for the EKS control plane. networkRef: type: object description: Reference to a pre-created XNetwork. required: - name properties: name: type: string status: type: object properties: network: type: object required: - subnetIds properties: subnetIds: type: array items: type: string And apply using: YAML kubectl apply -f <your-file-name>.yaml Crossplane Composition Composition is the implementation; it tells Crossplane how to build all the underlying resources (Control Plane, NodeGroup). 
To create an EKS composition, save the below content into a .yaml file: YAML apiVersion: apiextensions.crossplane.io/v1 kind: Composition metadata: name: cluster.aws.platformref.crossplane.io spec: compositeTypeRef: apiVersion: aws.platformref.crossplane.io/v1alpha1 kind: XCluster resources: - name: network base: apiVersion: aws.platformref.crossplane.io/v1alpha1 kind: XNetwork patches: - type: FromCompositeFieldPath fromFieldPath: spec.parameters.networkRef.name toFieldPath: metadata.name - type: ToCompositeFieldPath fromFieldPath: status.subnetIds toFieldPath: status.network.subnetIds - type: ToCompositeFieldPath fromFieldPath: status.subnetIds[0] toFieldPath: status.network.subnetIds[0] readinessChecks: - type: None - name: eks base: apiVersion: eks.aws.crossplane.io/v1beta1 kind: Cluster spec: forProvider: region: us-east-1 roleArn: "" resourcesVpcConfig: subnetIds: [] endpointPrivateAccess: true endpointPublicAccess: true providerConfigRef: name: default patches: - type: FromCompositeFieldPath fromFieldPath: spec.parameters.region toFieldPath: spec.forProvider.region - type: FromCompositeFieldPath fromFieldPath: spec.parameters.roleArn toFieldPath: spec.forProvider.roleArn - type: FromCompositeFieldPath fromFieldPath: status.network.subnetIds toFieldPath: spec.forProvider.resourcesVpcConfig.subnetIds - name: nodegroup base: apiVersion: eks.aws.crossplane.io/v1alpha1 kind: NodeGroup spec: forProvider: region: us-east-1 clusterNameSelector: matchControllerRef: true nodeRole: "" subnets: [] scalingConfig: desiredSize: 2 maxSize: 3 minSize: 1 instanceTypes: - t3.medium amiType: AL2_x86_64 diskSize: 20 providerConfigRef: name: default patches: - type: FromCompositeFieldPath fromFieldPath: spec.parameters.region toFieldPath: spec.forProvider.region - type: FromCompositeFieldPath fromFieldPath: spec.parameters.roleArn toFieldPath: spec.forProvider.nodeRole - type: FromCompositeFieldPath fromFieldPath: status.network.subnetIds toFieldPath: spec.forProvider.subnets And apply using: YAML kubectl apply -f <your-file-name>.yaml Claim I'm taking the liberty to explain the claim in more detail here. First, it's important to note that a claim is an entirely optional entity in Crossplane. It is essentially a Kubernetes Custom Resource Definition (CRD) that the platform team can expose to application developers as a self-service interface for requesting infrastructure, such as an EKS cluster. Think of it as an API payload: a lightweight, developer-friendly abstraction layer. In the earlier CompositeResourceDefinition (XRD), we created the Kind XCluster. But by using a claim, application developers can interact with a much simpler and more intuitive CRD like Cluster instead of XCluster. For simplicity, I have referenced the XNetwork composition name directly instead of the Network claim resource name. Crossplane creates the XNetwork resource and appends random characters to the claim name when naming it. As an additional step, you'll need to retrieve the actual XNetwork name from the Kubernetes API and use it here. While there are ways to automate this process, I’m keeping it simple here, let me know via comments if there are interest and I write more about how to automate that. To create a claim, save the content below into a .yaml file. Please note the roleArn being referenced in this, that is the role I had mentioned earlier, AWS uses it to create other resources. 
YAML apiVersion: aws.platformref.crossplane.io/v1alpha1 kind: Cluster metadata: name: demo-cluster namespace: default spec: parameters: region: us-east-1 roleArn: arn:aws:iam::914797696655:role/eks-crossplane-cluster networkRef: name: crossplane-demo-network-jpv49 # <important> this is how the EKS composition refers to the network created earlier; use the actual XNetwork name, including its random suffix (here "jpv49"), not the claim name And apply using: YAML kubectl apply -f <your-file-name>.yaml After this, you should see an EKS cluster in your AWS console; make sure you are looking in the correct region. If there are any issues, look for error logs in the composite and managed resources. You can view them using: YAML # to get XCluster details; look for reconciliation errors or messages, and you will also find references to the managed resources k get XCluster demo-cluster -o yaml # to check the status of a managed resource, for example: k get Cluster.eks.aws.crossplane.io As I mentioned before, below is a table where I attempt to provide another analogy for the various components used in Crossplane: Component / Analogy XRD: The interface or blueprint for a product; defines what knobs users can turn. XR (XCluster): A specific product instance with user-provided values. Composition: The function that implements all the details of the product. Claim: A customer-friendly interface for ordering the product, or an API payload. Patch I also want to explain an important concept we've used in our Composition: patching. You may have noticed the patches field in the .yaml blocks. In Crossplane, a composite resource is the high-level abstraction we define — in our case, that's XCluster. Managed resources are the actual cloud resources Crossplane provisions on our behalf — for example, the AWS EKS Cluster and NodeGroup. A patch in a Crossplane Composition is a way to copy or transform data from/to the composite resource (XCluster) to/from the managed resources (Cluster, NodeGroup, etc.). Patching allows us to map values like region, roleArn, and names from the high-level composite to the actual underlying infrastructure — ensuring that developer inputs (or platform-defined parameters) flow all the way down to the cloud resources. Conclusion Using Crossplane, you can build powerful abstractions that shield developers from the complexities of infrastructure, allowing them to focus on writing application code. These abstractions can also be made cloud-agnostic, enabling benefits like portability, cost optimization, resilience and redundancy, and greater standardization.
Ideas of creating a distributed computing cluster (DCC) for database management systems (DBMS) have been striking me for quite a long time. If simplified, the DCC software makes it possible to combine many servers into one super server (cluster), performing an even balancing of all queries between individual servers. In this case, everything will appear for the application running on the DCC as if it was running with one server and one database (DB). It will not be dispersed databases on distributed servers, but work as one virtual one. All network protocols, replication exchanges, and proxy redirections will be concealed inside the DCC. At the same time, all resources of distributed servers, in particular RAM and CPU time, will be utilized evenly and in an efficient fashion. For example, in a cloud data processing center (DPC), it is possible to take one physical super server and divide it into a number of virtual DBMS servers. But the reverse procedure was not possible until now, i.e., it is not possible to take a number of physical servers and merge them into a single virtual DBMS super server. In some specified sense, DCC is a technology that makes it possible to merge physical servers into one virtual DBMS super server. I will take the liberty to make another comparison: DCC is just the same as the coherent nonuniform memory access (NUMA) technology except that it is used to merge SQL servers. But unlike NUMA, in DCC, software handles the synchronization (coherence) of the data and partly of the RAM, not the controller. For the sake of clarity, below is a diagram of the well-known connection of the client application to the DBMS server, and the DCC diagram immediately below that. Both diagrams are simplified, just for easy understanding. The idea behind the cluster is a decentralized model. In the figure above there is only one proxy server, but in general there can be more than one. This solution will result in the possibility to increase the DBMS scalability by a substantial margin relative to a typical single-server solution with the most powerful server at the moment. No such solution currently exists, or, at least, no one in my vast professional community is aware of such a solution. After five years of research, I worked out the logical architecture and interaction protocols in detail and, with the assistance of a handful of development personnel, created a working prototype that is undergoing load tests on a popular 1C8.x IT system under the management of PostgreSQL DBMS. MS SQL or Oracle may be the DBMS. Fundamentally, the choice of DBMS does not affect the ideas I will bring up. With this article, I am starting a series of articles on DCC, where I will gradually disclose one or another issue and offer solutions to them. I came up with this structure after speaking at one of the IT conferences, where the topic was found to be quite difficult to understand. The first article will be introductory, I will hit the peaks, skip the valleys (emphasizing non-obvious assertions), and outline what's to come in the following publications. For Which IT Systems DCC Is Effective The idea of DCC is to create a special software shell, which will perform all write requests simultaneously and synchronously on all servers, and read requests will be performed on a specific node (server) with user binding. 
In other words, users will be evenly distributed among the servers of the cluster: read requests will be executed locally on the server to which the user is bound, and change requests will be executed synchronously on all servers at once (no logic violations will occur as a result). Therefore, provided that read requests significantly exceed write requests in terms of load, we get a roughly uniform load distribution among DCC servers (nodes). Let's first review this question: is it correct to claim that the load from read requests far outweighs the load from write requests? To answer it, it is helpful to look back a bit at the history of the SQL language: what the goal was and what eventually came to fruition. A Quick Dive Into SQL SQL was originally planned as a language that could be used without programming or math skills. Here's an excerpt from Wikipedia: Codd used symbolic notation with mathematical notation of operations, but Chamberlin and Boyce wanted to design the language so that it could be used by any user, even those without programming skills or knowledge of math.[5] For now, it can be argued that programming skills for SQL are still needed, but definitely minimal. Most programmers have studied only some basics of query optimization and have never heard of SQL Tuning by Dan Tow. A lot of the logic for optimizing queries is concealed inside the DBMS. In the past, for example, MS SQL had a limit of 256 table joins; now, in modern IT systems, it is common to have thousands of joins in a query. Dynamic SQL, where a query is constructed dynamically, is used widely and sometimes without much thought. The truth is that there is no mathematically exact model for building the optimal execution plan for a complex query. This problem is somewhat similar to the traveling salesman problem, which is believed to have no efficient exact solution. The conclusion is as follows: SQL queries have evolutionarily proven their effectiveness, and almost all reporting is generated with SQL queries; the same cannot be said for business and transactional logic. Many SQL dialects turned out to be inconvenient for programming complex transactional logic: SQL does not support object-oriented programming and has very clumsy control-flow constructs. Therefore, it is safe to say that programming has split into two parts: writing a variety of SQL queries for reports and for getting data to the client or application server, and implementing the rest of the logic in the application's own language (no matter whether it is a two-tier or three-tier architecture). In terms of load on the DBMS, this looks like hefty SQL constructs for reading and then lots of small ones for writing. Let us now consider how the load of read and write queries on the DBMS server is distributed over time. First, we need to define what load means and how it can be measured. By load we will mean (in order of priority): CPU (processor load), utilized RAM, and load on the disk subsystem. CPU will be the main resource in terms of load. Let's consider an abstract OLTP system and divide all SQL calls from a set of parallel threads into two categories, Read and Write. Next, using performance monitoring tools, plot an integral value such as CPU load on a chart. If the value is averaged over at least 30 seconds, we see that the “Read” curve is tens or even hundreds of times higher than the “Write” curve. 
This is because, per unit of time, more users can execute reports or macro transactions that use hefty SQL constructs for reading. Sure, there may be periods when the system regularly loads data from replication queues and external systems, or when period-end routines and backup procedures run. But based on long-term statistics for an overwhelming number of IT systems, the load from SQL constructs on Read exceeds the load on Write roughly tenfold. Certainly, there may be exceptions, for example, billing systems where the fact of a change is recorded without any complex logic or reporting, but it is easy to check this with special-purpose software and understand how effective DCC will be for a given IT system. Strategic Area of Application Currently, DCC will be useful and perhaps vital for major companies with extensive information flows and a strong analytical component. For example, major banks may benefit. With the help of relatively small servers, it is possible to compose a DCC that will be far ahead of all existing supercomputers in terms of power. Needless to say, it won't be all pros. The downsides are the added complexity of administering a distributed system and a definite transaction slowdown. Unfortunately, it is true that the network protocols and logic that DCC utilizes cause transactions to slow down. Currently, the target parameter is a transaction slowdown of no more than 15% in terms of time. But, again, in exchange the system becomes much more scalable, peak loads are handled without problems, and on average the transaction time will be lower than in the single-server case. Therefore, if a system faces no problems with peak loads and none are expected strategically, DCC will not be effective. In the future, after administrative processes are automated and optimized, DCC will probably also be effective for medium-sized companies, because it will be possible to build a powerful cluster even using PCs with SSD (fast, unreliable, and cheap) disks. Its distributed structure will make it possible to easily disconnect a failed PC and connect a new one on the fly. DCC's transaction control system will prevent data from being incorrectly recorded. Also, geopolitics cannot be ignored. For example, if access to powerful servers is restricted, DCC will make it possible to build a powerful cluster using servers produced by domestic manufacturers. Why Transactional Replication Cannot Be Used for DCC This section requires a detailed description, and I will cover it in a separate article. Here I will point out only these problems: Many application developers, when using a DBMS, do not even think about what data access conflicts the system resolves within the engine. For example, you cannot simply set up transactional replication, achieve data synchronization across multiple servers, and call the result a DBMS cluster. Such a solution will not resolve the conflict of simultaneous access to the same record (writer-writer). Such collisions will certainly lead to a violation of the logic of the system's behavior. Existing transactional replication protocols are also costly, and such a system will be very much inferior to the single-server option. In total, transactional replication is not suitable for DCC because: 1. Excessive Costs of Typical Synchronous Transaction Replication Protocols Typical distributed transaction protocols have too many additional, primarily time-related, network costs. 
For one network call, up to three additional calls are received. In such a form, the simplest atomic operations degrade dramatically. 2. The Writer-Writer Conflict Is Not Resolved A conflict happens when the same data is changed simultaneously in different transactions. In terms of past change, the system only “remembers” the absolute last change (or history). The point of the SQL construct for sequential application gets lost. Such replication conflicts sometimes have no solutions at all. In a separate article, I will give an example of different replication types for PostgreSQL and Microsoft SQL, and I will explain: Why they cannot solve the transactional load balancing problem architecturallyWhy it is not solved architecturally at the hardware level The writer-writer problem is fundamentally unsolvable without a proxy service at the logical level of analyzing the application's SQL traffic. Exchange Mechanisms (Protocols) A full architectural description of DCC will be provided in a separate article. For now, let's confine ourselves to a brief summary to outline the issue at hand. All queries to the DBMS in DCC go through a proxy service. For example, on 1C systems, it can be installed on the application server. The proxy service recognizes whether the query type is Read or Write. And if Read, it sends it to the server bound to the user (session). If the query type is Change, it sends it to all servers asynchronously. It does not proceed to the next query until it receives a positive response from all servers. If an error occurs, it is propagated to the client application and the transaction is rolled back on all servers. If all servers have confirmed successful execution of the SQL construct, only then does the proxy process the next client SQL query. This is the kind of logic that does not result in logical contradictions. As can be seen, this arrangement incurs additional network and logical costs, although with proper optimization, they are minimal (we seek to achieve no more than 15% of the time delay of transactions). The algorithm described above is the basic protocol, and it is what we will call mirror-parallel. Unfortunately, this protocol is not logically capable of implementing mirrored data replication for all IT systems. In some cases, the data might for sure differ due to the specific nature of the system, another protocol is implemented for this purpose — “centralized asynchronous” — which will resolve synchronous information transfer for sure. The next section will cover it. Why a Centralized Protocol Is Needed in DCC Unfortunately, in some cases, sending the same structure to different servers gets assuredly different results. For example, when inserting a new record into a table, the primary key is generated based on the GUID on the server part. In this case, based on the definition alone, we will for sure get different results on different servers. As an option, it is possible to train the expert system of the proxy service to recognize that this field is formed on the server, form it explicitly on the proxy, and insert it into the query text. What if it is impossible to do so for some reason? To resolve such problems, another protocol and server is introduced. Let's call it Conditionally Central. Next, it will be clear that it is not actually a central server. The protocol algorithm is as follows. The proxy service recognizes that a SQL construct for a change is highly likely to produce different results on different servers. 
Therefore, it immediately redirects the query to the Conditionally Central server. After the query is executed there, replication triggers capture the changes it produced, and all those changes are sent to the remaining servers; only then does the proxy proceed to execute the next command. Similar to the mirror-parallel protocol, if at least one of the servers encounters an error, the error is propagated to the client and the transaction is rolled back. In this protocol, any collisions are completely prevented, data is always guaranteed to be synchronous, and there will be almost no distributed deadlocks. But there is an essential downside: due to its specific nature, the protocol imposes the highest runtime costs. Therefore, it will only be used in exceptional cases; otherwise, the target delay of no more than 15% would not be achievable. Mechanisms for Ensuring Integrity and Synchrony of Distributed Data at the Transaction Level in DCC As we discussed in the previous section, there are SQL change operations (e.g., NEWGUID) that, when executed simultaneously on different servers, are guaranteed to produce different values. Let us rule out all sorts of random functions and fluctuations and assume we have explicit arithmetic statements, e.g., UPDATE SummaryTable SET Total = Total + Delta WHERE ProductID = Y. In a single-threaded setting, such a statement will lead to the same result on every server and the data will stay synchronized, because arithmetic is deterministic. But if such statements are executed by multiple threads with varying Delta values, the threads may interleave differently on different servers, violating the chronology of query execution, which will lead to either deadlocks or data synchronization violations. In fact, it may turn out that the results of transactions on different servers differ. Sure, this will be rare, and it can be made rarer by certain measures, but it cannot be completely ruled out, nor can it be fully resolved without significant performance degradation. No such algorithm exists, just as there is no fully synchronized clock across multiple servers and no network query that is guaranteed to execute within a fixed amount of time. Therefore, DCC has a distributed transaction management service and, in particular, a mandatory transaction hash-sum check. Why a hash sum? Because it makes it possible to quickly compare the content of the changes across all servers. If everything matches, the transaction is confirmed; if not, it is rolled back with a corresponding error. More details will follow in a separate article. In terms of mathematics, there are some interesting similarities with quantum mechanics, in particular with the transactional-loop theory (a marginal theory by that name does exist). The Issue of Distributed Deadlocks in DCC This is one of the key problems in DCC and, in terms of implementation risk, the most dangerous one. Distributed deadlocks in DCC are a consequence of thread confusion caused by changes in the chronology of SQL query execution on different servers, which in turn occur due to uneven load on servers and network interfaces. In this case, unlike local deadlocks, which require at least two locked objects to occur, a distributed deadlock can involve only one object. 
To reduce distributed deadlocks, several process challenges need to be addressed, one of them being the allocation of different physical network interfaces for writing and reading. After all, if we consider the ratio of CPU operations like Read to Write, there will be a ratio of one order, but for network traffic, the ratio will start from two orders of magnitude, more than hundreds of times. Therefore, by splitting these operations (Read-Write) on physically different channels of network communications, we can guarantee a certain time of delivery of Write-type SQL queries to all servers. Also, the fewer locks there are in the system, the less likely distributed deadlocks are in data. However, using DCCs as an additional benefit, it is possible to expand such bottlenecks, if any, in the system at the level of settings. If distributed deadlocks still occasionally occur, there is a special DCC service that monitors all blocking processes on all servers and resolves them by rolling back one of the transactions. More details will follow in a separate article. Special Features of DCC Administration Administration of a distributed system is certainly more complicated than that of a local system, especially with the requirements of operation 24/7. And all potential DCC users are just the proud owners of IT systems with 24/7 operation mode. Immediate problems include distributed database recovery and hot plugging of a new server to the DCC. Prompt data reconciliation in distributed databases is also necessary, despite transaction reconciliation mechanisms. Performance monitoring tasks and, in particular, the aggregation of counter data across related transactions and cluster servers in general begin to emerge. There are some security issues with setting up a proxy service. A full list of problems and proposed solutions will be in a separate article. Parallel Computing Technologies as a Solution to Increase the Efficiency of DCC Use For a scalable IT system, high parallelism of processes within the database is essential. For parallel reporting, as a rule, this issue does not occur. For transactional workloads, due to historical vestiges of suboptimal architecture, locks at the level of changing the same records (writer-writer conflict) are possible. If the IT system can be changed and there is open-source code, then the system can be optimized. And if the system is closed, what shall we do in this case? In case of using DCC, there are opportunities at the level of administration to circumvent such restrictions. Or at least expand the possibilities. In particular, through customizations, we can enable changing the same record without waiting for the transaction to be committed — if a dirty read is possible, of course. At the same time, if the transaction is rolled back, the change data in the chronological sequence are also rolled back. This situation is exactly appropriate, for example, for tables with aggregation of totals. I already have solutions for this problem, and I believe that regardless of using DCC, it is necessary to expand administrative settings of the DBMS, both Postgres and MSSQL (haven't investigated the issue on Oracle). More details will follow in a separate article. It is also necessary to disclose the topic of dirty reading in DCC and possible minor improvements taking this into account, such as the introduction of virtual locks. Plan for the Following Publications on the Topic of DCC Article 2. DCC load-testing results Article 3. 
Plan for the Following Publications on the Topic of DCC

Article 2. DCC load-testing results
Article 3. Why transactional replication can't be used for DCC
Article 4. Brief architectural description of the DCC
Article 5. The purpose of a centralized protocol in DCC
Article 6. Mechanisms for ensuring integrity and synchronization of distributed data at the transaction level in DCC
Article 7. The problem of distributed deadlocks in DCC
Article 8. Special features of DCC administration
Article 9. Parallel computing technologies as a tool to increase the efficiency of DCC utilization
Article 10. Example of DCC integration with 1C 8.x
Enterprise data solutions are growing across data warehouses, data lakes, data lakehouses, and hybrid cloud platforms. As data grows exponentially across these services, it is the data practitioner's responsibility to secure the environment with guardrails and privacy boundaries. In this article, we will walk through a framework for implementing security controls in AWS and see how to apply it across the Redshift, Glue, DynamoDB, and Aurora database services.

The Security Framework for Modern Data Infrastructure

When building scalable and secure AWS-native data platforms (Glue, Redshift, DynamoDB, Aurora), I recommend thinking of security in terms of seven pillars. Each pillar comes with practical checkpoints you can implement and audit against.

Pillar 1: Identity and Access Control

The identity and access control pillar ensures only the right people and systems can touch your data. Start by centralizing identities with IAM Identity Center/SSO. Enforce the principle of least privilege with IAM roles (not long-lived users), granting only the access a user needs to perform their job duties. You can also leverage attribute-based access control (ABAC), which drives access through tags such as department=finance or data_classification=pii. By starting with identity as the first pillar, we establish clear boundaries, with every database object having an owning principal.

Pillar 2: Data Classification and Catalog Governance

The second step is to go a level deeper and classify the datasets attached to those identities. In a data lake, we can label datasets with tags such as pii=high or pii=highly-confidential. Once classified, these tags drive tag-based access control (TBAC) across services such as Glue and Redshift, ensuring only the right people see the right data. Along with this, maintaining column-level metadata like region or compliance domain in the Glue Data Catalog makes governance consistent and transparent. With proper classification and catalog governance, policies can be applied uniformly across the enterprise instead of in silos.

Pillar 3: Network and Perimeter Security

Keep your data safe by making sure it only travels over private, secure paths. Put your databases in private networks, use private connections (such as VPC endpoints) to reach services, and make sure all data leaving the system is encrypted and inspected.

Pillar 4: Encryption as Needed

We should not treat all data the same way; the treatment should follow the data classification from Pillar 2. Red data (very sensitive, such as financial or health records) should be tightly secured at rest using KMS customer-managed keys (CMKs) with rotation turned on, and a good practice is not to keep red data in open or long-lived storage. Orange data is important but less sensitive, such as business logs, and should at least be protected by proper bucket policies. Green data is general data that can be shared more freely, such as ordinary logs, and does not require encryption.

Pillar 5: Secrets and Credential Management

Never store passwords in a code base or in queries. In AWS, you can keep them in Secrets Manager, which encrypts them and rotates them periodically. Instead of giving every app a fixed password, let it obtain temporary credentials through IAM roles, which is safer and harder to misuse. For databases like Aurora, you do not even need a password at all; you can log in with a short-lived token. The rule is simple: don't use permanent keys; always use rotating or temporary ones.
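As a small illustration of Pillar 5, the sketch below fetches database credentials from Secrets Manager at runtime instead of hard-coding them. It assumes a secret with username and password fields already exists; the secret name, field names, and region are placeholders.

Python
# Minimal sketch: read database credentials from AWS Secrets Manager at runtime
# rather than embedding them in code or configuration files.
import json
import boto3

def get_db_credentials(secret_id="app/aurora/credentials", region="us-east-1"):
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["password"]

# The calling code never sees a long-lived key; rotation changes the stored
# value in Secrets Manager without requiring a redeploy.
username, password = get_db_credentials()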
Pillar 6: Monitoring, Detection, and Audit

Think of monitoring as a CCTV camera for your data: you should always know who touched what, when, and why. In AWS, turn on CloudTrail to record all actions and store those records safely in CloudWatch Logs. Tools like GuardDuty act as guards watching for unusual activity, while Security Hub gathers all findings in one place. For stricter checks, databases like Aurora and Redshift have their own audit logs, and Macie scans S3 to catch exposed sensitive files. The idea is simple: if something goes wrong, you should be able to trace it back quickly.

Pillar 7: Policy as Code

Manage cloud policies as infrastructure as code rather than through manual deployments so that they scale. In AWS, you can define KMS keys, IAM roles, or Lake Formation policies in CloudFormation, CDK, or Terraform. Before changes go live, tools like cfn-nag or tfsec flag anything that looks unsafe. For risky actions (such as changing IAM roles or encryption keys), set up approval steps so no one sneaks in a bad change.

Example #1: AWS Glue + Lake Formation (Catalog, ETL, Data Perimeter)

AWS Glue works like the factory that moves and transforms your data, while Lake Formation is the guardrail that makes sure only the right people and systems can see the right parts of that data. Together, they centralize governance, protect sensitive fields, and ensure ETL jobs run safely without leaking information.

Steps to Implement Security

1. Classify your data with tags: Define tags such as pii={none, low, high} (or simply pii={true, false}) and region={us, eu}. Apply these tags to databases, tables, and even columns in the Glue Data Catalog.
2. Control access with tag-based policies (TBAC): Create Lake Formation permissions using tags, for example: analyst role: pii != high; ops role: pii in {none, low}; compliance role: full access and audit rights (a short boto3 sketch of such a grant follows these steps).
3. Apply row-level filters and column masking: Use LF-governed tables to filter rows (e.g., only show region=session_region). Mask sensitive columns, such as email and date of birth, with hash values.
4. Secure your Glue jobs: Turn on encryption for S3, CloudWatch, and job bookmarks with KMS CMKs. Run Glue jobs inside a VPC, with S3 access routed through Gateway/Interface Endpoints rather than the public internet. Assign a minimal IAM role per job, keeping dev and prod roles separate and scoped to exact resources.
5. Keep catalog and ETL hygiene strong: Block public access to S3 buckets (disable ACLs/public policies). Require TLS and server-side encryption on all writes (aws:SecureTransport, x-amz-server-side-encryption). Enable continuous logging of Glue jobs into CloudWatch for audit and troubleshooting.
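The boto3 sketch below illustrates the tag-based grant referenced in step 2. The account ID, role ARN, database, and table names are placeholders, and it assumes Lake Formation already governs the catalog.

Python
# Sketch only: define an LF-tag, classify a table with it, and grant SELECT to a
# role for every table whose pii tag is 'none' or 'low'. All identifiers below
# are hypothetical placeholders.
import boto3

lf = boto3.client("lakeformation")

# 1. Define the tag vocabulary once per catalog.
lf.create_lf_tag(TagKey="pii", TagValues=["none", "low", "high"])

# 2. Classify a table (column-level tagging works the same way).
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "sales", "Name": "orders"}},
    LFTags=[{"TagKey": "pii", "TagValues": ["low"]}],
)

# 3. Grant the analyst role access to anything tagged pii=none or pii=low.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "pii", "TagValues": ["none", "low"]}],
        }
    },
    Permissions=["SELECT"],
)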
Example #2: Amazon Redshift (Warehouse Analytics)

Amazon Redshift is your data warehouse; it is powerful for analytics but also home to a lot of sensitive data. Protecting it means enforcing who can see which rows or columns, isolating traffic so nothing leaks, and making sure every action is logged.

Steps to Implement Security

1. Network and encryption: Place Redshift clusters or serverless workgroups in private subnets (no public endpoints). Turn on encryption at rest with a customer-managed KMS key. Force SSL connections (reject non-TLS). Use Enhanced VPC Routing so COPY/UNLOAD only moves data via VPC endpoints.
2. Identity and SSO: Use IAM Identity Center or SAML for single sign-on. Avoid static keys; rely on role chaining for COPY/UNLOAD to S3.
3. Fine-grained controls: Enable row-level security (RLS) and column-level security (CLS). Use dynamic data masking for fields like SSNs, showing only partial data unless the role allows full access.
4. Audit and logging: Enable database audit logging to S3/CloudWatch. Integrate with CloudTrail for management events.

Example #3: Amazon DynamoDB (Operational Data)

Amazon DynamoDB powers fast apps at scale, but governance here is about restricting who can touch which items, keeping traffic private, and ensuring logs exist for compliance.

Steps to Implement Security

1. Item-level permissions: Use IAM conditions like dynamodb:LeadingKeys to tie access to a user's partition key (e.g., users only see their own orders). For example, bind the customer_id in the request to the caller's IAM tag.
2. Private access and encryption: Use Gateway VPC Endpoints for DynamoDB; block non-VPC traffic if possible (via SCP). Require encryption at rest with customer-managed KMS keys.
3. Resilience and lifecycle: Turn on Point-in-Time Recovery (PITR) and on-demand backups. Use TTL for short-lived items to reduce exposure (but don't rely on TTL alone for compliance deletion).
4. Audit: Enable CloudTrail data events for sensitive tables where you need full visibility (note: extra cost).
5. Streams and integrations: If using DynamoDB Streams for CDC, ensure consumer apps (Lambda, Glue) run inside a VPC with least-privilege roles, and force them to write only into encrypted destinations.

Example #4: Amazon Aurora (Relational Data)

Amazon Aurora is a managed relational database (compatible with PostgreSQL and MySQL) that runs mission-critical workloads. Because it often stores highly sensitive transactional data, the governance model here must combine AWS controls (encryption, network) with native SQL features (roles, RLS, auditing).

Steps to Implement Security

1. Network and endpoints: Deploy Aurora clusters in private subnets and never expose public endpoints. Restrict inbound rules to application security groups only, not wide CIDRs.
2. Encryption and TLS: Enable KMS CMK encryption at cluster creation. Enforce TLS connections: set rds.force_ssl=1 (Postgres) to reject non-SSL clients.
3. Identity and credentials: Store master and user credentials in AWS Secrets Manager with automatic rotation (Lambda). Use IAM Database Authentication for short-lived, token-based access; it integrates neatly with CloudTrail for auditing (a short token-generation sketch follows these steps).
4. Database-level governance: Define roles with least privilege:

SQL
CREATE ROLE analyst NOINHERIT;
GRANT USAGE ON SCHEMA sales TO analyst;
GRANT SELECT (order_id, amount, region) ON sales.orders TO analyst;

Enable row-level security (RLS):

SQL
ALTER TABLE sales.orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY region_isolation ON sales.orders
    USING (region = current_setting('app.user_region', true));

5. Auditing: Enable pgaudit to log SELECT, DDL, and DML events as needed. Stream Aurora/Postgres logs to CloudWatch Logs and set appropriate retention policies.
6. Backups, PITR, and disaster recovery: Turn on automated backups and Point-in-Time Recovery (PITR). Regularly test restores to verify recovery SLAs. For stronger assurance, create cross-region read replicas and protect them with replicated CMKs.
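The sketch below illustrates the token-based access from step 3. The cluster endpoint, database, user, and region are placeholders, and it assumes the database user has been granted the rds_iam role.

Python
# Sketch: connect to an Aurora PostgreSQL cluster with a short-lived IAM auth
# token instead of a stored password. All connection details are hypothetical.
import boto3
import psycopg2

HOST = "my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com"
PORT = 5432
USER = "analyst"
REGION = "us-east-1"

rds = boto3.client("rds", region_name=REGION)
token = rds.generate_db_auth_token(
    DBHostname=HOST, Port=PORT, DBUsername=USER, Region=REGION
)

# The token expires after roughly 15 minutes, so there is no long-lived secret
# to leak or rotate manually.
conn = psycopg2.connect(
    host=HOST, port=PORT, user=USER, password=token,
    dbname="sales", sslmode="require",
)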
AWS Security Framework Cheatsheet

Network isolation. Glue: VPC jobs, endpoints. Redshift: private subnets, no public endpoint, Enhanced VPC Routing. DynamoDB: Gateway VPC Endpoint. Aurora: private subnets, SG-only ingress.
Encryption at rest. Glue: KMS on catalog, logs, and job I/O. Redshift: KMS CMK per cluster/workgroup. DynamoDB: KMS CMK per table. Aurora: KMS CMK per cluster.
TLS in transit. Glue: VPC endpoints. Redshift: require SSL. DynamoDB: TLS to the endpoint (SigV4). Aurora: enforce SSL (rds.force_ssl).
Fine-grained access. Glue: LF TBAC, row/cell masking. Redshift: RLS/CLS, masking policies, late-binding views. DynamoDB: IAM with LeadingKeys ABAC. Aurora: GRANTs, RLS, views/pgcrypto.
Secrets and auth. Glue: least-privilege job role. Redshift: SSO/SAML, IAM roles for COPY/UNLOAD. DynamoDB: IAM roles, no static keys. Aurora: Secrets Manager with rotation, optional IAM DB Auth.
Audit and detection. Glue: catalog access logs, Glue job logs. Redshift: user activity log, CloudTrail, QMRs. DynamoDB: CloudTrail data events. Aurora: pgaudit, CloudWatch Logs.
Backup/recovery. Glue: ETL is stateless. Redshift: snapshots, cross-region as needed. DynamoDB: PITR, on-demand backups. Aurora: automated backups, PITR, cross-region replica.

By grounding security in seven pillars (identity, classification, network, encryption, secrets management, monitoring, and policy as code), organizations gain more than guardrails; they gain a framework for sustainable and secure growth.