Kubernetes in the Enterprise
Over a decade in, Kubernetes is the central force in modern application delivery. However, as its adoption has matured, so have its challenges: sprawling toolchains, complex cluster architectures, escalating costs, and the balancing act between developer agility and operational control. Beyond running Kubernetes at scale, organizations must also tackle the cultural and strategic shifts needed to make it work for their teams.

As the industry pushes toward more intelligent and integrated operations, platform engineering and internal developer platforms are helping teams address issues like Kubernetes tool sprawl, while AI continues cementing its usefulness for optimizing cluster management, observability, and release pipelines.

DZone's 2025 Kubernetes in the Enterprise Trend Report examines the realities of building and running Kubernetes in production today. Our research and expert-written articles explore how teams are streamlining workflows, modernizing legacy systems, and using Kubernetes as the foundation for the next wave of intelligent, scalable applications. Whether you're on your first prod cluster or refining a globally distributed platform, this report delivers the data, perspectives, and practical takeaways you need to meet Kubernetes' demands head-on.
Done well, knowledge base integrations enable AI agents to deliver specific, context-rich answers without forcing employees to dig through endless folders. Done poorly, they introduce security gaps and permissioning mistakes that erode trust. The challenge for software developers building these integrations is that no two knowledge bases handle permissions the same way. One might gate content at the space level, another at the page level, and a third at the attachment level. Adding to these challenges, permissions aren't static. They change when people join or leave teams, switch roles, or when content owners update visibility rules. If your integration doesn't mirror these controls accurately and in real time, you risk exposing the wrong data to the wrong person. In building these knowledge base integrations ourselves, we've learned lots of practical tips for how to build secure, maintainable connectors that shorten the time to deployment without cutting corners on data security. 1. Treat Permissions as a First-Class Data Type Too many integration projects prioritize syncing content over permissions. This approach is backwards. Before your AI agent processes a single page, it should understand the permission model of the source system and be able to represent it internally. This means: Mapping every relevant permission scope in the source system (space, folder, page, attachment, comment).Representing permissions in your data model so your AI agent can enforce them before returning a result.Designing for exceptions. For example, if an article is generally public within a department but contains one restricted attachment, your connector should respect that partial restriction. For example, in a Confluence integration, you should check both space-level and page-level rules for each request. If you cache content to speed up retrieval, you must also cache the permissions and invalidate them promptly when they change. 2. Sync Permissions as Often as Content Permissions drift quickly. Someone might be promoted, transferred, or removed from a sensitive project, and the content they previously accessed is suddenly off-limits. Your AI agent should never rely on a stale permission snapshot. A practical approach is to tie permission updates to the same sync cadence as content updates. If you're fetching new or updated articles every five minutes, refresh the associated access control lists (ACLs) on the same schedule. If the source system supports webhooks or event subscriptions for permission changes, use them to trigger targeted re-syncs. 3. Respect the Principle of Least Privilege in Responses Enforcing permissions also shapes what your AI agent returns. For example, say your AI agent receives the query, "What are the latest results from our employee engagement survey?" The underlying knowledge base contains a page with survey results visible only to HR and executives. Even if the query perfectly matches the page's content, the agent should respond with either no result or a message indicating that the content is restricted. This means filtering retrieved documents at query time based on the current user's identity and permissions, not just when content is first synced. Retrieval-augmented generation (RAG) pipelines need this filter stage before passing context to the LLM. 4. Normalize Data Without Flattening Security Every knowledge base stores content differently, whether that's nested pages in Confluence, blocks in Notion, or articles in Zendesk. 
Normalizing these formats makes it easier for your AI agent to handle multiple systems. But normalization should never strip away the original permission structures. For instance, when creating a unified search index, store both the normalized text and the original system's permission metadata. Your query service can then enforce the correct rules regardless of which source system the content came from. 5. Handle Hierarchies and Inheritance Carefully Most systems allow permission inheritance, where you grant access to a top-level space, and then all child pages inherit those rights unless overridden. Your connector must understand and replicate this logic. For example, with an internal help desk AI agent, a "VPN Troubleshooting" article may inherit view rights from its parent "Network Resources" space. But if someone restricts that one article to a smaller group, your integration must override the inherited rule and enforce the more restrictive setting. 6. Test With Realistic, Complex Scenarios Permission bugs often hide in edge cases: Mixed inheritance and explicit restrictionsUsers with multiple, overlapping rolesAttachments with different permissions than their parent page Developers should build a test harness that mirrors these conditions using anonymized or synthetic data. Validate not only that your AI agent can fetch the right content, but that it never exposes restricted data, even when queried indirectly ("What did the survey results say about the marketing team?"). 7. Build for Ongoing Maintenance A secure, reliable knowledge base integration isn't a "set it and forget it" feature. It's an active part of your AI agent's architecture. Once deployed, knowledge base integrations require constant upkeep: API version changes, evolving permission models, and shifts in organizational structure. Assign ownership for monitoring and updating each connector, and automate regression tests for permission enforcement. Document your mapping between source-system roles and internal permission groups so that changes can be made confidently when needed. By giving permissions the same engineering rigor as content retrieval, you protect sensitive data and preserve trust in the system. That trust is what ultimately allows these AI agents to be embedded into the real workflows where they deliver the most value. You may be looking at the steps involved in building knowledge base connectors and wonder why they matter. When implemented well, they can transform workflows: Enterprise AI search: By integrating with a company's wiki, CRM, and file storage, a search agent can answer multi-step queries like, "What's the status of the Acme deal?" pulling from sales notes, internal strategy docs, and shared project plans. Permissions ensure that deal details remain visible only to the account team.IT help desk agent: When connected to a knowledge base, the agent can deliver precise, step-by-step troubleshooting guides to employees. If a VPN setup page is restricted to IT staff, the agent won't surface it to non-IT users.New hire onboarding bot: Integrated with the company wiki and messaging platform, an agent can answer questions about policies, teams, and tools. Each answer is filtered through the same rules that would apply if the employee searched manually. These examples work not because the AI agent "knows everything," but because it knows how to retrieve the right things for the right person at the right time. 
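To make the query-time filtering from point 3 concrete, here is a minimal Java sketch. PermissionStore and RetrievedDocument are hypothetical stand-ins for whatever your connector and search index expose; the point is that the check runs on every request, against the current user's identity, before any retrieved text reaches the LLM.

Java
import java.util.List;

// Hypothetical types: a document retrieved from the unified index, carrying the
// source system's permission metadata, and a store that answers "can this user
// see this document right now?"
record RetrievedDocument(String sourceSystem, String documentId, String text) {}

interface PermissionStore {
    boolean canRead(String userId, String sourceSystem, String documentId);
}

class PermissionAwareRetriever {
    private final PermissionStore permissions;

    PermissionAwareRetriever(PermissionStore permissions) {
        this.permissions = permissions;
    }

    // Filter candidate documents against the current user's permissions at query
    // time, so restricted content never becomes LLM context, even if it matches.
    List<RetrievedDocument> filterForUser(String userId, List<RetrievedDocument> candidates) {
        return candidates.stream()
                .filter(doc -> permissions.canRead(userId, doc.sourceSystem(), doc.documentId()))
                .toList();
    }
}

The same filter can back all three examples above; only the PermissionStore implementation changes per source system.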
As knowledge base products become the standard for AI agents, it's critical to manage integrations in a way that prioritizes data security and trust.
Many outdated or imprecise claims about transaction isolation levels in MongoDB persist. These claims are outdated because they are often based on MongoDB 4.0, the version where multi-document transactions were introduced (the old Jepsen report, for example), and the issues found there have been fixed since then. They are also imprecise because people attempt to map MongoDB's transaction isolation to SQL isolation levels, which is inappropriate, as the SQL standard definitions ignore Multi-Version Concurrency Control (MVCC), utilized by most databases, including MongoDB. Martin Kleppmann has discussed this issue and provided tests to assess transaction isolation and potential anomalies. I will conduct these tests on MongoDB to explain how multi-document transactions work and avoid anomalies. I followed the structure of Martin Kleppmann's tests on PostgreSQL and ported them to MongoDB.

The read isolation level in MongoDB is controlled by the read concern. The "snapshot" read concern is the only one comparable to what other MVCC SQL databases provide: it maps to Snapshot Isolation, commonly (and improperly) called Repeatable Read when the closest SQL standard term is used. As I test on a single-node lab, I use "majority" to show that it does more than Read Committed. The write concern should also be set to "majority" to ensure that at least one node is common between the read and write quorums.

Recap on Isolation Levels in MongoDB

Let me quickly explain the other isolation levels and why they cannot be mapped to the SQL standard:

readConcern: { level: "local" } is sometimes compared to Read Uncommitted because it may show a state that can later be rolled back in case of failure. However, some SQL databases may show the same behavior in some rare conditions (example here) and still call that Read Committed.
readConcern: { level: "majority" } is sometimes compared to Read Committed because it avoids uncommitted reads. However, Read Committed was defined for wait-on-conflict databases to reduce the lock duration in two-phase locking, whereas MongoDB multi-document transactions use fail-on-conflict to avoid waits. Some databases consider that Read Committed can allow reads from multiple states (example here), while others consider that it must be a statement-level snapshot isolation (examples here). In a multi-shard transaction, "majority" may show a result from multiple states; only "snapshot" is consistent with a single point in the timeline.
readConcern: { level: "snapshot" } is the real equivalent of Snapshot Isolation and prevents more anomalies than Read Committed. Some databases even call that "serializable" (example here) because the SQL standard ignores the write-skew anomaly.
readConcern: { level: "linearizable" } is comparable to Serializable, but only for a single document, and is not available for multi-document transactions. This is similar to many SQL databases that do not provide Serializable because it reintroduces the scalability problems of read locks, which MVCC avoids.

Read Committed Basic Requirements (G0, G1a, G1b, G1c)

Here are some tests for anomalies typically prevented in Read Committed. I'll run them with readConcern: { level: "majority" }, but keep in mind that readConcern: { level: "snapshot" } may be better if you want a consistent snapshot across multiple shards.
MongoDB Prevents Write Cycles (G0) With Conflict Error JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); In a two-phase locking database, with wait-on-conflict behavior, the second transaction would wait for the first one to avoid anomalies. However, MongoDB with transactions is fail-on-conflict and raises a retriable error to avoid the anomaly. Each transaction touched only one document, but it was declared explicitly with a session and startTransaction(), to allow multi-document transactions, and this is why we observed the fail-on-conflict behavior to let the application apply its retry logic for complex transactions. If the conflicting update was run as a single-document transaction, equivalent to an auto-commit statement, it would have used a wait-on-conflict behavior. I can test it by immediately running this while the t1 transaction is still active: JavaScript const db = db.getMongo().getDB("test_db"); print(`Elapsed time: ${ ((startTime = new Date()) && db.test.updateOne({ _id: 1 }, { $set: { value: 12 } })) && (new Date() - startTime) } ms`); Elapsed time: 72548 ms I've run the updateOne({ _id: 1 }) without an implicit transaction. It waited for the other transaction to terminate, which happened after a 60-second timeout, and then the update was successful. The first transaction that timed out is aborted: JavaScript session1.commitTransaction(); MongoServerError[NoSuchTransaction]: Transaction with { txnNumber: 2 } has been aborted. The behavior of conflict in transactions differs: wait-on-conflict for implicit single-document transactionsfail-on-conflict for explicit multiple-document transactions immediately, resulting in a transient error, without waiting, to let the application rollback and retry. MongoDB Prevents Aborted Reads (G1a) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 101 } }); T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] session1.abortTransaction(); T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] session2.commitTransaction(); MongoDB prevents reading an aborted transaction by reading only the committed value when Read Concern is 'majority' or 'snapshot.' 
MongoDB Prevents Intermediate Reads (G1b) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 101 } }); T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] The non-committed change from T1 is not visible to T2. JavaScript T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); session1.commitTransaction(); // T1 commits T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] The committed change from T1 is still not visible to T2 because it happened after T2 started. This is different from the majority of Multi-Version Concurrency Control SQL databases. To minimize the performance impact of wait-on-conflict, they reset the read time before each statement in Read Committed, as phantom reads are allowed. They would have displayed the newly committed value with this example. MongoDB never does that; the read time is always the start of the transaction, and no phantom read anomaly happens. However, it doesn't wait to see if the conflict is resolved or must fail with a deadlock, and fails immediately to let the application retry it. MongoDB Prevents Circular Information Flow (G1c) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 22 } }); T1.test.find({ _id: 2 }); [ { _id: 2, value: 20 } ] T2.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] session1.commitTransaction(); session2.commitTransaction(); In both transactions, the uncommitted changes are not visible to others. MongoDB Prevents Observed Transaction Vanishes (OTV) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T3 const session3 = db.getMongo().startSession(); const T3 = session3.getDatabase("test_db"); session3.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T1.test.updateOne({ _id: 2 }, { $set: { value: 19 } }); T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. 
:: Please retry your operation or multi-document transaction. This anomaly is prevented by fail-on-conflict with an explicit transaction. With an implicit single-document transaction, it would have to wait for the conflicting transaction to end. MongoDB Prevents Predicate-Many-Preceders (PMP) With a SQL database, this anomaly would require the Snapshot Isolation level because Read Committed uses different read times per statement. However, I can show it in MongoDB with 'majority' read concern, 'snapshot' being required only to get cross-shard snapshot consistency. JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ value: 30 }).toArray(); [] T2.test.insertOne( { _id: 3, value: 30 } ); session2.commitTransaction(); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [] The newly inserted row is not visible because it was committed by T2 after the start of T1. Martin Kleppmann's tests include some variations with a delete statement and a write predicate: JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateMany({}, { $inc: { value: 10 } }); T2.test.deleteMany({ value: 20 }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction. As it is an explicit transaction, rather than blocking, the delete detects the conflict and raises a retriable exception to prevent the anomaly. Compared to PostgreSQL, which prevents that in Repeatable Read, it saves the waiting time before failure, but requires the application to implement a retry logic. MongoDB Prevents Lost Update (P4) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction. 
As it is an explicit transaction, the update doesn't wait and raises a retriable exception, so that it is impossible to overwrite the other update without waiting for its completion. MongoDB Prevents Read Skew (G-single) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find({ _id: 2 }); [ { _id: 2, value: 20 } ] T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 18 } }); session2.commitTransaction(); T1.test.find({ _id: 2 }); [ { _id: 2, value: 20 } ] In SQL databases with Read Committed isolation, a read skew anomaly could display the value 18. However, MongoDB avoids this issue by reading the same value of 20 consistently throughout the transaction, as it reads data as of the start of the transaction. Martin Kleppmann's tests include a variation with predicate dependency: JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.findOne({ value: { $mod: [5, 0] } }); { _id: 1, value: 10 } T2.test.updateOne({ value: 10 }, { $set: { value: 12 } }); session2.commitTransaction(); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [] The uncommitted value 12 which is a multiple of 3 is not visible to the transaction that started before. Another test includes a variation with a write predicate in a delete statement: JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 18 } }); session2.commitTransaction(); T1.test.deleteMany({ value: 20 }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction. This read skew anomaly is prevented by the fail-on-conflict behavior when writing a document that has uncommitted changes from another transaction. 
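Several of the tests above end in a retriable WriteConflict labeled as a TransientTransactionError. On the application side, the usual way to handle this is the driver's transaction helper, which re-runs the transaction body for you. Below is a minimal sketch with the MongoDB Java sync driver; the connection string is an assumption, and the test_db/test namespace simply mirrors the lab setup above.

Java
import com.mongodb.ReadConcern;
import com.mongodb.TransactionOptions;
import com.mongodb.WriteConcern;
import com.mongodb.client.ClientSession;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public class UpdateWithRetry {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> test =
                    client.getDatabase("test_db").getCollection("test");

            TransactionOptions txnOptions = TransactionOptions.builder()
                    .readConcern(ReadConcern.SNAPSHOT)
                    .writeConcern(WriteConcern.MAJORITY)
                    .build();

            try (ClientSession session = client.startSession()) {
                // withTransaction re-runs the body when the server returns a
                // TransientTransactionError (for example, a WriteConflict),
                // instead of leaving the retry loop to hand-written code.
                session.withTransaction(() -> {
                    test.updateOne(session, eq("_id", 1), set("value", 11));
                    test.updateOne(session, eq("_id", 2), set("value", 19));
                    return null;
                }, txnOptions);
            }
        }
    }
}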
Write Skew (G2-item) Must Be Managed by the Application

JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: { $in: [1, 2] } }) [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] T2.test.find({ _id: { $in: [1, 2] } }) [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 21 } }); session1.commitTransaction(); session2.commitTransaction(); MongoDB doesn't detect the read/write conflict when one transaction has read a value updated by the other, and then writes something that may have depended on this value. The Read Concern doesn't provide the Serializable guarantee. Such isolation requires acquiring range or predicate locks during reads, and doing it prematurely would hinder the performance of a database designed to scale. For the transactions that need to avoid this, the application can transform the read/write conflict into a write/write conflict by updating a field in the document that was read, to be sure that other transactions do not modify it, or by re-checking the value when updating.

Anti-Dependency Cycles (G2) Must Be Managed by the Application

JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } }); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [] T2.test.find({ value: { $mod: [3, 0] } }).toArray(); [] T1.test.insertOne( { _id: 3, value: 30 } ); T2.test.insertOne( { _id: 4, value: 42 } ); session1.commitTransaction(); session2.commitTransaction(); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [ { _id: 3, value: 30 }, { _id: 4, value: 42 } ] The read/write conflict was not detected, and both transactions were able to write, even if they may have depended on a previous read that had been modified by the other transaction. MongoDB does not acquire locks across read and write calls. If you run a multi-document transaction where the writes depend on the reads, the application must explicitly write to the read set in order to detect the write conflict and avoid the anomaly. All those tests were based on https://github.com/ept/hermitage. There's a lot of information about MongoDB transactions in the MongoDB Multi-Document ACID Transactions whitepaper from 2020. While the document model offers simplicity and performance when a single document matches the business transaction, MongoDB supports multi-statement transactions with Snapshot Isolation, similar to many SQL databases using Multi-Version Concurrency Control (MVCC), but favoring fail-on-conflict rather than wait.
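For the write-skew cases above (G2-item and G2), the guard just described, turning a read/write dependency into a write/write conflict by touching the read set, could look like the following, continuing the Java driver sketch from the previous section. The "version" field is purely illustrative.

Java
// Continuing the earlier sketch (same session, test collection, and txnOptions):
// T reads {_id: 2} to decide how to update {_id: 1}. By also writing to the
// document it only read (bumping an illustrative "version" field with
// Updates.inc), any concurrent transaction that modifies {_id: 2} now triggers
// a WriteConflict instead of a silent write skew, and withTransaction retries
// the losing transaction.
session.withTransaction(() -> {
    Document other = test.find(session, eq("_id", 2)).first();
    // ... business decision based on other.get("value") ...
    test.updateOne(session, eq("_id", 2), inc("version", 1)); // touch the read set
    test.updateOne(session, eq("_id", 1), set("value", 11));
    return null;
}, txnOptions);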
Despite outdated myths surrounding NoSQL or based on old versions, its transaction implementation is robust and effectively prevents common transactional anomalies.
As you may have already guessed from the title, the topic for today will be Spring Boot WebSockets. Some time ago, I provided an example of WebSocket chat based on Akka toolkit libraries. However, this chat will have somewhat more features and a quite different design. I will skip some parts so as not to duplicate too much content from the previous article. Here you can find a more in-depth intro to WebSockets. Please note that all the code that's used in this article is also available in the GitHub repository.

Spring Boot WebSocket: Tools Used

Let's start the technical part of this text with a description of the tools that will be used to implement the whole application. As I cannot fully grasp how to build a real WebSocket API with the classic Spring STOMP overlay, I decided to go for Spring WebFlux and make everything reactive.

Spring Boot – No modern Java app based on Spring can exist without Spring Boot; all the autoconfiguration is priceless.
Spring WebFlux – A reactive version of classic Spring; it provides quite a nice and descriptive toolkit for handling both WebSockets and REST. I would dare to say that it is the only way to actually get WebSocket support in Spring.
Mongo – One of the most popular NoSQL databases; I am using it for storing message history.
Spring Reactive Mongo – Spring Boot starter for handling Mongo access in a reactive fashion. Using reactive in one place but not the other is not the best idea, so I decided to make DB access reactive as well.

Let's start the implementation!

Spring Boot WebSocket: Implementation

Dependencies and Config

pom.xml XML <dependencies> <!--Compile--> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-webflux</artifactId> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-data-mongodb-reactive</artifactId> </dependency> </dependencies>

application.properties Properties files spring.data.mongodb.uri=mongodb://chats-admin:admin@localhost:27017/chats

I prefer .properties over .yml; in my honest opinion, YAML is not readable and becomes unmaintainable at a larger scale.

WebSocketConfig Java @Configuration class WebSocketConfig { @Bean ChatStore chatStore(MessagesStore messagesStore) { return new DefaultChatStore(Clock.systemUTC(), messagesStore); } @Bean WebSocketHandler chatsHandler(ChatStore chatStore) { return new ChatsHandler(chatStore); } @Bean SimpleUrlHandlerMapping handlerMapping(WebSocketHandler wsh) { Map<String, WebSocketHandler> paths = Map.of("/chats/{id}", wsh); return new SimpleUrlHandlerMapping(paths, 1); } @Bean WebSocketHandlerAdapter webSocketHandlerAdapter() { return new WebSocketHandlerAdapter(); } }

And surprise, all four beans defined here are very important.

ChatStore – Custom bean for operating on chats; I will go into more detail on this bean in the following steps.
WebSocketHandler – Bean that will hold all the logic related to handling WebSocket sessions.
SimpleUrlHandlerMapping – Responsible for mapping URLs to the correct handler; the full URL for this one will look more or less like ws://localhost:8080/chats/{id}.
WebSocketHandlerAdapter – A kind of capability bean; it adds WebSocket handling support to the Spring WebFlux DispatcherHandler.
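Since the mapping above exposes ws://localhost:8080/chats/{id}, here is a quick programmatic way to poke the endpoint once the application is running, using Spring's ReactorNettyWebSocketClient. It is only a sketch and assumes a chat with ID 1 already exists (created through the REST API described below).

Java
import java.net.URI;
import java.time.Duration;

import org.springframework.web.reactive.socket.WebSocketMessage;
import org.springframework.web.reactive.socket.client.ReactorNettyWebSocketClient;
import reactor.core.publisher.Mono;

public class ChatClientSmokeTest {
    public static void main(String[] args) {
        ReactorNettyWebSocketClient client = new ReactorNettyWebSocketClient();
        client.execute(
                URI.create("ws://localhost:8080/chats/1"), // assumes chat 1 exists
                session -> session
                        // send one message, then print everything we receive
                        .send(Mono.just(session.textMessage("hello from the sketch")))
                        .thenMany(session.receive()
                                .map(WebSocketMessage::getPayloadAsText)
                                .doOnNext(System.out::println))
                        .then())
                .block(Duration.ofSeconds(10));
    }
}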
ChatsHandler

Java class ChatsHandler implements WebSocketHandler { private final Logger log = LoggerFactory.getLogger(ChatsHandler.class); private final ChatStore store; ChatsHandler(ChatStore store) { this.store = store; } @Override public Mono<Void> handle(WebSocketSession session) { String[] split = session.getHandshakeInfo() .getUri() .getPath() .split("/"); String chatIdStr = split[split.length - 1]; int chatId = Integer.parseInt(chatIdStr); ChatMeta chatMeta = store.get(chatId); if (chatMeta == null) { return session.close(CloseStatus.GOING_AWAY); } if (!chatMeta.canAddUser()) { return session.close(CloseStatus.NOT_ACCEPTABLE); } String sessionId = session.getId(); store.addNewUser(chatId, session); log.info("New user {} joined the chat {}", sessionId, chatId); return session .receive() .map(WebSocketMessage::getPayloadAsText) .flatMap(message -> store.addNewMessage(chatId, sessionId, message)) .flatMap(message -> broadcastToSessions(sessionId, message, store.get(chatId).sessions())) .doFinally(sig -> store.removeSession(chatId, session.getId())) .then(); } private Mono<Void> broadcastToSessions(String sessionId, String message, List<WebSocketSession> sessions) { return sessions .stream() .filter(session -> !session.getId().equals(sessionId)) .map(session -> session.send(Mono.just(session.textMessage(message)))) .reduce(Mono.empty(), Mono::then); } }

As I mentioned above, here you can find all the logic related to handling WebSocket sessions. First, we parse the ID of a chat from the URL to get the target chat, and we respond with different close statuses depending on the state of that particular chat. Additionally, I am broadcasting each message to all the sessions related to the particular chat, so that users can actually exchange messages. I have also added a doFinally trigger that will clear closed sessions from the ChatStore, to reduce redundant communication. As the code is reactive as a whole, there are some restrictions I need to follow. I have tried to make it as simple and readable as possible; if you have any ideas on how to improve it, I am open to them.

ChatRouter

Java @Configuration(proxyBeanMethods = false) class ChatRouter { private final ChatStore chatStore; ChatRouter(ChatStore chatStore) { this.chatStore = chatStore; } @Bean RouterFunction<ServerResponse> routes() { return RouterFunctions .route(POST("api/v1/chats/create"), e -> create(false)) .andRoute(POST("api/v1/chats/create-f2f"), e -> create(true)) .andRoute(GET("api/v1/chats/{id}"), this::get) .andRoute(DELETE("api/v1/chats/{id}"), this::delete); } }

WebFlux's approach to defining REST endpoints is quite different from classic Spring. Above, you can see the definition of four endpoints for managing chats. Similar to the Akka implementation, I want a REST API for managing chats and a WebSocket API for handling the chats themselves. I will skip the function implementations as they are pretty trivial; you can see them on GitHub.
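As a rough illustration of the WebFlux functional style, the create function referenced in the routes might look something like the sketch below (it slots into the ChatRouter class above, using org.springframework.web.reactive.function.server types). The real implementation is on GitHub; the response shape and URL here are assumptions.

Java
// A sketch only: creates a chat through ChatStore and returns its WebSocket URL.
private Mono<ServerResponse> create(boolean isF2F) {
    int chatId = chatStore.create(isF2F);
    return ServerResponse.status(HttpStatus.CREATED)
            .contentType(MediaType.APPLICATION_JSON)
            .bodyValue(Map.of(
                    "chatId", chatId,
                    "url", "ws://localhost:8080/chats/" + chatId));
}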
ChatStore

First, the interface:

Java public interface ChatStore { int create(boolean isF2F); void addNewUser(int id, WebSocketSession session); Mono<String> addNewMessage(int id, String userId, String message); void removeSession(int id, String session); ChatMeta get(int id); ChatMeta delete(int id); }

Then the implementation:

Java public class DefaultChatStore implements ChatStore { private final Map<Integer, ChatMeta> chats; private final AtomicInteger idGen; private final MessagesStore messagesStore; private final Clock clock; public DefaultChatStore(Clock clock, MessagesStore store) { this.chats = new ConcurrentHashMap<>(); this.idGen = new AtomicInteger(0); this.clock = clock; this.messagesStore = store; } @Override public int create(boolean isF2F) { int newId = idGen.incrementAndGet(); ChatMeta chatMeta = chats.computeIfAbsent(newId, id -> { if (isF2F) { return ChatMeta.ofId(id); } return ChatMeta.ofIdF2F(id); }); return chatMeta.id; } @Override public void addNewUser(int id, WebSocketSession session) { chats.computeIfPresent(id, (k, v) -> v.addUser(session)); } @Override public void removeSession(int id, String sessionId) { chats.computeIfPresent(id, (k, v) -> v.removeUser(sessionId)); } @Override public Mono<String> addNewMessage(int id, String userId, String message) { ChatMeta meta = chats.getOrDefault(id, null); if (meta != null) { Message messageDoc = new Message(id, userId, meta.offset.getAndIncrement(), clock.instant(), message); return messagesStore.save(messageDoc) .map(Message::getContent); } return Mono.empty(); } // omitted

The base of ChatStore is the ConcurrentHashMap that holds the metadata of all open chats. Most of the methods from the interface are self-explanatory, and there is nothing special behind them.

create – Creates a new chat, with a boolean attribute denoting whether the chat is f2f or group.
addNewUser – Adds a new user to an existing chat.
removeSession – Removes a user's session from an existing chat.
get – Gets the metadata of the chat with a given ID.
delete – Deletes the chat from the ConcurrentHashMap.

The only complex method here is addNewMessage. It increments the message counter within the chat and persists the message content in MongoDB for durability.

MongoDB Message Entity

Java public class Message { @Id private String id; private int chatId; private String owner; private long offset; private Instant timestamp; private String content;

This is the model for the message content stored in the database; there are three important fields here:

chatId – Represents the chat in which a particular message was sent.
owner – The userId of the message sender.
offset – Ordinal number of the message within the chat, used for retrieval ordering.

MessagesStore

Java public interface MessagesStore extends ReactiveMongoRepository<Message, String> {}

Nothing special: a classic Spring repository, but in a reactive fashion, providing a similar set of features as JpaRepository. It is used directly in ChatStore. Additionally, in the main application class, WebsocketsChatApplication, I am activating reactive repositories by using @EnableReactiveMongoRepositories. Without this annotation, the MessagesStore from above would not work. And here we go, we have the whole chat implemented. Let's test it!

Spring Boot WebSocket: Testing

For tests, I'm using Postman and Simple WebSocket Client. I'm creating a new chat using Postman. In the response body, I get a WebSocket URL to the recently created chat. Now it is time to use it and check whether users can communicate with one another. Simple WebSocket Client comes into play here, and I connect to the newly created chat with it.
Here we are, everything is working, and users can communicate with each other. There is one last thing to do: let's spend a moment looking at things that can be done better.

What Can Be Done Better

As what I have just built is the most basic chat app, there are a few (or, in fact, quite a lot of) things that may be done better. Below, I have listed the things I find worthy of improvement:

Authentication and rejoining support – Right now, everything is based on the sessionId. It is not an optimal approach. It would be better to have some authentication in place and actual rejoining based on user data.
Sending attachments – For now, the chat only supports simple text messages. While texting is the basic function of a chat, users enjoy exchanging images and audio files, too.
Tests – There are no tests for now, but why leave it like this? Tests are always a good idea.
Overflow in offset – Currently, it is a simple int. If we were to track the offset for a very long time, it would overflow sooner or later.

Summary

Et voilà! The Spring Boot WebSocket chat is implemented, and the main task is done. You now have some ideas on what to develop in the next steps. Please keep in mind that this chat case is very simple, and it will require lots of changes and development for any type of commercial project. Anyway, I hope that you learned something new while reading this article. Thank you for your time. These other resources might interest you:

Lock-Free Programming in Java
7 API Integration Patterns
You think you know your SDLC like the back of your carpal-tunnel-riddled hand: You've got your gates, your reviews, your carefully orchestrated dance of code commits and deployment pipelines. But here's a plot twist straight out of your auntie's favorite daytime soap: there's an evil twin lurking in your organization (cue the dramatic organ music). It looks identical to your SDLC — same commits, same repos, the same shiny outputs flowing into production. But this fake-goatee-wearing doppelgänger plays by its own rules, ignoring your security governance and standards. Welcome to the shadow SDLC — the one your team built with AI when you weren't looking: It generates code, dependencies, configs, and even tests at machine speed, but without any of your governance, review processes, or security guardrails. Checkmarx’s August Future of Application Security report, based on a survey of 1,500 CISOs, AppSec managers, and developers worldwide, just pulled back the curtain on this digital twin drama: 34% of developers say more than 60% of their code is now AI-generated. Only 18% of organizations have policies governing AI use in development. 26% of developers admit AI tools are being used without permission. It’s not just about insecure code sneaking into production, but rather about losing ownership of the very processes you’ve worked to streamline. Your “evil twin” SDLC comes with: Unknown provenance → You can’t always trace where AI-generated code or dependencies came from. Inconsistent reliability → AI may generate tests or configs that look fine but fail in production. Invisible vulnerabilities → Flaws that never hit a backlog because they bypass reviews entirely. This isn’t a story about AI being “bad”, but about AI moving faster than your controls — and the risk that your SDLC’s evil twin becomes the one in charge. The rest of this article is about how to prevent that. Specifically: How the shadow SDLC forms (and why it’s more than just code)The unique risks it introduces to security, reliability, and governanceWhat you can do today to take back ownership — without slowing down your team How the Evil Twin SDLC Emerges The evil twin isn’t malicious by design — it’s a byproduct of AI’s infiltration into nearly every stage of development: Code creation – AI writes large portions of your codebase at scale. Dependencies – AI pulls in open-source packages without vetting versions or provenance. Testing – AI generates unit tests or approves changes that may lack rigor. Configs and infra – AI auto-generates Kubernetes YAMLs, Dockerfiles, Terraform templates. Remediation – AI suggests fixes that may patch symptoms while leaving root causes. The result is a pipeline that resembles your own — but lacks the data integrity, reliability, and governance you’ve spent years building. Sure, It’s a Problem. But Is It Really That Bad? You love the velocity that AI provides, but this parallel SDLC compounds risk by its very nature. Unlike human-created debt, AI can replicate insecure patterns across dozens of repos in hours. And the stats from the FOA report speak for themselves: 81% of orgs knowingly ship vulnerable code — often to meet deadlines. 33% of developers admit they “hope vulnerabilities won’t be discovered” before release. 98% of organizations experienced at least one breach from vulnerable code in the past year — up from 91% in 2024 and 78% in 2023. The share of orgs reporting 4+ breaches jumped from 16% in 2024 to 27% in 2025. That surge isn’t random. 
It correlates with the explosive rise of AI use in development. As more teams hand over larger portions of code creation to AI without governance, the result is clear: risk is scaling at machine speed, too. Taking Back Control From the Evil Twin You can’t stop AI from reshaping your SDLC. But you can stop it from running rogue. Here’s how: 1. Establish Robust Governance for AI in Development Whitelist approved AI tools with built-in scanning and keep a lightweight approval workflow so devs don’t default to Shadow AI. Enforce provenance standards like SLSA or SBOMs for AI-generated code. Audit usage & tag AI contributions — use CodeQL to detect AI-generated code patterns and require devs to mark AI commits for transparency. This builds reliability and integrity into the audit trail. 2. Strengthen Supply Chain Oversight AI assistants are now pulling in OSS dependencies you didn’t choose — sometimes outdated, sometimes insecure, sometimes flat-out malicious. While your team already uses hygiene tools like Dependabot or Renovate, they’re only table stakes that don’t provide governance. They won’t tell you if AI just pulled in a transitive package with a critical vulnerability, or if your dependency chain is riddled with license risks. That’s why modern SCA is essential in the AI era. It goes beyond auto-bumping versions to: Generate SBOMs for visibility into everything AI adds to your repos. Analyze transitive dependencies several layers deep. Provide exploitable-path analysis so you prioritize what’s actually risky. Auto-updaters are hygiene. SCA is resilience. 3. Measure and Manage Debt Velocity Track debt velocity — measure how fast vulnerabilities are introduced and fixed across repos. Set sprint-based SLAs — if issues linger, AI will replicate them across projects before you’ve logged the ticket. Flag AI-generated commits for extra review to stop insecure patterns from multiplying. Adopt Agentic AI AppSec Assistants — The FOA report highlights that traditional remediation cycles can’t keep pace with machine-speed risk, making autonomous prevention and real-time remediation a necessity, not a luxury. 4. Foster a Culture of Reliable AI Use Train on AI risks like data poisoning and prompt injection. Make secure AI adoption part of the “definition of done.” Align incentives with delivery, not just speed. Create a reliable feedback loop — encourage devs to challenge governance rules that hurt productivity. Collaboration beats resistance. 5. Build Resilience for Legacy Systems Legacy apps are where your evil twin SDLC hides best. With years of accumulated debt and brittle architectures, AI-generated code can slip in undetected. These systems were built when cyber threats were far less sophisticated, lacking modern security features like multi-factor authentication, advanced encryption, and proper access controls. When AI is bolted onto these antiquated platforms, it doesn't just inherit the existing vulnerabilities, but can rapidly propagate insecure patterns across interconnected systems that were never designed to handle AI-generated code. The result is a cascade effect where a single compromised AI interaction can spread through poorly-secured legacy infrastructure faster than your security team can detect it. Here’s what’s often missed: Manual before automatic: Running full automation on legacy repos without a baseline can drown teams in false positives and noise. Start with manual SBOMs on the most critical apps to establish trust and accuracy, then scale automation. 
Triage by risk, not by age: Not every legacy system deserves equal attention. Prioritize repos with heavy AI use, repeated vulnerability patterns, or high business impact. Hybrid skills are mandatory: Devs need to learn how to validate AI-generated changes in legacy contexts, because AI doesn’t “understand” old frameworks. A dependency bump that looks harmless in 2025 might silently break a 2012-era API. Conclusion: Bring the ‘Evil Twin’ Back into the Family The “evil twin” of your SDLC isn’t going away. It’s already here, writing code, pulling dependencies, and shaping workflows. The question is whether you’ll treat it as an uncontrolled shadow pipeline — or bring it under the same governance and accountability as your human-led one. Because in today’s environment, you don’t just own the SDLC you designed. You also own the one AI is building — whether you control it or not. Interested to learn more about SDLC challenges in 2025 and beyond? More stats and insights are available in the Future of Appsec report mentioned above.
GitHub Copilot agent mode had several enhancements in VS Code as part of its July 2025 release, further bolstering its capabilities. The supported LLMs are getting better iteratively; however, both personal experience and academic research remain divided on future capabilities and gaps. I've had my own learnings exploring agent mode for the last few months, ever since it was released, and had the best possible outcomes with Claude Sonnet Models. After 18 years of building enterprise systems — ranging from integrating siloed COTS to making clouds talk, architecting IoT telemetry data ingestions and eCommerce platforms — I've seen plenty of "revolutionary" tools come and go. I've watched us transition from monoliths to microservices, from on-premises to cloud, from waterfall to agile. I've learned Java 1.4, .NET 9, and multiple flavors of JavaScript. Each transition revealed fundamental flaws in how we think about software construction. The integration of generative AI into software engineering is dominated by pattern matching and reasoning by analogy to past solutions. This approach is philosophically and practically flawed. There's active academic research that surfaces this problem, primarily the "Architectures of Error" framework that systematically differentiates the failure modes of human and AI-generated code. At the moment, I'm neither convinced by Copilot's capability nor have I found reasons to hate it. My focus in this article is more on the human-side errors that Agent Mode helps us recognize. Why This Isn't Just Another AI Tool Copilot's Agent Mode isn't just influencing how we build software — it's revealing why our current approaches are fundamentally flawed. The uncomfortable reality: Much of our architectural complexity exists because we've never had effective ways to encode and enforce design intent. We write architectural decision records that few read. We create coding standards that get violated under pressure. We design patterns that work beautifully when implemented correctly but fail catastrophically when they're not. Agent Mode surfaces this gap between architectural intent and implementation reality in ways we haven't experienced before. The Constraint Problem We've Been Avoiding Here's something I've learned from working on dozens of enterprise projects: Most architectural failures aren't technical failures — they're communication failures. We design a beautiful hexagonal architecture, document it thoroughly, and then watch as business pressure, tight deadlines, and human misunderstanding gradually erode it. By year three, what we have bears little resemblance to what we designed. C# // What we designed public class CustomerService : IDomainService<Customer> { // Clean separation, proper dependencies } // What we often end up with after several iterations public class CustomerService { // Direct database calls mixed with business logic // Scattered validation, unclear responsibilities // Works, but violates every architectural principle } Agent Mode forces us to confront this differently. AI can't read between the lines or make intuitive leaps. If our architectural constraints aren't explicit enough for an AI to follow, they probably aren't explicit enough for humans either. The Evolution from Documentation to Constraints In my experience, the most successful architectural approaches have moved progressively toward making correct usage easy and incorrect usage difficult. Early in my career, I relied heavily on documentation and code reviews. 
Later, I discovered the power of types, interfaces, and frameworks that guide developers toward correct implementations. Now, I'm exploring how to encode architectural knowledge directly into development tooling (and Copilot). C# / Evolution 1: Documentation-based (fragile) // "Please ensure all controllers inherit from BaseApiController" // Evolution 2: Framework-based (better) public abstract class BaseApiController : ControllerBase { // Common functionality, but still optional } // Evolution 3: Constraint-based (AI-compatible) public interface IApiEndpoint<TRequest, TResponse> where TRequest : IValidated where TResponse : IResult { // Impossible to create endpoints that bypass validation } The key insight: Each evolution makes architectural intent more explicit and mechanical. Agent Mode simply pushes us further along this path. We can work around most AI problems like the "AI 90/10 problem" arising from hallucinated APIs, non-existent libraries, context-window myopia, systematic pattern propagation, and model drift. LLM responses are probabilistic by nature, but they can be made deterministic by specifying constraints. Practical Implications Working with Agent Mode on real projects has revealed several practical patterns: 1. Requirement Specification Vague prompts produce (architecturally) inconsistent results. This isn't a limitation — it's feedback about the clarity of our thinking at any role, especially around SDLC, including the architect. We struggled with the same problems with the advent of the outsourcing era, too. SaaS inherits this problem through its extensibility and flexibility. Markdown [BAD] Inviting infinite possibilities: "Create a service for managing customers relationship" [GOOD] More effective: "Create a CustomerService implementing IDomainService<Customer> with validation using FluentValidation and error handling via Result<T> pattern" 2. The Composability Test If AI struggles to combine your architectural patterns correctly, human developers probably do too. They excel at pattern matching but fail at: Systematicity: Applying rules consistently across contextsProductivity: Scaling to larger, more complex compositionsSubstitutivity: Swapping components while maintaining correctnessLocalism: Understanding global vs. local scope implications This also helps to identify the architectural complexity. 3. The Constraint Discovery Process Working with AI has helped me identify implicit assumptions in existing architectures that weren't previously explicit. These discoveries often lead to better human-to-human communication as well. The Skills That Remain Valuable Based on my experience so far, certain architectural skills have become more important now: Domain understanding: AI can generate technically correct code, but understanding business context and constraints remains fundamentally human.Pattern recognition: Identifying when existing patterns apply and when new ones are needed becomes crucial for defining AI constraints.System thinking: Understanding emergent behaviors and system-level properties remains beyond current AI capabilities.Trade-off analysis: Evaluating architectural decisions based on business context, team capabilities, and long-term maintainability. What's Actually Changing The shift isn't as dramatic as "AI replacing architects or developers." 
It's more subtle: From implementation to intent: Less time writing boilerplate, more time clarifying what we actually want the system to do.From review to prevention: Instead of catching architectural violations in code review, we encode constraints that prevent them upfront.From documentation to automation: Architectural knowledge becomes executable rather than just descriptive. These changes feel significant to me, but they're evolutionary rather than revolutionary. Challenges I'm Still Working Through The learning curve: Developing fluency with constraint-driven development requires rethinking established habits.Team adoption: Not everyone is comfortable with AI-assisted development yet, and that's understandable.Tool maturity: Current AI tools are impressive but still have limitations around context understanding and complex reasoning.Validation strategies: Traditional testing approaches may not catch all AI-generated issues, so we're developing new validation patterns. A Measured Prediction Based on what I'm seeing, I expect a gradual shift over the next 3–5 years toward: More explicit architectural constraints in codebasesIncreased automation of pattern enforcementEnhanced focus on domain modeling and business rule specificationEvolution of code review practices to emphasize architectural composition over implementation details This won't happen overnight, and it won't replace fundamental architectural thinking. But it will change how we express and enforce architectural decisions. What I'm Experimenting With Currently, I'm exploring: 1. Machine-readable architecture definitions that can guide both AI and human developers. JSON { "architecture": { "layers": ["Api", "Application", "Domain", "Infrastructure"], "dependencies": { "Api": ["Application"], "Application": ["Domain"], "Infrastructure": ["Domain"] }, "patterns": { "cqrs": { "commands": "Application/Commands", "queries": "Application/Queries", "handlers": "required" } } } } 2. Architectural testing frameworks that validate system composition automatically. C# [Test] public void Architecture_Should_Enforce_Layer_Dependencies() { var result = Types.InCurrentDomain() .That().ResideInNamespace("Api") .ShouldNot().HaveDependencyOn("Infrastructure") .GetResult(); Assert.That(result.IsSuccessful, result.FailingTypes); } [Test] public void AI_Generated_Services_Should_Follow_Naming_Conventions() { var services = Types.InCurrentDomain() .That().AreClasses() .And().ImplementInterface(typeof(IDomainService)) .Should().HaveNameEndingWith("Service") .GetResult(); Assert.That(services.IsSuccessful); } 3. Constraint libraries that make common patterns easy to apply correctly, starting with domain primitives. C# ```csharp // Instead of generic controllers, define domain-specific primitives public abstract class DomainApiController<TEntity, TDto> : ControllerBase where TEntity : class, IEntity where TDto : class, IDto { // Constrained template that AI can safely compose } // Service registration primitive public static class ServiceCollectionExtensions { public static IServiceCollection AddDomainService<TService, TImplementation>( this IServiceCollection services) where TService : class where TImplementation : class, TService { // Validated, standard registration pattern return services.AddScoped<TService, TImplementation>(); } } 4. Documentation approaches that work well with AI-assisted development. An example is documenting architecture in the Arc42 template in Markdown, diagrams in Mermaid embedded in Markdown. 
Early results are promising, but there's still much to learn and explore. Looking Forward After 18 years in this field, I've learned to be both optimistic about new possibilities and realistic about the pace of change. VS Code Agent Mode represents an interesting step forward in human-AI collaboration for software development. It's not a silver bullet, but it is a useful tool that can help us build better systems — if we approach it thoughtfully. The architectures that thrive in an AI-assisted world won't necessarily be the most sophisticated ones. They'll be the ones that most clearly encode human insight in ways that both AI and human developers can understand and extend. That's a worthy goal, regardless of the tools we use to achieve it. Final Thoughts The most valuable architectural skill has always been clarity of thought about complex systems. AI tools like Agent Mode don't change this fundamental requirement — they just give us new ways to express and validate that clarity. As we navigate this transition, the architects who succeed will be those who remain focused on the essential questions: What are we trying to build? Why does it matter? How can we make success more likely than failure? The tools continue to evolve, but these questions remain constant. I'm curious about your experiences with AI-assisted development. What patterns are you seeing? What challenges are you facing? The best insights come from comparing experiences across different contexts and domains.
It is not uncommon for back-end software to start up with a configuration file. These are generally YAML or JSON files, which are loaded at startup and used to set up the initial configuration of the system. Values included here may affect business logic or infrastructure. Let us create a new service called DumplingSale (because I love dumplings, or as we call them, momos). This service manages the sales of dumplings. As an example, take a look at the YAML file used to start it. YAML # Production configuration for DumplingSale Java web application # prod.yaml redis: host: redis-prod.example.com port: 6379 password: ${REDIS_PASSWORD} timeout: 3000ms logging: level: com.example.dumplingsale: INFO org.springframework.web: WARN file: name: /var/log/dumpling-sale/application.log max-size: 100MB max-history: 30 # Section we might want to change dynamically dumpling-sale-config: max-orders-per-minute: 100 order-timeout-minutes: 30 enable-analytics: true payment-provider-config: active-provider: "your-payment-provider.com" transaction-timeout-seconds: 10 retry-attempts: 3 default-currency: "USD" api-version: "2024-05-26" enable-3ds-secure: true webhook-verification-enabled: true Let's say we wanted to change the dumpling-sale-config section dynamically. This could involve changing the order timeout to 15 minutes in times of increased sales, or reducing the maximum orders per minute if the kitchen is backed up. In a default static configuration system, the configuration has to be changed by making a code change and redeploying it to our servers, which would likely involve a restart. If code and config are separated, you could keep the code the same, but the servers would still need to be restarted. In a dynamic configuration system, however, we can change the config in one place and have it propagate to all our servers. Configuration Use Cases Allowlists and blocklists: Configs allow you to manage allowlists or blocklists and dynamically update them as your service runs.Performance tuning: You can change the number of threads, timeouts, workers, endpoints, etc., without having to restart your application.Flags: Think of any flags you pass to your application; you could change them dynamically. In this article, we will follow the above DumplingSale example and modify the payment provider and dumpling sale configs dynamically. Types of Config Delivery There are broadly two types of config delivery: push or pull. Push config delivery: The config system pushes the configuration out to all applications. Pull config delivery: The config system waits and responds with the configuration when polled by your application. In this example, we will be using a pull delivery system. Data Structures We will have a parent data structure for the dynamic configuration, with child data structures for each config type you wish to support. I will be using Java to explain the example here, but feel free to use a language of your choice. 
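For reference, the parent container that the child config types will plug into can be sketched as follows; the same AppConfigData class appears again in the full listing later in this article.

Java

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

// Parent structure holding one child object per dynamic config type.
// The no-arg constructor is what allows SnakeYAML to instantiate it while parsing.
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class AppConfigData {
    private DumplingSaleConfig dumplingSaleConfig;
    private PaymentProviderConfig paymentProviderConfig;
}

The child config types it references are defined next.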
Java import lombok.Data; import lombok.Builder; import lombok.NoArgsConstructor; import lombok.AllArgsConstructor; // Lombok annotations; the no-arg constructor is required so SnakeYAML can instantiate these classes @Data @Builder @NoArgsConstructor @AllArgsConstructor public class DumplingSaleConfig { private int maxOrdersPerMinute; private int orderTimeoutMinutes; private boolean enableAnalytics; } @Data @Builder @NoArgsConstructor @AllArgsConstructor public class PaymentProviderConfig { private String activeProvider; private int transactionTimeoutSeconds; private int retryAttempts; private String defaultCurrency; private String apiVersion; private boolean enable3dsSecure; private boolean webhookVerificationEnabled; } Creating the Cache Similarly, you will need to create a cache to fetch these configs. We are using the Guava Cache. Java public class AppConfigManager { private final LoadingCache<String, AppConfigData> appConfigCache; private final Yaml yaml; private final AWSAppConfig awsAppConfig; public AppConfigManager(AWSAppConfig awsAppConfig) { this.yaml = new Yaml(); this.awsAppConfig = awsAppConfig; this.appConfigCache = CacheBuilder.newBuilder() .refreshAfterWrite(5, TimeUnit.MINUTES) .build(new CacheLoader<String, AppConfigData>() { @Override public AppConfigData load(String key) throws Exception { return fetchConfigFromAppConfig(); } }); } Loading Contents for the Cache As you can see above, we created a cache that returns data of the type AppConfigData. However, the cache needs to fetch this data from somewhere as well, right? So, we need to program a data source that loads the dynamic configuration data. Here are your options: Remote file: A remote file pulled by the servers. Could be stored in AWS S3, GCS, or any other object or file storage system you may have access to. Pros: Fast and easy deploymentYour object/file system may offer version history and audit logs.Cons: Not great tracking of versions, comparison of config across versions.A remote database Pros: All databases come with a great set of libraries and tools to integrate easily.Cons: Not great tracking of versions, comparison of config across versions.Depending on the database, unlikely to have auditing.Unless a custom version-based solution is created, no versioning.[Recommended] Cloud config management system: Such as AWS AppConfig, Azure App Configuration, or GCP Firebase Remote Config. Pros: Fast, standardized rollout mechanisms: Can use both push and pull methods with slow/fast rollout across your service.Deploy changes across a variety of targets together, including compute, containers, docs, mobile applications, and serverless applications.Cons: You need to read the rest of this article to know how to use these. Creating Configuration Using AWS AppConfig Let's use AWS AppConfig for this example. All of the cloud solutions are capable services, and we only need one of them to learn how to create this config. The above is a diagram from AWS AppConfig, which describes the various steps you can take to ensure config deployments are safe and stable. 1. Choose config type: AWS allows you to use feature flags or freeform configs. We will choose freeform configs in this example to simplify config creation. 2. Choose a config name: my-config (I am keeping it simple.) 3. Choose config source (note that the keys are camelCase so they match the Java property names and SnakeYAML can map them onto the data classes): YAML dumplingSaleConfig: maxOrdersPerMinute: 100 orderTimeoutMinutes: 30 enableAnalytics: true paymentProviderConfig: activeProvider: "your-payment-provider" transactionTimeoutSeconds: 10 retryAttempts: 3 defaultCurrency: "USD" apiVersion: "2024-05-26" enable3dsSecure: true Simply save this to an application (to keep it simple here: my-application). 
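Before saving and deploying, it can be worth sanity-checking locally that the hosted YAML actually maps onto the data classes, since SnakeYAML matches YAML keys to Java property names. A minimal sketch of such a check, assuming SnakeYAML is on the classpath together with the Lombok-annotated AppConfigData, DumplingSaleConfig, and PaymentProviderConfig classes from this article:

Java

import org.yaml.snakeyaml.Yaml;

public class ConfigMappingCheck {
    public static void main(String[] args) {
        // Same content as the hosted freeform configuration
        String hostedYaml =
            "dumplingSaleConfig:\n" +
            "  maxOrdersPerMinute: 100\n" +
            "  orderTimeoutMinutes: 30\n" +
            "  enableAnalytics: true\n" +
            "paymentProviderConfig:\n" +
            "  activeProvider: your-payment-provider\n" +
            "  transactionTimeoutSeconds: 10\n" +
            "  retryAttempts: 3\n" +
            "  defaultCurrency: USD\n" +
            "  apiVersion: \"2024-05-26\"\n" +
            "  enable3dsSecure: true\n";

        // loadAs maps the YAML keys onto the JavaBean properties of AppConfigData
        AppConfigData parsed = new Yaml().loadAs(hostedYaml, AppConfigData.class);
        System.out.println("Max orders per minute: "
            + parsed.getDumplingSaleConfig().getMaxOrdersPerMinute());
    }
}

If a key and a property name drift apart, the parse fails here rather than later inside the cache loader.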
Now, let's save and deploy: As you can see above, I made the following choices: Environment: I created an environment called prod. Feel free to create as many as you need.Hosted config version: Right now, we will have only one version. In the future, you can choose to change the ‘latest’ version and deploy whichever version you would like.Deployment strategy: This is crucial. For simplicity, I have chosen ‘All at once.’ However, that is not always the best strategy, as you may want to roll out slowly, observe how your service is performing, and roll back if necessary. You can read about other strategies here. Once your deployment is complete, the configuration will be available for your application to fetch. Fetching Configuration Using AWS AppConfig Java private AppConfigData fetchConfigFromAppConfig() throws Exception { // 1. Start a configuration session to get a token StartConfigurationSessionRequest sessionRequest = new StartConfigurationSessionRequest() .withApplicationIdentifier("my-application") .withConfigurationProfileIdentifier("my-config") .withEnvironmentIdentifier("prod") .withRequiredMinimumPollIntervalInSeconds(30); // Recommended to set a minimum poll interval StartConfigurationSessionResult sessionResult = awsAppConfig.startConfigurationSession(sessionRequest); this.configurationToken = sessionResult.getInitialConfigurationToken(); // 2. Get the latest configuration using the token GetLatestConfigurationRequest configRequest = new GetLatestConfigurationRequest() .withConfigurationToken(configurationToken); GetLatestConfigurationResult configResult = awsAppConfig.getLatestConfiguration(configRequest); ByteBuffer configurationContent = configResult.getConfiguration(); if (configurationContent == null) { throw new IOException("No configuration content received from AWS AppConfig."); } // 3. Decode the ByteBuffer into a String String fatYaml = StandardCharsets.UTF_8.decode(configurationContent).toString(); // 4. 
Parse YAML into the AppConfigData class return yaml.loadAs(fatYaml, AppConfigData.class); } Bringing It All Together Java import lombok.AllArgsConstructor; import lombok.Builder; import lombok.Data; import lombok.NoArgsConstructor; import com.google.common.cache.CacheBuilder; import com.google.common.cache.CacheLoader; import com.google.common.cache.LoadingCache; import org.yaml.snakeyaml.Yaml; import com.amazonaws.services.appconfig.AWSAppConfig; import com.amazonaws.services.appconfig.model.GetLatestConfigurationRequest; import com.amazonaws.services.appconfig.model.StartConfigurationSessionRequest; import com.amazonaws.services.appconfig.model.StartConfigurationSessionResult; import com.amazonaws.services.appconfig.model.GetLatestConfigurationResult; import java.nio.ByteBuffer; import java.util.concurrent.TimeUnit; import java.io.IOException; import java.nio.charset.StandardCharsets; @Data @Builder @NoArgsConstructor @AllArgsConstructor public class PaymentProviderConfig { private String activeProvider; private int transactionTimeoutSeconds; private int retryAttempts; private String defaultCurrency; private String apiVersion; private boolean enable3dsSecure; private boolean webhookVerificationEnabled; } @Data @Builder @NoArgsConstructor @AllArgsConstructor public class DumplingSaleConfig { private int maxOrdersPerMinute; private int orderTimeoutMinutes; private boolean enableAnalytics; } @Data @Builder @NoArgsConstructor @AllArgsConstructor public class AppConfigData { private DumplingSaleConfig dumplingSaleConfig; private PaymentProviderConfig paymentProviderConfig; } public class AppConfigManager { private final LoadingCache<String, AppConfigData> appConfigCache; private final Yaml yaml; private final AWSAppConfig awsAppConfig; private String configurationToken; // To store the session token for subsequent fetches public AppConfigManager(AWSAppConfig awsAppConfig) { this.yaml = new Yaml(); this.awsAppConfig = awsAppConfig; this.appConfigCache = CacheBuilder.newBuilder() .refreshAfterWrite(5, TimeUnit.MINUTES) .build(new CacheLoader<String, AppConfigData>() { @Override public AppConfigData load(String key) throws Exception { return fetchConfigFromAppConfig(); } }); } private AppConfigData fetchConfigFromAppConfig() throws Exception { // 1. Start a configuration session to get a token StartConfigurationSessionRequest sessionRequest = new StartConfigurationSessionRequest() .withApplicationIdentifier("my-application") .withConfigurationProfileIdentifier("my-config") .withEnvironmentIdentifier("prod") .withRequiredMinimumPollIntervalInSeconds(30); // Recommended to set a minimum poll interval StartConfigurationSessionResult sessionResult = awsAppConfig.startConfigurationSession(sessionRequest); this.configurationToken = sessionResult.getInitialConfigurationToken(); // 2. Get the latest configuration using the token GetLatestConfigurationRequest configRequest = new GetLatestConfigurationRequest() .withConfigurationToken(configurationToken); GetLatestConfigurationResult configResult = awsAppConfig.getLatestConfiguration(configRequest); ByteBuffer configurationContent = configResult.getConfiguration(); if (configurationContent == null) { throw new IOException("No configuration content received from AWS AppConfig."); } // 3. Decode the ByteBuffer into a String String fatYaml = StandardCharsets.UTF_8.decode(configurationContent).toString(); // 4. 
Parse YAML into the AppConfigData class return yaml.loadAs(fatYaml, AppConfigData.class); } public AppConfigData getAppConfig() { try { return appConfigCache.get("appConfig"); } catch (Exception e) { System.err.println("Error loading app config: " + e.getMessage()); return null; } } } Using the Config Now, let's say you need to check whether the orders-per-minute limit has been breached and make a decision based on it. You can simply use the config manager to fetch the current limit. Java import lombok.AllArgsConstructor; @AllArgsConstructor public class OrderRateLimiter { private final AppConfigManager appConfigManager; public boolean isOrderLimitExceeded(int currentOrdersThisMinute) { AppConfigData appConfig = appConfigManager.getAppConfig(); if (appConfig == null || appConfig.getDumplingSaleConfig() == null) { System.err.println("DumplingSaleConfig not available from AppConfigManager."); return false; } DumplingSaleConfig config = appConfig.getDumplingSaleConfig(); return currentOrdersThisMinute > config.getMaxOrdersPerMinute(); } } As you can see above, you can simply fetch the config you want, without worrying about where it is coming from (the cache or AWS AppConfig), and make decisions based on it. Key Takeaways Using the LoadingCache allows for: Faster retrieval.Thread safety, since the cache handles its own refresh logic, and any number of calls to the cache can be easily handled.Hands-off management of value refresh.Low cost even with very high retrieval rates: as an example, if you have 100 servers running the application, each needing the config 500 times per second, you will still be billed for only 100 * 12 = 1,200 requests per hour, since each server refreshes its cache every 5 minutes, as opposed to 100 * 500 * 3600 = 180 million requests per hour without a cache.Low network utilization, since requests are served locally.Higher availability in case the config service is down, since the last successfully loaded values continue to be served. While using cloud-based config management systems allows for: Easier management of the config lifecycle.Better rollout strategies.Centralized management. Now, you are ready to create your own distributed cloud-based dynamic configurations.
Keeping track of AWS spend is very important, especially since it's so easy to create resources: you might forget to turn off an EC2 instance or container you started, or forget to remove the CDK stack for a specific experiment. Costs can creep up fast if you don't put guardrails in place. Recently, I had to set up budgets across multiple AWS accounts for my team. Along the way, I learned a few gotchas (especially around SNS and KMS policies) that weren't immediately clear to me as I started out writing AWS CDK code. In this post, we'll go through how to: Create AWS Budgets with AWS CDKSend notifications via email and SNSHandle cases like encrypted topics and configuring resource policies If you're setting up AWS Budgets for the first time, I hope this post will save you some trial and error. What Are AWS Budgets? AWS Budgets is part of AWS Billing and Cost Management. It lets you set guardrails for spend and usage limits. You can define a budget around cost, usage, or even commitment plans (like Reserved Instances and Savings Plans) and trigger alerts when you cross a threshold. You can think of Budgets as your planned spend tracker. Budgets are great for: Alerting when costs hit predefined thresholds (e.g., 80% of your budgeted spend)Driving team accountability by tying alerts to product or account ownersEnforcing a cap on monthly spend by triggering an action, such as shutting down compute (EC2), if you go over budget (be careful with this) Keep in mind that budgets and their notifications are not instant. AWS billing data is processed multiple times a day, so you might only trigger your budget alert a couple of hours after you've passed your threshold. This is clearly stated in the AWS documentation as: AWS billing data, which Budgets uses to monitor resources, is updated at least once per day. Keep in mind that budget information and associated alerts are updated and sent according to this data refresh cadence. Defining Budgets With AWS CDK You can create different kinds of budgets, depending on your requirements. Some examples are: Fixed budgets: Set one amount to monitor every budget period.Planned budgets: Set different amounts to monitor each budget period.Auto-adjusting budgets: Set a budget amount to be adjusted automatically based on the spending pattern over a time range that you specify. We'll start with a simple example of how you can create a budget in the CDK. We'll go for a fixed budget of $100. The AWS CDK currently only has Level 1 constructs available for budgets, which means that the classes in the CDK are a one-to-one mapping to the CloudFormation resources. Because of this, you will have to explicitly define all required properties (constructs, IAM policies, resource policies, etc.), which otherwise could be taken care of by a CDK L2 construct. It also means your CDK code will be a bit more verbose. We'll start by using the CfnBudget construct. TypeScript new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', { budget: { budgetType: 'COST', budgetLimit: {amount: 100, unit: 'USD'}, budgetName: 'Monthly Costs Budget', timeUnit: 'MONTHLY' } }); In the above example, we've created a budget with a limit of $100 per month. A budget alone isn't very useful. You'd still have to check the AWS console manually to see what your spend is compared to your budget. What we really want is to get notified when our actual spend reaches our budget or the forecasted spend crosses our threshold, so let's add a notification and a subscriber. 
TypeScript new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', { budget: { budgetType: 'COST', budgetLimit: {amount: 100, unit: 'USD'}, budgetName: 'Monthly Costs Budget', timeUnit: 'MONTHLY' }, notificationsWithSubscribers: [{ notification: { comparisonOperator: 'GREATER_THAN', notificationType: 'FORECASTED', threshold: 100, thresholdType: 'PERCENTAGE' }, subscribers: [{ subscriptionType: 'EMAIL', address: '<your-email-address>' }] }] }); Based on the notification settings, interested parties are notified when the spend is forecasted to exceed 100% of our defined budget limit. You can put a notification on forecasted or actual percentages. When that happens, an email is sent to the designated email address. Subscribers, at the time of writing, can be either email recipients or a Simple Notification Service (SNS) topic. In the above code example, we use email subscribers for which you can add up to 10 recipients. Depending on your team or organization, it might be beneficial to switch to using an SNS topic. The advantage of using an SNS topic over a set of email subscribers is that you can add different kinds of subscribers (email, chat, custom lambda functions) to your SNS topic. With an SNS topic, you have a single place to configure subscribers, and if you change your mind, you can do so in one place instead of updating all budgets. Using an SNS Topic also allows you to push budget notifications to, for instance, a chat client like MS Teams or Slack. In this case, we will make use of SNS in combination with email subscribers. Let’s start by defining an SNS topic with the AWS CDK. TypeScript // Create a topic for email notifications let topic = new Topic(this, 'budget-notifications-topic', { topicName: 'budget-notifications-topic' }); Now, let’s add an email subscriber, as this is the simplest way to receive budget notifications. TypeScript // Add email subscription topic.addSubscription( new EmailSubscription("your-email-address")); This looks pretty straightforward, and you might think you’re done, but there is one important step to take next, which I initially forgot. The AWS budgets service will need to be granted permissions to publish messages to the topic. To be able to do this, we will need to add a resource policy to the topic that allows the budgets service to call the SNS:Publish action for our topic. TypeScript // Add resource policy to allow the budgets service to publish to the SNS topic topic.addToResourcePolicy(new PolicyStatement({ actions:["SNS:Publish"], effect: Effect.ALLOW, principals: [new ServicePrincipal("budgets.amazonaws.com")], resources: [topic.topicArn], conditions: { ArnEquals: { 'aws:SourceArn': `arn:aws:budgets::${Stack.of(this).account}:*`, }, StringEquals: { 'aws:SourceAccount': Stack.of(this).account, }, }, })) Now, let’s assign the SNS topic as a subscriber in our CDK code. TypeScript // Define a fixed budget with SNS as subscriber new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', { budget: { budgetType: 'COST', budgetLimit: {amount: 100, unit: 'USD'}, budgetName: 'Monthly Costs Budget', timeUnit: 'MONTHLY' }, notificationsWithSubscribers: [{ notification: { comparisonOperator: 'GREATER_THAN', notificationType: 'FORECASTED', threshold: 100, thresholdType: 'PERCENTAGE' }, subscribers: [{ subscriptionType: 'SNS', address: topic.topicArn }] }] }); Working With Encrypted Topics If you have an SNS topic with encryption enabled (via KMS), you will need to make sure that the corresponding service has access to the KMS key. 
If you don’t, you will not get any messages, and as far as I could tell, you will see no errors (at least I could find none in CloudTrail). I actually wasted a couple of hours trying to figure this part out. I should have read the documentation, as it is explicitly stated to do so. I guess I should start with the docs instead of diving right into the AWS CDK code. TypeScript // Create KMS key used for encryption let key = new Key(this,'sns-kms-key', { alias: 'sns-kms-key', enabled: true, description: 'Key used for SNS topic encryption' }); // Create topic and assign the KMS key let topic = new Topic(this, 'budget-notifications-topic', { topicName: 'budget-notifications-topic', masterKey: key }); Now, let’s add the resource policy to the key and try to trim down the permissions as much as possible. TypeScript // Allow access from budgets service key.addToResourcePolicy(new PolicyStatement({ effect: Effect.ALLOW, actions: ["kms:GenerateDataKey*","kms:Decrypt"], principals: [new ServicePrincipal("budgets.amazonaws.com")], resources: ["*"], conditions: { StringEquals: { 'aws:SourceAccount': Stack.of(this).account, }, ArnLike: { "aws:SourceArn": "arn:aws:budgets::" + Stack.of(this).account +":*" } } })); Putting It All Together If you’ve configured everything correctly and deployed your stack to your target account, you should be good to go. Once you cross your threshold, you should be notified by email that your budget is exceeding one of your thresholds (depending on the threshold set). Summary In this post, we explored how to create AWS Budgets with AWS CDK and send notifications through email or SNS. Along the way, we covered some important topics like: Budgets alone aren’t useful until you add notifications.SNS topics need a resource policy so the Budgets service can publish.Encrypted topics require KMS permissions for the Budgets service. With these pieces in place, you’ll have a setup that alerts your team when costs exceed thresholds via email, chat, or custom integrations. A fully working CDK application with the code mentioned in this blog post can be found in the following GitHub repo.
Building on what we started in an earlier article, here we're going to learn how to extend our platform and create a platform abstraction for provisioning an AWS EKS cluster. EKS is AWS's managed Kubernetes offering. Quick Refresher Crossplane is a Kubernetes CRD-based add-on that abstracts cloud implementations and lets us manage infrastructure as code. Prerequisites Set up Kubernetes (for example, the cluster bundled with Docker Desktop).Follow the Crossplane installation based on the previous article.Follow the provider configuration based on the previous article.Apply all the network YAMLs from the previous article (including the updated network composition discussed later). This will create the necessary network resources for the EKS cluster. Some Plumbing When creating an EKS cluster, AWS needs to: Spin up the control plane (managed by AWS)Attach security groups Configure networking (ENIs, etc.)Access the VPC and subnetsManage API endpointsInteract with other AWS services (e.g., CloudWatch for logging, Route53) To do this securely, AWS requires an IAM role that it can assume. We create that role here and reference it during cluster creation; details are provided below. Without this role, you'll get errors like "access denied" when creating the cluster. Steps to Create the AWS IAM Role Log in to the AWS Console and go to the IAM console.In the left sidebar, click Roles.Click Create Role.Choose AWS service as the trusted entity type.Select the EKS use case, and choose EKS Cluster.Attach the following policies: AmazonEKSClusterPolicyAmazonEKSServicePolicyAmazonEC2FullAccessAmazonEKSWorkerNodePolicyAmazonEC2ContainerRegistryReadOnlyAmazonEKS_CNI_PolicyProvide the name eks-crossplane-cluster and optionally add tags. Since we'll also create NodeGroups, which require additional permissions, for simplicity I'm granting the Crossplane user (created in the previous article) permission to PassRole the Crossplane cluster role. This permission allows the user to tell AWS services (EKS) to assume the Crossplane cluster role on its behalf. Basically, this user can say, "Hey, EKS service, create a node group and use this role when doing it." To accomplish this, add the following inline policy to the Crossplane user: JSON { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": "iam:PassRole", "Resource": "arn:aws:iam::914797696655:role/eks-crossplane-cluster" } ] } Note: Typically, to follow the principle of least privilege, you should separate roles with distinct policies: Control plane role with EKS admin permissionsNode role with permissions for node group creation. In the previous article, I had created only one subnet in the network composition, but the EKS control plane requires at least two AZs, with one subnet per AZ. You should modify the network composition from the previous article to add another subnet. To do so, just add the following to the network composition YAML, and don't forget to apply the composition and claim to re-create the network. 
YAML - name: subnet-b base: apiVersion: ec2.aws.upbound.io/v1beta1 kind: Subnet spec: forProvider: cidrBlock: 10.0.2.0/24 availabilityZone: us-east-1b mapPublicIpOnLaunch: true region: us-east-1 providerConfigRef: name: default patches: - fromFieldPath: status.vpcId toFieldPath: spec.forProvider.vpcId type: FromCompositeFieldPath - fromFieldPath: spec.claimRef.name toFieldPath: spec.forProvider.tags.Name type: FromCompositeFieldPath transforms: - type: string string: fmt: "%s-subnet-b" - fromFieldPath: status.atProvider.id toFieldPath: status.subnetIds[1] type: ToCompositeFieldPath We will also need a provider to support EKS resource creation, to create the necessary provider, save the following content into .yaml file. YAML apiVersion: pkg.crossplane.io/v1 kind: Provider metadata: name: provider-aws spec: package: xpkg.upbound.io/crossplane-contrib/provider-aws:v0.54.2 controllerConfigRef: name: default And apply using: YAML kubectl apply -f <your-file-name>.yaml Crossplane Composite Resource Definition (XRD) Below, we’re going to build a Composite Resource Definition for the EKS cluster. Before diving in, one thing to note: If you’ve already created the network resources using the previous article, you may have noticed that the network composition includes a field that places the subnet ID into the composition resource’s status, specifically under status.subnetIds[0]. This value comes from the cloud's Subnet resource and is needed by other XCluster compositions. By placing it in the status field, the network composition makes it possible for other Crossplane compositions to reference and use it. Similar to what we did for network creation in the previous article, we’re going to create a Crossplane XRD, a Crossplane Composition, and finally a Claim that will result in the creation of an EKS cluster. At the end, I’ve included a table that serves as an analogy to help illustrate the relationship between the Composite Resource Definition (XRD), Composite Resource (XR), Composition, and Claim. To create an EKS XRD, save the following content into .yaml file: YAML apiVersion: apiextensions.crossplane.io/v1 kind: CompositeResourceDefinition metadata: name: xclusters.aws.platformref.crossplane.io spec: group: aws.platformref.crossplane.io names: kind: XCluster plural: xclusters claimNames: kind: Cluster plural: clusters versions: - name: v1alpha1 served: true referenceable: true schema: openAPIV3Schema: type: object required: - spec properties: spec: type: object required: - parameters properties: parameters: type: object required: - region - roleArn - networkRef properties: region: type: string description: AWS region to deploy the EKS cluster in. roleArn: type: string description: IAM role ARN for the EKS control plane. networkRef: type: object description: Reference to a pre-created XNetwork. required: - name properties: name: type: string status: type: object properties: network: type: object required: - subnetIds properties: subnetIds: type: array items: type: string And apply using: YAML kubectl apply -f <your-file-name>.yaml Crossplane Composition Composition is the implementation; it tells Crossplane how to build all the underlying resources (Control Plane, NodeGroup). 
To create an EKS composition, save the below content into a .yaml file: YAML apiVersion: apiextensions.crossplane.io/v1 kind: Composition metadata: name: cluster.aws.platformref.crossplane.io spec: compositeTypeRef: apiVersion: aws.platformref.crossplane.io/v1alpha1 kind: XCluster resources: - name: network base: apiVersion: aws.platformref.crossplane.io/v1alpha1 kind: XNetwork patches: - type: FromCompositeFieldPath fromFieldPath: spec.parameters.networkRef.name toFieldPath: metadata.name - type: ToCompositeFieldPath fromFieldPath: status.subnetIds toFieldPath: status.network.subnetIds - type: ToCompositeFieldPath fromFieldPath: status.subnetIds[0] toFieldPath: status.network.subnetIds[0] readinessChecks: - type: None - name: eks base: apiVersion: eks.aws.crossplane.io/v1beta1 kind: Cluster spec: forProvider: region: us-east-1 roleArn: "" resourcesVpcConfig: subnetIds: [] endpointPrivateAccess: true endpointPublicAccess: true providerConfigRef: name: default patches: - type: FromCompositeFieldPath fromFieldPath: spec.parameters.region toFieldPath: spec.forProvider.region - type: FromCompositeFieldPath fromFieldPath: spec.parameters.roleArn toFieldPath: spec.forProvider.roleArn - type: FromCompositeFieldPath fromFieldPath: status.network.subnetIds toFieldPath: spec.forProvider.resourcesVpcConfig.subnetIds - name: nodegroup base: apiVersion: eks.aws.crossplane.io/v1alpha1 kind: NodeGroup spec: forProvider: region: us-east-1 clusterNameSelector: matchControllerRef: true nodeRole: "" subnets: [] scalingConfig: desiredSize: 2 maxSize: 3 minSize: 1 instanceTypes: - t3.medium amiType: AL2_x86_64 diskSize: 20 providerConfigRef: name: default patches: - type: FromCompositeFieldPath fromFieldPath: spec.parameters.region toFieldPath: spec.forProvider.region - type: FromCompositeFieldPath fromFieldPath: spec.parameters.roleArn toFieldPath: spec.forProvider.nodeRole - type: FromCompositeFieldPath fromFieldPath: status.network.subnetIds toFieldPath: spec.forProvider.subnets And apply using: YAML kubectl apply -f <your-file-name>.yaml Claim I'm taking the liberty to explain the claim in more detail here. First, it's important to note that a claim is an entirely optional entity in Crossplane. It is essentially a Kubernetes Custom Resource Definition (CRD) that the platform team can expose to application developers as a self-service interface for requesting infrastructure, such as an EKS cluster. Think of it as an API payload: a lightweight, developer-friendly abstraction layer. In the earlier CompositeResourceDefinition (XRD), we created the Kind XCluster. But by using a claim, application developers can interact with a much simpler and more intuitive CRD like Cluster instead of XCluster. For simplicity, I have referenced the XNetwork composition name directly instead of the Network claim resource name. Crossplane creates the XNetwork resource and appends random characters to the claim name when naming it. As an additional step, you'll need to retrieve the actual XNetwork name from the Kubernetes API and use it here. While there are ways to automate this process, I’m keeping it simple here, let me know via comments if there are interest and I write more about how to automate that. To create a claim, save the content below into a .yaml file. Please note the roleArn being referenced in this, that is the role I had mentioned earlier, AWS uses it to create other resources. 
YAML apiVersion: aws.platformref.crossplane.io/v1alpha1 kind: Cluster metadata: name: demo-cluster namespace: default spec: parameters: region: us-east-1 roleArn: arn:aws:iam::914797696655:role/eks-crossplane-cluster networkRef: name: crossplane-demo-network-jpv49 # <important> this is how the EKS composition refers to the network created earlier; use the actual XNetwork name, including its random suffix (here "jpv49"), not the claim name And apply using: YAML kubectl apply -f <your-file-name>.yaml After this, you should see an EKS cluster in your AWS console; make sure you are looking in the correct region. If there are any issues, look for error logs in the composite and managed resources. You can view them using: YAML # to get XCluster details; look for reconciliation errors or messages, and you will also find references to the managed resources k get XCluster demo-cluster -o yaml # to check the status of a managed resource, for example: k get Cluster.eks.aws.crossplane.io As I mentioned before, below is a table where I attempt to provide another analogy for the various components used in Crossplane: Component / Analogy XRD: The interface or blueprint for a product; defines what knobs users can turn. XR (XCluster): A specific product instance with user-provided values. Composition: The function that implements all the details of the product. Claim: A customer-friendly interface for ordering the product, or an API payload. Patch I also want to explain an important concept we've used in our Composition: patching. You may have noticed the patches field in the .yaml blocks. In Crossplane, a composite resource is the high-level abstraction we define — in our case, that's XCluster. Managed resources are the actual cloud resources Crossplane provisions on our behalf — for example, the AWS EKS Cluster and NodeGroup. A patch in a Crossplane Composition is a way to copy or transform data from/to the composite resource (XCluster) to/from the managed resources (Cluster, NodeGroup, etc.). Patching allows us to map values like region, roleArn, and names from the high-level composite to the actual underlying infrastructure — ensuring that developer inputs (or platform-defined parameters) flow all the way down to the cloud resources. Conclusion Using Crossplane, you can build powerful abstractions that shield developers from the complexities of infrastructure, allowing them to focus on writing application code. These abstractions can also be made cloud-agnostic, enabling benefits like portability, cost optimization, resilience and redundancy, and greater standardization.
Ideas of creating a distributed computing cluster (DCC) for database management systems (DBMS) have been striking me for quite a long time. If simplified, the DCC software makes it possible to combine many servers into one super server (cluster), performing an even balancing of all queries between individual servers. In this case, everything will appear for the application running on the DCC as if it was running with one server and one database (DB). It will not be dispersed databases on distributed servers, but work as one virtual one. All network protocols, replication exchanges, and proxy redirections will be concealed inside the DCC. At the same time, all resources of distributed servers, in particular RAM and CPU time, will be utilized evenly and in an efficient fashion. For example, in a cloud data processing center (DPC), it is possible to take one physical super server and divide it into a number of virtual DBMS servers. But the reverse procedure was not possible until now, i.e., it is not possible to take a number of physical servers and merge them into a single virtual DBMS super server. In some specified sense, DCC is a technology that makes it possible to merge physical servers into one virtual DBMS super server. I will take the liberty to make another comparison: DCC is just the same as the coherent nonuniform memory access (NUMA) technology except that it is used to merge SQL servers. But unlike NUMA, in DCC, software handles the synchronization (coherence) of the data and partly of the RAM, not the controller. For the sake of clarity, below is a diagram of the well-known connection of the client application to the DBMS server, and the DCC diagram immediately below that. Both diagrams are simplified, just for easy understanding. The idea behind the cluster is a decentralized model. In the figure above there is only one proxy server, but in general there can be more than one. This solution will result in the possibility to increase the DBMS scalability by a substantial margin relative to a typical single-server solution with the most powerful server at the moment. No such solution currently exists, or, at least, no one in my vast professional community is aware of such a solution. After five years of research, I worked out the logical architecture and interaction protocols in detail and, with the assistance of a handful of development personnel, created a working prototype that is undergoing load tests on a popular 1C8.x IT system under the management of PostgreSQL DBMS. MS SQL or Oracle may be the DBMS. Fundamentally, the choice of DBMS does not affect the ideas I will bring up. With this article, I am starting a series of articles on DCC, where I will gradually disclose one or another issue and offer solutions to them. I came up with this structure after speaking at one of the IT conferences, where the topic was found to be quite difficult to understand. The first article will be introductory, I will hit the peaks, skip the valleys (emphasizing non-obvious assertions), and outline what's to come in the following publications. For Which IT Systems DCC Is Effective The idea of DCC is to create a special software shell, which will perform all write requests simultaneously and synchronously on all servers, and read requests will be performed on a specific node (server) with user binding. 
In other words, users will be evenly distributed among the servers of the cluster: read requests will be executed locally on the server to which the user is bound, and change requests will be executed synchronously on all servers at once (no logic violations will occur as a result). Therefore, provided that read requests significantly exceed write requests in terms of load, we get a roughly uniform load distribution among DCC servers (nodes). Let's first review this question: is it correct to claim that the load from read requests far outweighs the load from write requests? To answer it, it is helpful to look back a bit at the history of the SQL language: what the goal was and what eventually came to fruition. A Quick Dive Into SQL SQL was originally planned as a language that could be used without programming or math skills. Here's an excerpt from Wikipedia: Codd used symbolic notation with mathematical notation of operations, but Chamberlin and Boyce wanted to design the language so that it could be used by any user, even those without programming skills or knowledge of math.[5] For now, it can be argued that programming skills for SQL are still needed, but definitely minimal. Most programmers have studied only some basics of query optimization and have never heard of SQL Tuning by Dan Tow. A lot of the logic for optimizing queries is concealed inside the DBMS. In the past, for example, MS SQL had a limit of 256 table joins; now, in modern IT systems, it is common to have thousands of joins in a query. Dynamic SQL, where a query is constructed dynamically, is used widely and sometimes without much thought. The truth is that there is no mathematically exact model for building the optimal execution plan for a complex query. This problem is somewhat similar to the traveling salesman problem, which is believed to have no efficient exact solution. The conclusion is as follows: SQL queries have evolutionarily proven their effectiveness, and almost all reporting is generated with SQL queries; the same cannot be said for business and transactional logic. Many SQL dialects turned out to be inconvenient for programming complex transactional logic: SQL does not support object-oriented programming and has very clumsy control-flow constructs. Therefore, it is safe to say that programming has split into two parts: writing a variety of SQL queries for reports and for getting data to the client or application server, and implementing the rest of the logic in the application's own language (no matter whether it is a two-tier or three-tier architecture). In terms of load on the DBMS, this looks like hefty SQL constructs for reading and then lots of small ones for writing. Let us now consider how the load of read and write queries on the DBMS server is distributed over time. First, we need to define what load means and how it can be measured. By load we will mean (in order of priority): CPU (processor load), utilized RAM, and load on the disk subsystem. CPU will be the main resource in terms of load. Let's consider an abstract OLTP system and divide all SQL calls from a set of parallel threads into two categories, Read and Write. Next, using performance monitoring tools, plot an integral value such as CPU load on a chart. If the value is averaged over at least 30 seconds, we see that the “Read” curve is tens or even hundreds of times higher than the “Write” curve. 
This is because, per unit of time, more users can execute reports or macro transactions that use hefty SQL constructs for reading. Sure, there may be periods when the system regularly loads data from replication queues and external systems, or when period-end routines and backup procedures run. But based on long-term statistics for an overwhelming number of IT systems, the load from SQL constructs on Read exceeds the load on Write roughly tenfold. Certainly, there may be exceptions, for example, billing systems where the fact of a change is recorded without any complex logic or reporting, but it is easy to check this with special-purpose software and understand how effective DCC will be for a given IT system. Strategic Area of Application Currently, DCC will be useful and perhaps vital for major companies with extensive information flows and a strong analytical component. For example, major banks may benefit. With the help of relatively small servers, it is possible to compose a DCC that will be far ahead of all existing supercomputers in terms of power. Needless to say, it won't be all pros. The downsides are the added complexity of administering a distributed system and a definite transaction slowdown. Unfortunately, it is true that the network protocols and logic that DCC utilizes cause transactions to slow down. Currently, the target parameter is a transaction slowdown of no more than 15% in terms of time. But, again, in exchange the system becomes much more scalable, peak loads are handled without problems, and on average the transaction time will be lower than in the single-server case. Therefore, if a system faces no problems with peak loads and none are expected strategically, DCC will not be effective. In the future, after administrative processes are automated and optimized, DCC will probably also be effective for medium-sized companies, because it will be possible to build a powerful cluster even using PCs with SSD (fast, unreliable, and cheap) disks. Its distributed structure will make it possible to easily disconnect a failed PC and connect a new one on the fly. DCC's transaction control system will prevent data from being incorrectly recorded. Also, geopolitics cannot be ignored. For example, if access to powerful servers is restricted, DCC will make it possible to build a powerful cluster using servers produced by domestic manufacturers. Why Transactional Replication Cannot Be Used for DCC This section requires a detailed description, and I will cover it in a separate article. Here I will point out only these problems: Many application developers, when using a DBMS, do not even think about what data access conflicts the system resolves within the engine. For example, you cannot simply set up transactional replication, achieve data synchronization across multiple servers, and call the result a DBMS cluster. Such a solution will not resolve the conflict of simultaneous access to the same record (writer-writer). Such collisions will certainly lead to a violation of the logic of the system's behavior. Existing transactional replication protocols are also costly, and such a system will be very much inferior to the single-server option. In total, transactional replication is not suitable for DCC because: 1. Excessive Costs of Typical Synchronous Transaction Replication Protocols Typical distributed transaction protocols have too many additional, primarily time-related, network costs. 
For one network call, up to three additional calls are received. In such a form, the simplest atomic operations degrade dramatically. 2. The Writer-Writer Conflict Is Not Resolved A conflict happens when the same data is changed simultaneously in different transactions. In terms of past change, the system only “remembers” the absolute last change (or history). The point of the SQL construct for sequential application gets lost. Such replication conflicts sometimes have no solutions at all. In a separate article, I will give an example of different replication types for PostgreSQL and Microsoft SQL, and I will explain: Why they cannot solve the transactional load balancing problem architecturallyWhy it is not solved architecturally at the hardware level The writer-writer problem is fundamentally unsolvable without a proxy service at the logical level of analyzing the application's SQL traffic. Exchange Mechanisms (Protocols) A full architectural description of DCC will be provided in a separate article. For now, let's confine ourselves to a brief summary to outline the issue at hand. All queries to the DBMS in DCC go through a proxy service. For example, on 1C systems, it can be installed on the application server. The proxy service recognizes whether the query type is Read or Write. And if Read, it sends it to the server bound to the user (session). If the query type is Change, it sends it to all servers asynchronously. It does not proceed to the next query until it receives a positive response from all servers. If an error occurs, it is propagated to the client application and the transaction is rolled back on all servers. If all servers have confirmed successful execution of the SQL construct, only then does the proxy process the next client SQL query. This is the kind of logic that does not result in logical contradictions. As can be seen, this arrangement incurs additional network and logical costs, although with proper optimization, they are minimal (we seek to achieve no more than 15% of the time delay of transactions). The algorithm described above is the basic protocol, and it is what we will call mirror-parallel. Unfortunately, this protocol is not logically capable of implementing mirrored data replication for all IT systems. In some cases, the data might for sure differ due to the specific nature of the system, another protocol is implemented for this purpose — “centralized asynchronous” — which will resolve synchronous information transfer for sure. The next section will cover it. Why a Centralized Protocol Is Needed in DCC Unfortunately, in some cases, sending the same structure to different servers gets assuredly different results. For example, when inserting a new record into a table, the primary key is generated based on the GUID on the server part. In this case, based on the definition alone, we will for sure get different results on different servers. As an option, it is possible to train the expert system of the proxy service to recognize that this field is formed on the server, form it explicitly on the proxy, and insert it into the query text. What if it is impossible to do so for some reason? To resolve such problems, another protocol and server is introduced. Let's call it Conditionally Central. Next, it will be clear that it is not actually a central server. The protocol algorithm is as follows. The proxy service recognizes that a SQL construct for a change is highly likely to produce different results on different servers. 
Therefore, it immediately redirects the query to the Conditionally Central server. After the query is executed there, replication triggers capture the changes it produced, and all those changes are sent to the remaining servers; only then does the proxy proceed to execute the next command. Similar to the mirror-parallel protocol, if at least one of the servers encounters an error, the error is propagated to the client and the transaction is rolled back. In this protocol, any collisions are completely prevented, data is always guaranteed to be synchronous, and there will be almost no distributed deadlocks. But there is an essential downside: due to its specific nature, the protocol imposes the highest runtime costs. Therefore, it will only be used in exceptional cases; otherwise, the target delay of no more than 15% would not be achievable. Mechanisms for Ensuring Integrity and Synchrony of Distributed Data at the Transaction Level in DCC As we discussed in the previous section, there are SQL change operations (e.g., NEWGUID) that, when executed simultaneously on different servers, are guaranteed to produce different values. Let us rule out all sorts of random functions and fluctuations and assume we have explicit arithmetic statements, e.g., UPDATE SummaryTable SET Total = Total + Delta WHERE ProductID = Y. In a single-threaded setting, such a statement will lead to the same result on every server and the data will stay synchronized, because arithmetic is deterministic. But if such statements are executed by multiple threads with varying Delta values, the threads may interleave differently on different servers, violating the chronology of query execution, which will lead to either deadlocks or data synchronization violations. In fact, it may turn out that the results of transactions on different servers differ. Sure, this will be rare, and it can be made rarer by certain measures, but it cannot be completely ruled out, nor can it be fully resolved without significant performance degradation. No such algorithm exists, just as there is no fully synchronized clock across multiple servers and no network query that is guaranteed to execute within a fixed amount of time. Therefore, DCC has a distributed transaction management service and, in particular, a mandatory transaction hash-sum check. Why a hash sum? Because it makes it possible to quickly compare the content of the changes across all servers. If everything matches, the transaction is confirmed; if not, it is rolled back with a corresponding error. More details will follow in a separate article. In terms of mathematics, there are some interesting similarities with quantum mechanics, in particular with the transactional-loop theory (a marginal theory by that name does exist). The Issue of Distributed Deadlocks in DCC This is one of the key problems in DCC and, in terms of implementation risk, the most dangerous one. Distributed deadlocks in DCC are a consequence of thread confusion caused by changes in the chronology of SQL query execution on different servers, which in turn occur due to uneven load on servers and network interfaces. In this case, unlike local deadlocks, which require at least two locked objects to occur, a distributed deadlock can involve only one object. 
To reduce distributed deadlocks, several process challenges need to be addressed, one of them being the allocation of different physical network interfaces for writing and reading. After all, if we consider the ratio of CPU operations like Read to Write, there will be a ratio of one order, but for network traffic, the ratio will start from two orders of magnitude, more than hundreds of times. Therefore, by splitting these operations (Read-Write) on physically different channels of network communications, we can guarantee a certain time of delivery of Write-type SQL queries to all servers. Also, the fewer locks there are in the system, the less likely distributed deadlocks are in data. However, using DCCs as an additional benefit, it is possible to expand such bottlenecks, if any, in the system at the level of settings. If distributed deadlocks still occasionally occur, there is a special DCC service that monitors all blocking processes on all servers and resolves them by rolling back one of the transactions. More details will follow in a separate article. Special Features of DCC Administration Administration of a distributed system is certainly more complicated than that of a local system, especially with the requirements of operation 24/7. And all potential DCC users are just the proud owners of IT systems with 24/7 operation mode. Immediate problems include distributed database recovery and hot plugging of a new server to the DCC. Prompt data reconciliation in distributed databases is also necessary, despite transaction reconciliation mechanisms. Performance monitoring tasks and, in particular, the aggregation of counter data across related transactions and cluster servers in general begin to emerge. There are some security issues with setting up a proxy service. A full list of problems and proposed solutions will be in a separate article. Parallel Computing Technologies as a Solution to Increase the Efficiency of DCC Use For a scalable IT system, high parallelism of processes within the database is essential. For parallel reporting, as a rule, this issue does not occur. For transactional workloads, due to historical vestiges of suboptimal architecture, locks at the level of changing the same records (writer-writer conflict) are possible. If the IT system can be changed and there is open-source code, then the system can be optimized. And if the system is closed, what shall we do in this case? In case of using DCC, there are opportunities at the level of administration to circumvent such restrictions. Or at least expand the possibilities. In particular, through customizations, we can enable changing the same record without waiting for the transaction to be committed — if a dirty read is possible, of course. At the same time, if the transaction is rolled back, the change data in the chronological sequence are also rolled back. This situation is exactly appropriate, for example, for tables with aggregation of totals. I already have solutions for this problem, and I believe that regardless of using DCC, it is necessary to expand administrative settings of the DBMS, both Postgres and MSSQL (haven't investigated the issue on Oracle). More details will follow in a separate article. It is also necessary to disclose the topic of dirty reading in DCC and possible minor improvements taking this into account, such as the introduction of virtual locks. Plan for the Following Publications on the Topic of DCC Article 2. DCC load-testing results Article 3. 
Plan for the Following Publications on the Topic of DCC

Article 2. DCC load-testing results
Article 3. Why transactional replication can't be used for DCC
Article 4. Brief architectural description of the DCC
Article 5. The purpose of a centralized protocol in DCC
Article 6. Mechanisms for ensuring integrity and synchronization of distributed data at the transaction level in DCC
Article 7. The problem of distributed deadlocks in DCC
Article 8. Special features of DCC administration
Article 9. Parallel computing technologies as a tool to increase the efficiency of DCC utilization
Article 10. Example of DCC integration with 1C 8.x
Enterprise data solutions are growing across data warehouses, data lakes, data lakehouses, and hybrid cloud platforms. As data grows exponentially across these services, it is the data practitioner's responsibility to secure the environment with guardrails and privacy boundaries. In this article, we will walk through a framework for implementing security controls in AWS and see how to apply it across the Redshift, Glue, DynamoDB, and Aurora database services.

The Security Framework for Modern Data Infrastructure

When building scalable and secure AWS-native data platforms (Glue, Redshift, DynamoDB, Aurora), I recommend thinking of security in terms of seven pillars. Each pillar comes with practical checkpoints you can implement and audit against.

Pillar 1: Identity and Access Control

The identity and access control pillar ensures only the right people and systems can touch your data. Start by centralizing identities with IAM Identity Center/SSO. Enforce the principle of least privilege with IAM roles (not long-lived users), granting only the access a user needs to perform their job duties. You can also leverage attribute-based access control (ABAC), which drives access through tags such as department=finance or data_classification=pii. By starting with identity as the first pillar, we establish clear boundaries, with every database object having an owning principal.

Pillar 2: Data Classification and Catalog Governance

The second step is to go a level deeper and classify the datasets attached to those identities. In a data lake, we can label datasets with tags such as pii=high or pii=highly-confidential. Once classified, these tags drive tag-based access control (TBAC) across services such as Glue and Redshift, ensuring only the right people see the right data. Along with this, maintaining column-level metadata like region or compliance domain in the Glue Data Catalog makes governance consistent and transparent. With proper classification and catalog governance, policies can be applied uniformly across the enterprise instead of in silos.

Pillar 3: Network and Perimeter Security

Keep your data safe by making sure it only travels over private, secure paths. Put your databases in private networks, use private connections (such as VPC endpoints) to reach services, and make sure all data leaving the system is encrypted and inspected.

Pillar 4: Encryption as Needed

We should not treat all data the same way; the treatment should follow the data classification from Pillar 2. Red data (very sensitive, such as financial or health records) should be tightly secured at rest using KMS customer-managed keys (CMKs) with rotation turned on, and a good practice is not to keep red data in open or long-lived storage. Orange data is important but less sensitive, such as business logs, and should at least be protected by proper bucket policies. Green data is general data that can be shared more freely, such as ordinary logs, and does not require encryption.

Pillar 5: Secrets and Credential Management

Never store passwords in a code base or in queries. In AWS, you can keep them in Secrets Manager, which encrypts them and rotates them periodically. Instead of giving every app a fixed password, let it obtain temporary credentials through IAM roles, which is safer and harder to misuse. For databases like Aurora, you do not even need a password at all; you can log in with a short-lived token. The rule is simple: don't use permanent keys; always use rotating or temporary ones.
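As a small illustration of Pillar 5, the sketch below fetches database credentials from Secrets Manager at runtime instead of hard-coding them. It assumes a secret with username and password fields already exists; the secret name, field names, and region are placeholders.

Python
# Minimal sketch: read database credentials from AWS Secrets Manager at runtime
# rather than embedding them in code or configuration files.
import json
import boto3

def get_db_credentials(secret_id="app/aurora/credentials", region="us-east-1"):
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["password"]

# The calling code never sees a long-lived key; rotation changes the stored
# value in Secrets Manager without requiring a redeploy.
username, password = get_db_credentials()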
Pillar 6: Monitoring, Detection, and Audit

Think of monitoring as a CCTV camera for your data: you should always know who touched what, when, and why. In AWS, turn on CloudTrail to record all actions and store those records safely in CloudWatch Logs. Tools like GuardDuty act as guards watching for unusual activity, while Security Hub gathers all findings in one place. For stricter checks, databases like Aurora and Redshift have their own audit logs, and Macie scans S3 to catch exposed sensitive files. The idea is simple: if something goes wrong, you should be able to trace it back quickly.

Pillar 7: Policy as Code

Manage cloud policies as infrastructure as code rather than through manual deployments so that they scale. In AWS, you can define KMS keys, IAM roles, or Lake Formation policies in CloudFormation, CDK, or Terraform. Before changes go live, tools like cfn-nag or tfsec flag anything that looks unsafe. For risky actions (such as changing IAM roles or encryption keys), set up approval steps so no one sneaks in a bad change.

Example #1: AWS Glue + Lake Formation (Catalog, ETL, Data Perimeter)

AWS Glue works like the factory that moves and transforms your data, while Lake Formation is the guardrail that makes sure only the right people and systems can see the right parts of that data. Together, they centralize governance, protect sensitive fields, and ensure ETL jobs run safely without leaking information.

Steps to Implement Security

1. Classify your data with tags: Define tags such as pii={none, low, high} (or simply pii={true, false}) and region={us, eu}. Apply these tags to databases, tables, and even columns in the Glue Data Catalog.
2. Control access with tag-based policies (TBAC): Create Lake Formation permissions using tags, for example: analyst role: pii != high; ops role: pii in {none, low}; compliance role: full access and audit rights (a short boto3 sketch of such a grant follows these steps).
3. Apply row-level filters and column masking: Use LF-governed tables to filter rows (e.g., only show region=session_region). Mask sensitive columns, such as email and date of birth, with hash values.
4. Secure your Glue jobs: Turn on encryption for S3, CloudWatch, and job bookmarks with KMS CMKs. Run Glue jobs inside a VPC, with S3 access routed through Gateway/Interface Endpoints rather than the public internet. Assign a minimal IAM role per job, keeping dev and prod roles separate and scoped to exact resources.
5. Keep catalog and ETL hygiene strong: Block public access to S3 buckets (disable ACLs/public policies). Require TLS and server-side encryption on all writes (aws:SecureTransport, x-amz-server-side-encryption). Enable continuous logging of Glue jobs into CloudWatch for audit and troubleshooting.
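The boto3 sketch below illustrates the tag-based grant referenced in step 2. The account ID, role ARN, database, and table names are placeholders, and it assumes Lake Formation already governs the catalog.

Python
# Sketch only: define an LF-tag, classify a table with it, and grant SELECT to a
# role for every table whose pii tag is 'none' or 'low'. All identifiers below
# are hypothetical placeholders.
import boto3

lf = boto3.client("lakeformation")

# 1. Define the tag vocabulary once per catalog.
lf.create_lf_tag(TagKey="pii", TagValues=["none", "low", "high"])

# 2. Classify a table (column-level tagging works the same way).
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "sales", "Name": "orders"}},
    LFTags=[{"TagKey": "pii", "TagValues": ["low"]}],
)

# 3. Grant the analyst role access to anything tagged pii=none or pii=low.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "pii", "TagValues": ["none", "low"]}],
        }
    },
    Permissions=["SELECT"],
)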
Example #2: Amazon Redshift (Warehouse Analytics)

Amazon Redshift is your data warehouse; it is powerful for analytics but also home to a lot of sensitive data. Protecting it means enforcing who can see which rows or columns, isolating traffic so nothing leaks, and making sure every action is logged.

Steps to Implement Security

1. Network and encryption: Place Redshift clusters or serverless workgroups in private subnets (no public endpoints). Turn on encryption at rest with a customer-managed KMS key. Force SSL connections (reject non-TLS). Use Enhanced VPC Routing so COPY/UNLOAD only moves data via VPC endpoints.
2. Identity and SSO: Use IAM Identity Center or SAML for single sign-on. Avoid static keys; rely on role chaining for COPY/UNLOAD to S3.
3. Fine-grained controls: Enable row-level security (RLS) and column-level security (CLS). Use dynamic data masking for fields like SSNs, showing only partial data unless the role allows full access.
4. Audit and logging: Enable database audit logging to S3/CloudWatch. Integrate with CloudTrail for management events.

Example #3: Amazon DynamoDB (Operational Data)

Amazon DynamoDB powers fast apps at scale, but governance here is about restricting who can touch which items, keeping traffic private, and ensuring logs exist for compliance.

Steps to Implement Security

1. Item-level permissions: Use IAM conditions like dynamodb:LeadingKeys to tie access to a user's partition key (e.g., users only see their own orders). For example, bind the customer_id in the request to the caller's IAM tag.
2. Private access and encryption: Use Gateway VPC Endpoints for DynamoDB; block non-VPC traffic if possible (via SCP). Require encryption at rest with customer-managed KMS keys.
3. Resilience and lifecycle: Turn on Point-in-Time Recovery (PITR) and on-demand backups. Use TTL for short-lived items to reduce exposure (but don't rely on TTL alone for compliance deletion).
4. Audit: Enable CloudTrail data events for sensitive tables where you need full visibility (note: extra cost).
5. Streams and integrations: If using DynamoDB Streams for CDC, ensure consumer apps (Lambda, Glue) run inside a VPC with least-privilege roles, and force them to write only into encrypted destinations.

Example #4: Amazon Aurora (Relational Data)

Amazon Aurora is a managed relational database (compatible with PostgreSQL and MySQL) that runs mission-critical workloads. Because it often stores highly sensitive transactional data, the governance model here must combine AWS controls (encryption, network) with native SQL features (roles, RLS, auditing).

Steps to Implement Security

1. Network and endpoints: Deploy Aurora clusters in private subnets and never expose public endpoints. Restrict inbound rules to application security groups only, not wide CIDRs.
2. Encryption and TLS: Enable KMS CMK encryption at cluster creation. Enforce TLS connections: set rds.force_ssl=1 (Postgres) to reject non-SSL clients.
3. Identity and credentials: Store master and user credentials in AWS Secrets Manager with automatic rotation (Lambda). Use IAM Database Authentication for short-lived, token-based access; it integrates neatly with CloudTrail for auditing (a short token-generation sketch follows these steps).
4. Database-level governance: Define roles with least privilege:

SQL
CREATE ROLE analyst NOINHERIT;
GRANT USAGE ON SCHEMA sales TO analyst;
GRANT SELECT (order_id, amount, region) ON sales.orders TO analyst;

Enable row-level security (RLS):

SQL
ALTER TABLE sales.orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY region_isolation ON sales.orders
    USING (region = current_setting('app.user_region', true));

5. Auditing: Enable pgaudit to log SELECT, DDL, and DML events as needed. Stream Aurora/Postgres logs to CloudWatch Logs and set appropriate retention policies.
6. Backups, PITR, and disaster recovery: Turn on automated backups and Point-in-Time Recovery (PITR). Regularly test restores to verify recovery SLAs. For stronger assurance, create cross-region read replicas and protect them with replicated CMKs.
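The sketch below illustrates the token-based access from step 3. The cluster endpoint, database, user, and region are placeholders, and it assumes the database user has been granted the rds_iam role.

Python
# Sketch: connect to an Aurora PostgreSQL cluster with a short-lived IAM auth
# token instead of a stored password. All connection details are hypothetical.
import boto3
import psycopg2

HOST = "my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com"
PORT = 5432
USER = "analyst"
REGION = "us-east-1"

rds = boto3.client("rds", region_name=REGION)
token = rds.generate_db_auth_token(
    DBHostname=HOST, Port=PORT, DBUsername=USER, Region=REGION
)

# The token expires after roughly 15 minutes, so there is no long-lived secret
# to leak or rotate manually.
conn = psycopg2.connect(
    host=HOST, port=PORT, user=USER, password=token,
    dbname="sales", sslmode="require",
)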
AWS Security Framework Cheatsheet

Network isolation. Glue: VPC jobs, endpoints. Redshift: private subnets, no public endpoint, Enhanced VPC Routing. DynamoDB: Gateway VPC Endpoint. Aurora: private subnets, SG-only ingress.
Encryption at rest. Glue: KMS on catalog, logs, and job I/O. Redshift: KMS CMK per cluster/workgroup. DynamoDB: KMS CMK per table. Aurora: KMS CMK per cluster.
TLS in transit. Glue: VPC endpoints. Redshift: require SSL. DynamoDB: TLS to the endpoint (SigV4). Aurora: enforce SSL (rds.force_ssl).
Fine-grained access. Glue: LF TBAC, row/cell masking. Redshift: RLS/CLS, masking policies, late-binding views. DynamoDB: IAM with LeadingKeys ABAC. Aurora: GRANTs, RLS, views/pgcrypto.
Secrets and auth. Glue: least-privilege job role. Redshift: SSO/SAML, IAM roles for COPY/UNLOAD. DynamoDB: IAM roles, no static keys. Aurora: Secrets Manager with rotation, optional IAM DB Auth.
Audit and detection. Glue: catalog access logs, Glue job logs. Redshift: user activity log, CloudTrail, QMRs. DynamoDB: CloudTrail data events. Aurora: pgaudit, CloudWatch Logs.
Backup/recovery. Glue: ETL is stateless. Redshift: snapshots, cross-region as needed. DynamoDB: PITR, on-demand backups. Aurora: automated backups, PITR, cross-region replica.

By grounding security in seven pillars (identity, classification, network, encryption, secrets management, monitoring, and policy as code), organizations gain more than guardrails; they gain a framework for sustainable and secure growth.