Posted on May 1

Sharding with SpringBoot

Understanding Database Sharding

Imagine your application has grown so popular that a single database server can’t handle all your data efficiently. Sharding is like splitting your database into multiple smaller pieces (called “shards”), where each shard:

Lives on a different server
Contains the same table structure
Holds only a portion of your total data

For example, with customer data:

Shard 1 might contain customers with IDs 1–1000
Shard 2 might contain customers with IDs 1001–2000

Why We Need Sharding

Performance: Queries run faster because each database has less data to search through
Scalability: You can keep adding more shards as your data grows
Availability: If one shard goes down, the others remain available

Common Sharding Strategies

The way you divide your data affects how you’ll retrieve it:

Hash-Based Sharding
A hash function determines which shard contains the data (e.g., customerID % numberOfShards). This evenly distributes data but makes range queries difficult.
Range-Based Sharding
Data is split based on ranges of a key value (e.g., customers A-M in Shard 1, N-Z in Shard 2). Good for range queries but can lead to uneven distribution.
Directory-Based Sharding
Uses a lookup table that maps keys to specific shards. More flexible but adds an extra lookup step.

Implementing Sharding in Spring Boot

Let’s explore three approaches to implement sharding in Spring Boot, from simplest to most advanced.

Approach 1: Custom Shard Routing (AbstractRoutingDataSource)
This approach gives you direct control over which shard to query.

Step 1: Configure Multiple Data Sources

@Configuration public class DataSourceConfig { // Configuration properties for each shard @Bean @ConfigurationProperties("app.datasource.shard1") public DataSourceProperties shard1Properties() { return new DataSourceProperties(); } @Bean @ConfigurationProperties("app.datasource.shard2") public DataSourceProperties shard2Properties() { return new DataSourceProperties(); } // Create actual connections to each database shard @Bean public DataSource shard1DataSource() { return shard1Properties().initializeDataSourceBuilder().build(); } @Bean public DataSource shard2DataSource() { return shard2Properties().initializeDataSourceBuilder().build(); } // Create a routing datasource that will switch between shards @Bean public DataSource routingDataSource() { Map<Object, Object> dataSources = new HashMap<>(); dataSources.put("shard1", shard1DataSource()); dataSources.put("shard2", shard2DataSource()); ShardRoutingDataSource routingDataSource = new ShardRoutingDataSource(); routingDataSource.setTargetDataSources(dataSources); routingDataSource.setDefaultTargetDataSource(shard1DataSource()); // Fallback shard return routingDataSource; } // Set up transaction management @Bean public PlatformTransactionManager transactionManager() { return new DataSourceTransactionManager(routingDataSource()); } // Configure the entity manager factory for JPA @Bean public LocalContainerEntityManagerFactoryBean entityManagerFactory(EntityManagerFactoryBuilder builder) { return builder .dataSource(routingDataSource()) .packages("com.example.model") // Your entity package .persistenceUnit("shardedPersistenceUnit") .build(); } // Make this our primary DataSource that the application will use @Primary @Bean public DataSource dataSource() { // LazyConnectionDataSourceProxy improves performance by only // connecting to the database when actually needed return new LazyConnectionDataSourceProxy(routingDataSource()); } }

Step 2: Create a Custom Routing DataSource
This class extends Spring’s built-in routing capability to decide which shard to use:

public class ShardRoutingDataSource extends AbstractRoutingDataSource { // ThreadLocal variable to store current shard for each thread/request private static final ThreadLocal<String> currentShard = new ThreadLocal<>(); // Set which shard to use public static void setCurrentShard(String shardId) { currentShard.set(shardId); } // Get the current shard to use public static String getCurrentShard() { return currentShard.get(); } // Clean up after we're done - important to prevent memory leaks! public static void clearCurrentShard() { currentShard.remove(); } @Override protected Object determineCurrentLookupKey() { // This method asks our ThreadLocal which shard to use right now return getCurrentShard(); } }

Step 3: Create a Shard Selection Strategy
This service determines which shard should contain a particular piece of data:

@Service public class ShardingService { /** * Determines which shard contains data for a given customer ID * * @param customerId The customer ID to look up * @return The shard key (e.g., "shard1" or "shard2") */ public String determineShardKey(Long customerId) { // This is a simple example using the modulo operator // If customerId is odd, use shard1; if even, use shard2 return "shard" + (customerId % 2 + 1); // In real applications, you might use more sophisticated strategies: // - Hash-based: using a hash function on the ID // - Range-based: specific ID ranges go to specific shards // - Geography-based: users from different regions go to different shards } }

Step 4: Using the Sharding System in a Repository

@Repository public class CustomerRepository { @Autowired private JdbcTemplate jdbcTemplate; // Spring's helper for database queries @Autowired private ShardingService shardingService; /** * Find a customer by their ID * * @param customerId The customer to look up * @return The customer information */ public Customer findById(Long customerId) { try { // Step 1: Figure out which shard has this customer String shardKey = shardingService.determineShardKey(customerId); // Step 2: Tell our system to use that shard ShardRoutingDataSource.setCurrentShard(shardKey); // Step 3: Run the query (it will automatically go to the right shard) return jdbcTemplate.queryForObject( "SELECT * FROM customers WHERE id = ?", new Object[]{customerId}, (resultSet, rowNum) -> { Customer customer = new Customer(); customer.setId(resultSet.getLong("id")); customer.setName(resultSet.getString("name")); customer.setEmail(resultSet.getString("email")); // Set other fields... return customer; } ); } finally { // Step 4: Always clean up the shard selection when done ShardRoutingDataSource.clearCurrentShard(); } } }

Step 5: A Simple Controller Example

@RestController @RequestMapping("/api/customers") public class CustomerController { @Autowired private CustomerRepository customerRepository; @GetMapping("/{id}") public ResponseEntity<Customer> getCustomer(@PathVariable Long id) { try { Customer customer = customerRepository.findById(id); return ResponseEntity.ok(customer); } catch (EmptyResultDataAccessException e) { return ResponseEntity.notFound().build(); } } }

Step 6: Configuration in application.properties

# Shard 1 Database Connection Info app.datasource.shard1.url=jdbc:mysql://shard1-database-server:3306/customers_db app.datasource.shard1.username=db_user app.datasource.shard1.password=db_password app.datasource.shard1.driver-class-name=com.mysql.cj.jdbc.Driver # Shard 2 Database Connection Info app.datasource.shard2.url=jdbc:mysql://shard2-database-server:3306/customers_db app.datasource.shard2.username=db_user app.datasource.shard2.password=db_password app.datasource.shard2.driver-class-name=com.mysql.cj.jdbc.Driver

Approach 2: Using Apache ShardingSphere
For more complex sharding scenarios, Apache ShardingSphere provides a comprehensive solution that abstracts much of the complexity.

Step 1: Add Dependencies

<dependency> <groupId>org.apache.shardingsphere</groupId> <artifactId>shardingsphere-jdbc-spring-boot-starter</artifactId> <version>5.3.0</version> </dependency> <!-- For distributed transactions, add this: --> <dependency> <groupId>org.apache.shardingsphere</groupId> <artifactId>shardingsphere-transaction-xa-core</artifactId> <version>5.3.0</version> </dependency> Step 2: Configure ShardingSphere in application.yml spring: shardingsphere: mode: type: Standalone repository: type: JDBC datasource: names: ds0, ds1 ds0: type: com.zaxxer.hikari.HikariDataSource driverClassName: com.mysql.cj.jdbc.Driver jdbcUrl: jdbc:mysql://localhost:3306/shard0 username: root password: password maxPoolSize: 10 ds1: type: com.zaxxer.hikari.HikariDataSource driverClassName: com.mysql.cj.jdbc.Driver jdbcUrl: jdbc:mysql://localhost:3307/shard1 username: root password: password maxPoolSize: 10 rules: sharding: default-database-strategy: standard: sharding-column: customer_id sharding-algorithm-name: database-inline sharding-algorithms: database-inline: type: INLINE props: algorithm-expression: ds${customer_id % 2} tables: customers: actual-data-nodes: ds${0..1}.customers orders: actual-data-nodes: ds${0..1}.orders database-strategy: standard: sharding-column: customer_id sharding-algorithm-name: database-inline key-generate-strategy: column: order_id key-generator-name: snowflake key-generators: snowflake: type: SNOWFLAKE props: sql-show: true # Log sharded SQL for debugging

Step 3: Use Standard JPA Repositories
With ShardingSphere, you can use standard Spring Data repositories:

@Entity @Table(name = "customers") public class Customer { @Id private Long id; private String name; private String email; // Getters and setters } @Repository public interface CustomerRepository extends JpaRepository<Customer, Long> { // Regular JPA methods - ShardingSphere handles the routing List<Customer> findByNameContaining(String nameFragment); }

Step 4: Service Layer with Transaction Management
ShardingSphere supports distributed transactions using XA:

@Service public class CustomerOrderService { @Autowired private CustomerRepository customerRepository; @Autowired private OrderRepository orderRepository; // This transaction will automatically be distributed across shards @Transactional(rollbackFor = Exception.class) public void createCustomerWithOrder(Customer customer, Order order) { customerRepository.save(customer); order.setCustomerId(customer.getId()); orderRepository.save(order); } }

Step 5: ShardingSphere’s Built-in Query Pagination
ShardingSphere automatically handles cross-shard pagination for you:

@Service public class CustomerService { @Autowired private CustomerRepository customerRepository; public Page<Customer> findAllCustomers(int page, int size) { // ShardingSphere automatically merges and orders results from all shards return customerRepository.findAll(PageRequest.of(page, size, Sort.by("id"))); } }

ShardingSphere Architecture Overview

ShardingSphere provides several key components:

ShardingSphere-JDBC: A client-side solution that intercepts JDBC calls and routes them to the appropriate shards
ShardingSphere-Proxy: A server-side database proxy that you can connect to like a regular database
ShardingSphere-Sidecar: Service mesh integration (for cloud-native deployments)
For most Spring Boot applications, ShardingSphere-JDBC is the preferred approach.

Approach 3: Using a Data Virtualization Layer
For complex environments with multiple data source types beyond just SQL databases, a data virtualization platform can provide a unified view.

Data Virtualization Architecture

A typical data virtualization setup includes:

Data Source Connectors: Connect to various data sources (SQL, NoSQL, API, etc.)
Query Engine: Optimizes and routes queries to appropriate data sources
Caching Layer: Improves performance by caching frequently accessed data
Virtual Database Layer: Presents a unified view through standard interfaces (JDBC, ODBC)

Step 1: Set Up a Data Virtualization Platform
Popular options include:

Denodo Platform
Dremio
IBM Data Virtualization Manager
Red Hat JBoss Data Virtualization

Step 2: Define Virtual Views
In your virtualization platform, define virtual views that combine data from multiple shards.

For example, in Denodo:

CREATE VIEW global_customers AS SELECT * FROM shard1_customers UNION ALL SELECT * FROM shard2_customers;

Step 3: Connect Spring Boot to the Virtualization Layer

@Configuration public class DataVirtualizationConfig { @Bean @ConfigurationProperties("spring.datasource") public DataSource virtualDataSource() { return DataSourceBuilder.create().build(); } @Bean public JdbcTemplate jdbcTemplate() { return new JdbcTemplate(virtualDataSource()); } }

Step 4: Use Standard JPA or JDBC to Query Virtual Views

@Repository public class VirtualizedCustomerRepository { @Autowired private JdbcTemplate jdbcTemplate; public List<Customer> findAllCustomers(int page, int size) { return jdbcTemplate.query( "SELECT * FROM global_customers ORDER BY id LIMIT ? OFFSET ?", new Object[]{size, page * size}, customerRowMapper ); } }

Advantages of Data Virtualization

Provides a unified view across disparate data sources
Abstracts away the complexity of sharding from application code
Enables real-time data integration without ETL processes
Supports complex data transformations and advanced query capabilities

Transaction Management in Sharded Environments
Transaction management becomes challenging in sharded environments because traditional database transactions only work within a single database.

Single-Shard Transactions
These work just like regular database transactions:

@Transactional public void updateCustomerDetails(Long customerId, String newName) { // ShardingContextHolder.setCurrentShard is handled by an aspect customerRepository.updateName(customerId, newName); auditRepository.logNameChange(customerId, newName); // Both operations occur in the same shard and form a single transaction }

Cross-Shard Transaction Patterns

Two-Phase Commit (2PC) The traditional approach for distributed transactions:

Prepare Phase: Each shard is asked if it can commit
Commit Phase: If all shards agree, they all commit; otherwise, all rollback
ShardingSphere supports 2PC using the XA protocol:

@Service public class CrossShardService { // ShardingSphere handles the 2PC under the hood @ShardingSphereTransactionType(TransactionType.XA) @Transactional public void transferBetweenCustomers(Long fromCustomerId, Long toCustomerId, BigDecimal amount) { // These might be in different shards, but XA handles it accountRepository.deductBalance(fromCustomerId, amount); accountRepository.addBalance(toCustomerId, amount); } }

Saga Pattern A sequence of local transactions with compensating actions for failures:

@Service public class SagaTransactionService { @Autowired private AccountRepository accountRepository; @Autowired private TransferRepository transferRepository; // Not a @Transactional method - we manage the transaction flow manually public boolean transferWithSaga(Long fromId, Long toId, BigDecimal amount) { // Step 1: Record the transfer intent Long transferId = recordTransferIntent(fromId, toId, amount); try { // Step 2: Deduct from source account (Shard 1) if (!deductFromAccount(fromId, amount, transferId)) { return false; // Failed at step 2 } // Step 3: Add to destination account (Shard 2) if (!addToAccount(toId, amount, transferId)) { // Compensating transaction for step 2 refundAccount(fromId, amount, transferId); return false; // Failed at step 3 } // Step 4: Mark transfer as complete markTransferComplete(transferId); return true; } catch (Exception e) { // Handle unexpected errors, apply compensating transactions handleSagaFailure(transferId, fromId, toId, amount); return false; } } // Methods to implement each step and compensating actions... }

Eventual Consistency with Event-Based Approach Using events for asynchronous processing across shards:

@Service public class EventBasedTransactionService { @Autowired private AccountRepository accountRepository; @Autowired private ApplicationEventPublisher eventPublisher; @Transactional // Local transaction in a single shard public void initiateTransfer(Long fromId, BigDecimal amount, Long toId) { // Deduct from source account accountRepository.deductBalance(fromId, amount); // Publish event for the second part of the transaction eventPublisher.publishEvent(new FundsTransferredEvent(fromId, toId, amount)); } @EventListener @Transactional // Will run in the appropriate shard based on toId public void handleFundsTransfer(FundsTransferredEvent event) { // Add to destination account accountRepository.addBalance(event.getToId(), event.getAmount()); } }

Transaction Isolation in Sharded Environments
Be aware of limitations:

No global read consistency across shards
Potential for partial failures in distributed transactions
Transaction isolation only applies within a single shard

Approach Comparison: Pros and Cons

Approach: AbstractRoutingDataSource

Pros:

Complete control over routing
No extra dependencies
Easier to understand and debug
Works with any SQL database

Cons:

Manual implementation of cross-shard operations
No built-in support for distributed transactions
Cross-shard pagination is complex
More boilerplate code

Approach: Apache ShardingSphere

Pros:

Handles cross-shard queries automatically
Built-in distributed transaction support
SQL parser with query optimization
Supports various sharding algorithms
Transparent to application code

Cons:

Additional dependency
Learning curve for configuration
Less flexibility for custom routing logic
Performance overhead for query parsing

Approach: Data Virtualization

Pros:

Unified view across multiple data sources
Can integrate non-SQL data sources
Query optimization across sources
Application completely unaware of sharding

Cons:

Usually requires commercial licenses
Higher infrastructure complexity
Potential bottleneck for performance
Limited transaction capabilities across sources

Best Practices:
Start Simple: Begin with the AbstractRoutingDataSource approach before moving to more complex solutions
Add Logging: Log which shard is being accessed for each query during development
Create Unit Tests: Verify your sharding logic with tests against each shard
Choose Smart Shard Keys: Select keys that distribute data evenly and minimize cross-shard queries
Keep Related Data Together: Try to keep related data on the same shard to avoid complex cross-shard operations
Use Transactions Carefully: Be aware of the limitations of transactions across shards
Plan for Growth: Design your sharding strategy to accommodate adding more shards in the future
Handle Errors Gracefully: Always include robust error handling for shard routing failures
Implement Circuit Breakers: Use patterns like Circuit Breaker for resilience against shard failures
Monitor Shard Distribution: Track the data distribution across shards to detect imbalances

Next Steps for Learning

Start with a simple two-shard setup as shown here
Add logging to see which shard is being used for each query
Create unit tests that verify your sharding logic
Practice implementing cross-shard queries
Experiment with different sharding strategies

How did I write this post:

Prompt 1: You are an experienced software developer. Can you explain me how would implement fetching of data from sharded databases in SpringBoot for SQL database?

And a few more refinements using other LLM Tools.