Horizontal Scaling & Partitioning
Objective: Evolve the solution to support 100,000+ concurrent users with global distribution.
✅ Required: How do you partition document operations across multiple servers?
✅ Required: Compare sharding strategies by document, user, geographic region, or hybrid
✅ Required: How do you handle hot documents with thousands of concurrent editors?
✅ Optional: What's your approach to load balancing and avoiding bottlenecks?
✅ Optional: How do you handle resharding and rebalancing as the system grows?
✅ Optional: Design your consensus mechanism for distributed coordination
1. Partitioning document operations
Every operation in the system is an LSEQ CRDT operation, which is deterministic, commutative, and idempotent. This allows operations to be processed safely across multiple nodes, supporting horizontal scaling.
In DynamoDB, the primary key pattern follows:
pk = documentId#<documentId>#operationId#<crdtOperationId>
This design makes every operation item unique and conflict-free, while the documentId prefix keeps operations from the same document easy to identify. As traffic grows, DynamoDB automatically spreads items across physical partitions based on the key hash, providing high throughput.
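As a minimal illustration (the helper name is hypothetical), the composite key can be built like this:

```typescript
// Hypothetical helper showing the composite-key pattern described above.
// Every operation gets a unique key, while the documentId prefix keeps
// related items easy to identify.
export function operationKey(documentId: string, crdtOperationId: string): string {
  return `documentId#${documentId}#operationId#${crdtOperationId}`;
}

// Example: operationKey("abc123", "op-7f2e")
//   => "documentId#abc123#operationId#op-7f2e"
```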
Kafka topics are also partitioned by documentId, keeping message order per document and distributing load evenly across brokers.
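A minimal producer sketch (using kafkajs; the broker address and topic name are assumptions) shows how keying by documentId preserves per-document ordering:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "doc-producer", brokers: ["broker:9092"] });
const producer = kafka.producer();
await producer.connect();

// Using documentId as the message key makes Kafka hash all operations for
// a document onto the same partition, preserving per-document order while
// spreading different documents across brokers.
export async function publishOperation(documentId: string, op: unknown): Promise<void> {
  await producer.send({
    topic: "document-operations",
    messages: [{ key: documentId, value: JSON.stringify(op) }],
  });
}
```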
2. Load balancing and avoiding bottlenecks
Load balancing is done through AWS NLB for WebSocket connections and ALB for REST APIs. All Producer instances are stateless, so sticky sessions are not required.
To distribute load evenly, the NLB spreads incoming connections across all EC2 instances in the Auto Scaling Group (or Kubernetes cluster), avoiding the concentration of too many users or documents on a single node.
To prevent overload when receiving massive event spikes (from 100k+ clients), backpressure and batching mechanisms can be applied to Kafka producers. These techniques help control throughput, reduce latency variance, and smooth sudden traffic peaks.
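One way to apply producer-side batching (a sketch with assumed thresholds, reusing the producer from the earlier snippet) is to buffer outgoing messages and flush them in a single request:

```typescript
type Pending = { key: string; value: string };

const buffer: Pending[] = [];
const MAX_BATCH = 500;        // flush once this many messages accumulate
const FLUSH_INTERVAL_MS = 50; // ...or after this interval, whichever comes first

async function flush(): Promise<void> {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length);
  // One network round-trip for the whole batch smooths spikes and
  // reduces per-message overhead during 100k+ client bursts.
  await producer.send({ topic: "document-operations", messages: batch });
}

export function enqueue(documentId: string, op: unknown): void {
  buffer.push({ key: documentId, value: JSON.stringify(op) });
  if (buffer.length >= MAX_BATCH) void flush();
}

setInterval(() => void flush(), FLUSH_INTERVAL_MS);
```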
3. Hot documents
Hot documents occur when a single document receives thousands of concurrent operations.
To mitigate this:
- Partitioning by documentId ensures isolation: only that partition becomes hot, not the entire cluster.
- Local operation buffering (a short-lived in-memory queue), debouncing, or throttling smooths bursts before publishing to Kafka (see the sketch after this list).
- CRDT batching can combine several local operations before persisting, reducing write amplification.
- Dynamic partition reassignment can temporarily offload hot partitions to dedicated nodes if needed.
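A minimal sketch of the buffering/debounce idea (the threshold and publishBatch are assumptions): coalesce bursts of operations per document and publish them as one batch after a short quiet period.

```typescript
const pending = new Map<string, unknown[]>();
const timers = new Map<string, NodeJS.Timeout>();
const DEBOUNCE_MS = 25; // assumed quiet period before flushing

// publishBatch is assumed to write one Kafka message carrying all
// coalesced operations, reducing write amplification on hot documents.
declare function publishBatch(documentId: string, ops: unknown[]): Promise<void>;

export function bufferOperation(documentId: string, op: unknown): void {
  const ops = pending.get(documentId) ?? [];
  ops.push(op);
  pending.set(documentId, ops);

  clearTimeout(timers.get(documentId)); // restart the debounce window
  timers.set(documentId, setTimeout(() => {
    const batch = pending.get(documentId) ?? [];
    pending.delete(documentId);
    timers.delete(documentId);
    void publishBatch(documentId, batch);
  }, DEBOUNCE_MS));
}
```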
Avoiding sticky sessions also helps: this use case does not need them, and pinning users of the same document to the same node and partition is exactly what produces a hot document. When designing partitions, plan the partitioning strategy accordingly to avoid bottlenecks, for example in the choice of partition keys for Kafka topics.
4. Horizontal autoscaling and rebalancing
Both compute and data layers support elasticity:
4.1 Horizontal autoscaling (ASG or Kubernetes HPA)
- Auto Scaling Groups (ASG) or Kubernetes dynamically adjust EC2 instances based on CPU, memory, network I/O, and Kafka consumer lag metrics.
- The stateless design means rebalancing or resharding does not require session migration; new nodes can join seamlessly.
4.2 NodeJS Cluster mode
- We can also take advantage of Node.js's ability to use all available CPU cores by running in cluster mode; Node.js distributes incoming connections across workers automatically.
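A minimal cluster-mode sketch (the port number is assumed):

```typescript
import cluster from "node:cluster";
import { cpus } from "node:os";
import http from "node:http";

if (cluster.isPrimary) {
  // Fork one worker per CPU core; the primary distributes incoming
  // connections across workers (round-robin scheduling on Linux).
  for (let i = 0; i < cpus().length; i++) cluster.fork();
  cluster.on("exit", (worker) => {
    console.log(`worker ${worker.process.pid} exited, restarting`);
    cluster.fork(); // keep the worker pool at full size
  });
} else {
  http
    .createServer((_req, res) => res.end(`handled by pid ${process.pid}`))
    .listen(3000);
}
```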
4.3 Kafka partitioning
- Kafka partitions can be rebalanced automatically when scaling out consumers within a consumer group.
4.4 Data layer
- DynamoDB operates in on-demand capacity mode, automatically scaling throughput per partition.
- PostgreSQL can be sharded by documentId and region to avoid hot documents and reduce latency for global users.
- S3 objects can be partitioned by key prefix using documentId, userId, and region.
As the system grows, resharding strategies may evolve to a hybrid partitioning scheme (document + region or document + user), optimizing data locality for global users while minimizing cross-region latency.
5. Distributed coordination
The objective is to achieve a loosely coupled, resilient, and independently scalable architecture across compute, storage, and messaging layers.
Coordination in the critical write path is minimized to ensure low latency and high throughput. Since LSEQ-based CRDT operations are commutative and idempotent, multiple producers can concurrently process updates without requiring centralized locking, serialization, or global coordination. Each client or node independently generates operation identifiers, and all replicas eventually converge to the same state regardless of the order of delivery.
This property enables massive horizontal scalability, allowing thousands of clients to edit shared documents simultaneously while maintaining strong eventual consistency.
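To make the convergence argument concrete, here is a simplified sketch (LSEQ structures heavily reduced for illustration) of how a replica can apply operations idempotently and order-independently:

```typescript
// Simplified model of LSEQ-style insert operations: each op carries a
// globally unique id (idempotence) and a dense position identifier
// (order-independence), so replicas converge without coordination.
type Operation = {
  opId: string;       // globally unique, generated by the client
  position: number[]; // LSEQ dense position identifier
  character: string;
};

class Replica {
  private seen = new Set<string>(); // idempotence guard
  private elements: { position: number[]; character: string }[] = [];

  apply(op: Operation): void {
    if (this.seen.has(op.opId)) return; // duplicate delivery: no-op
    this.seen.add(op.opId);
    this.elements.push({ position: op.position, character: op.character });
    // Sorting by position makes the result independent of delivery order,
    // so any two replicas that saw the same set of ops render the same text.
    this.elements.sort((a, b) => comparePositions(a.position, b.position));
  }

  render(): string {
    return this.elements.map((e) => e.character).join("");
  }
}

function comparePositions(a: number[], b: number[]): number {
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const diff = (a[i] ?? 0) - (b[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}
```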
Coordination is introduced only where strictly necessary, primarily for system-wide orchestration, load balancing, and infrastructure-level management. These tasks include autoscaling, schema evolution, and partition assignment.
5.1 Coordination Scope
Write Path (Minimal Coordination):
- Each producer independently handles CRDT operations and sends deltas to Kafka.
- There is no dependency on a central coordinator for conflict resolution or sequencing, since CRDTs inherently converge.
- Synchronization between replicas occurs asynchronously via Kafka topics, ensuring strong eventual consistency.
Operational Path (Lightweight Coordination):
- Managing partition assignments in Kafka consumers.
- Performing schema migrations or snapshot compactions.
- Scaling nodes in response to workload variations.
- Monitoring health and maintaining distributed locks for specific background jobs (e.g., cleanup, snapshotting).
5.2 Mechanisms for coordination:
WebSockets – Real-time Event Propagation
- Clients detect and propagate local edits via persistent WebSocket connections.
- The server uses these connections to broadcast deltas and synchronization messages across clients editing the same document.
- Because WebSocket sessions are stateless from the server's perspective (linked only to Cognito tokens), reconnection and failover are seamless.
Kafka – Event Ordering and Backpressure Control
- Kafka ensures event ordering within a document's partition and acts as a coordination layer for stream processing.
- Producers write asynchronously to Kafka topics, and consumers (the Conflict Synchronizer service) process these events sequentially per partition.
- Backpressure is handled by Kafka's internal buffering and configurable consumer lag thresholds, preventing overload during traffic spikes (see the consumer sketch after this list).
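A consumer-side sketch (kafkajs; the topic, group, and applyToSnapshot names are assumptions) showing sequential per-partition processing with pause-based backpressure:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "conflict-synchronizer", brokers: ["broker:9092"] });
const consumer = kafka.consumer({ groupId: "conflict-synchronizer" });

// applyToSnapshot stands in for the downstream work (e.g., updating
// PostgreSQL snapshots); it is an assumption of this sketch.
declare function applyToSnapshot(documentId: string | undefined, op: string | undefined): Promise<void>;

await consumer.connect();
await consumer.subscribe({ topics: ["document-operations"] });

await consumer.run({
  // kafkajs invokes eachMessage sequentially per partition, so operations
  // for a given document are processed in order.
  eachMessage: async ({ message, pause }) => {
    try {
      await applyToSnapshot(message.key?.toString(), message.value?.toString());
    } catch (err) {
      // Backpressure: pause this partition instead of crashing; kafkajs
      // redelivers the failed message once the partition resumes.
      const resume = pause();
      setTimeout(resume, 5_000);
      throw err;
    }
  },
});
```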
AWS Auto Scaling Groups (ASG) / Kubernetes
- Infrastructure-level coordination is managed through ASG or the Kubernetes Horizontal Pod Autoscaler (HPA).
- Scaling events are triggered based on CPU, memory, or Kafka lag metrics.
- Rolling updates and graceful node replacements ensure continuous availability during deployments.
Optional Coordination Tools – Metadata & Partition Management
- Partition rebalancing.
- Cluster metadata synchronization.
- Leader election and failover handling.
These concerns are managed by Kafka (AWS MSK), ZooKeeper, or the MSK control plane.
These mechanisms are transparent to application-level logic and operate independently from the CRDT consistency model.
5.3 Failure tolerance and conflict handling:
The system is designed for high resilience under node failures or regional degradation. Each service operates independently, leveraging CRDT-based operations to maintain eventual consistency even when nodes or entire regions temporarily lose connectivity.
- Node-level failures: Recovered automatically through stateless services and event replay from DynamoDB or Kafka logs.
- Conflict reconciliation: Since CRDT operations are commutative and idempotent, concurrent writes from disconnected nodes are merged deterministically upon reconnection.
- Graceful degradation: If coordination layers (Kafka, WebSockets, or autoscaling orchestrators) experience partial failure, services continue processing locally and reconcile once the system stabilizes.
- Backpressure handling: Kafka topics and batching strategies prevent overload on ingestion pipelines, isolating failures and avoiding cascading slowdowns.
This approach ensures that temporary inconsistencies or outages do not compromise global integrity or data durability.
5.4 Cross-region coordination and failover strategy:
Given the system's offline operation support and eventual consistency model, the architecture adopts a Warm Standby with cross-region replication strategy instead of a costly Active-Active setup.
Primary and standby regions:
- One primary region handles active writes and reads, while a secondary (warm standby) region maintains asynchronous replicas of PostgreSQL snapshots, DynamoDB event streams, and S3 document assets.
Data replication:
- DynamoDB: Global tables replicate event streams asynchronously with minimal lag.
- PostgreSQL: Logical replication streams maintain up-to-date snapshots in the standby region.
- S3: Cross-region replication ensures persisted assets remain durable and geographically distributed.
Failover mechanism:
- Upon primary region failure, routing (e.g., via Route 53 or Global Accelerator) redirects clients to the standby region, which promotes replicas to primary within minutes.
Deployment strategy:
- Blue/Green or rolling deployments across regions ensure zero-downtime upgrades and controlled propagation of schema or topology changes.
Recovery point and time objectives (RPO/RTO):
- The warm standby model achieves low RPO (near-real-time replication) and moderate RTO (a few minutes to promote standby), aligning with business continuity needs at a sustainable cost.
This approach balances availability, cost-efficiency, and operational simplicity, leveraging the natural resilience of CRDT-based offline operations to minimize impact during regional disruptions.
5.5 Observability and metrics:
- Logs, metrics, and alerts monitor rebalancing, scaling events, and distributed operations.
- This enables proactive detection of bottlenecks or anomalies in coordination processes.
6. Sharding Strategies
Sharding in our system is about distributing data and workload across storage, messaging, compute, and cache to avoid hotspots and scale efficiently. DynamoDB is only the event source, so sharding there is handled automatically by AWS and doesn't need manual partitioning.
6.1 Document-based sharding
Each document is assigned to a shard in PostgreSQL for snapshots. Kafka topics or partitions handle operations for that document, and workers process operations dedicated to those documents. This works well when edits are concentrated on specific documents, but a “hot” document can temporarily overload its shard. We mitigate this using local buffers, CRDT batching, and dynamic partition reassignment if needed.
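As an illustration, a deterministic document-to-shard mapping might look like the sketch below (FNV-1a is used as a stand-in hash; Kafka's default partitioner actually uses murmur2):

```typescript
// Illustrative deterministic routing: hash the documentId and map it onto
// a fixed number of shards/partitions, so the same document always lands
// on the same PostgreSQL shard and Kafka partition.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

export function shardFor(documentId: string, shardCount: number): number {
  return fnv1a(documentId) % shardCount;
}

// Example: shardFor("doc-42", 16) always yields the same shard index.
```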
6.2 User-based sharding
All operations from a single user are grouped in the same shard in PostgreSQL and processed by specific workers. This distributes load evenly across shards and compute nodes, but cross-document collaboration may generate extra cross-shard traffic in Kafka or compute layers.
6.3 Geographic sharding
Shards are separated by region, so users connect to their closest region for low-latency reads and writes. Kafka replication and PostgreSQL replication allow eventual consistency across regions. Workers process events locally, and Redis or other caches are segmented per region to keep hot data close to users.
6.4 Hybrid sharding
Combines two or more strategies, for example document + region or user + region. This balances load, reduces latency, and minimizes hotspots, but requires monitoring and dynamic redistribution as the system grows.
6.5 Trade-offs summary:
- Document-based: simple and deterministic, but hot documents can overload shards.
- User-based: evenly distributes load, but increases cross-shard traffic for shared documents.
- Geographic: low latency for local users, but adds complexity for replication and consistency.
- Hybrid: combines benefits, but increases operational complexity.
6.6 Recommended sharding strategy
Document-based sharding as the core
- Each document is assigned to a shard in PostgreSQL and Kafka.
- Workers process operations per document, keeping edits localized.
- Hot documents can still occur, but buffering, CRDT batching, and dynamic reassignment help mitigate spikes.
Hybrid with geographic partitioning
- Each document shard is replicated to the nearest region.
- Users connect to the closest region, reducing read/write latency (see the routing sketch after this list).
- Cross-region replication ensures eventual consistency, while conflict resolution is handled by CRDTs.
- Caches (Redis or similar) are segmented per region, keeping hot data local.
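A hypothetical hybrid-placement sketch (region names and the Placement shape are assumptions; shardFor is the helper from the earlier sketch): the user's nearest region is combined with the document's deterministic shard, keeping reads and writes local while the shard logic stays simple.

```typescript
// shardFor is defined in the earlier document-based sharding sketch.
declare function shardFor(documentId: string, shardCount: number): number;

type Region = "us-east-1" | "eu-west-1" | "ap-southeast-1"; // assumed regions

interface Placement {
  region: Region; // chosen by client proximity (e.g., latency-based routing)
  shard: number;  // derived from the documentId, stable across regions
}

export function placementFor(
  documentId: string,
  nearestRegion: Region,
  shardCount: number
): Placement {
  return { region: nearestRegion, shard: shardFor(documentId, shardCount) };
}
```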
Trade-offs
- Keeps shard logic simple and deterministic.
- Hot documents are easier to manage because each shard corresponds to a single document.
- Cross-region replication adds some complexity, but it's manageable with eventual consistency and CRDTs.
- User-based load balancing is handled naturally by connecting users to the nearest region, avoiding the complexity of sharding by user.