Lessons from Building a Scalable Cloud Connectivity Platform

Suren Raju
14 min read · Jan 26, 2025


TL;DR:

When your organization reaches a certain level of maturity and transitions to a multi-account setup, it’s important to shift your perspective on networking. Avoid looking at it from a tenant-level view — zoom out and consider the network globally across multiple tenants.

Design your network for extensibility from the start. Ensure your provisioning process is agile yet reliable to handle frequent changes without risking outages.

Focus on centralized control, a unified control plane, modular architecture, rigorous testing, and robust change management to scale your cloud networking effectively.

CIDRs, Subnets & Secrets: Is Your Cloud Networking Built on Tribal Knowledge?

• Is your cloud network still configured manually?

• Are CIDRs and subnets inconsistent or conflicting?

• Does something break every time you make a network change?

• Is the one person who understands the network on vacation, leaving everything at a standstill?

• Are your cloud networking costs spiraling out of control?

• Is performance throttled by network bottlenecks?

• Does adding a new account, region, or multi-cloud/on-prem connectivity feel like an endless process, delaying your time to market?

If you nodded along to any of these, you’re not alone — and this blog post is for you. Let’s dive into the lessons I’ve learned from building a scalable cloud connectivity platform to solve these challenges once and for all.

What is a Cloud Connectivity Platform?

A Cloud Connectivity Platform is a unified system that integrates cloud networking services and proxy technologies to provide secure, “any-to-any” connectivity.

What does “any-to-any” connectivity mean?

When I think of “any-to-any” connectivity, I think of scenarios I’ve encountered repeatedly:

  • Applications communicating with other applications.
  • Applications securely accessing databases.
  • Applications accessing APIs securely.
  • Systems reaching the internet or connecting across multi-region and multi-cloud environments.
  • Integrating on-prem and branch office networks with cloud infrastructure.
Cloud Connectivity Platform

The Three Layers of a Cloud Connectivity Platform

In my experience, a well-designed platform is built on three foundational layers:

1. Cloud Networking Layer

This layer handles the foundational infrastructure — VPCs, VNets, and gateways. It’s what ensures global “any-to-any” connectivity and keeps the underlying plumbing intact.

2. Application Networking Layer

Think API gateways, ingress/egress gateways, and sidecar proxies. This layer secures application-to-application communication and abstracts away complexity, so developers can focus on building features, not debugging network configurations.

The efficiency of the application networking layer depends entirely on the cloud networking layer. If the underlying network isn’t optimized, even the best-designed application networking layer will struggle.

Whether the cloud networking layer and application networking layer are managed by the same team or by different teams, they must work in cohesion to ensure efficiency.

3. Platform Abstraction Layer

This is the magic sauce that hides the complexity of networking from developers and operators. It provides a unified interface, so teams don’t have to get into the weeds of CIDRs, subnets, or route tables.

Observability and security are woven across all three layers, but I’ll save the details of those for another conversation. Here, we’ll focus on the networking aspects.

Key Features of a Cloud Connectivity Platform

When I reflect on the platforms I’ve helped build and operate, these are the key capabilities that make them successful:

1. “Any-to-Any” Connectivity

A platform must deeply integrate with cloud networks, regions, and on-prem systems. It should make cross-region, cross-cloud, and cross-environment communication feel seamless.

2. Platform Abstraction

One thing I’ve learned is that simplicity is critical. A good platform abstracts the underlying networking complexities:

  • SREs who deploy compute, storage, and other services shouldn’t have to worry about connectivity.
  • Application teams shouldn’t even have to think about CIDRs or subnets. They just need connectivity to work.

3. Unified Observability

Visibility is non-negotiable. A robust platform provides insights at every level:

  • Layer 4: Flow logs provide insights to troubleshoot connectivity and security issues. This information helps networking SREs understand data flow, reason about network costs, and refine the network design for improved efficiency and performance.
  • Layer 7: Application-layer metrics, such as access logs, can help you troubleshoot issues related to Layer 7 connectivity.
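To make the Layer 4 point concrete, here is a minimal sketch of flow-log analysis in Python. It assumes the default AWS VPC Flow Logs field order; verify the format against your own log configuration before relying on it.

```python
from collections import defaultdict

# Default VPC Flow Logs record fields (space-separated). This follows AWS's
# documented default format, but check your flow log configuration: custom
# formats reorder or omit fields.
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]

def parse_record(line: str) -> dict:
    """Split one flow-log line into a field-name -> value dict."""
    return dict(zip(FIELDS, line.split()))

def bytes_by_flow(lines: list[str]) -> dict[tuple[str, str], int]:
    """Aggregate transferred bytes per (src, dst) pair for accepted traffic.

    This is the kind of rollup that lets a networking SRE attribute data
    transfer costs to specific flows and spot unexpected heavy hitters.
    """
    totals: dict[tuple[str, str], int] = defaultdict(int)
    for line in lines:
        rec = parse_record(line)
        if rec["action"] == "ACCEPT":
            totals[(rec["srcaddr"], rec["dstaddr"])] += int(rec["bytes"])
    return dict(totals)
```

From here, sorting the totals descending gives a quick "top talkers" view that often explains a surprising NAT or cross-AZ bill.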

4. Agility and Testing

Networks are constantly evolving — new regions, new accounts, new environments. A scalable platform must adapt without breaking.

Rigorous testing is key. I’ve seen firsthand how poorly tested changes can lead to catastrophic outages. A solid connectivity platform includes tools to validate configurations before they go live.

Part 1: Lessons from Building a Scalable Cloud Connectivity Platform — Cloud Networking

In this blog, I’ll dive into the foundational layer: Cloud Networking, sharing key lessons and insights. For lessons on Application Networking and the Platform Abstraction Layer, stay tuned for Part 2 — it’s coming soon!

The Evolution of Cloud Networking: YellaTalk — A Startup’s Cloud Journey

Let’s explore the journey of an imaginary startup, YellaTalk, a peer-to-peer communication app, as it navigates cloud networking challenges.

As a startup building on the cloud, the initial focus is typically on delivering features rather than perfecting infrastructure. The reasons might include:

  • Limited resources: Startups often lack the capacity to build ideal cloud infrastructure upfront. They might not even have dedicated DevOps engineers or SREs, leaving developers to handle initial cloud provisioning.
  • Overengineering risk: Even with an experienced SRE team, implementing a best-practice landing zone architecture from the start could be unnecessary and overly complex.

For those unfamiliar, a landing zone architecture is a best-practice account structure provided by cloud providers like AWS, Azure, and GCP. It’s typically adopted by large companies or those with strict compliance requirements. However, for startups like YellaTalk, it can be too complex to implement in the early stages.

YellaTalk’s Initial Setup

To get started, YellaTalk’s engineers create two AWS accounts in the UAE region: one for staging and another for production. Using Terraform, they provision VPC CIDRs, subnets, and other resources like compute and storage.

Scaling with Spinouts

As YellaTalk achieves success, they decide to launch a spinout, YellaShop, an e-commerce platform. To manage budgets separately, engineers provision new staging and production accounts for YellaShop.

As developers deploy services for YellaShop, they realize the need to access platform services like identity, profiles, and notifications hosted on YellaTalk. While they could use public APIs, latency concerns arise since traffic must traverse the internet.

Networking Simplicity: VPC Peering

A straightforward solution is implemented: VPC peering between YellaTalk and YellaShop. This approach eliminates the need to route traffic through the internet, reducing latency.
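As a rough sketch, a peering connection like this could be provisioned with boto3. The parameter names follow the EC2 CreateVpcPeeringConnection API, but treat this as illustrative rather than production code:

```python
def peering_request(requester_vpc_id: str, accepter_vpc_id: str,
                    accepter_account_id: str, accepter_region: str) -> dict:
    """Build the parameters for an EC2 CreateVpcPeeringConnection call.

    Kept pure (no API calls) so it can be unit-tested offline.
    """
    return {
        "VpcId": requester_vpc_id,
        "PeerVpcId": accepter_vpc_id,
        "PeerOwnerId": accepter_account_id,
        "PeerRegion": accepter_region,
    }

def create_peering(params: dict) -> str:
    """Issue the API call and return the peering connection ID.

    Requires boto3 plus credentials with ec2:CreateVpcPeeringConnection;
    the accepter account must still accept the request afterwards.
    """
    import boto3  # imported lazily so the pure helper above works without it
    ec2 = boto3.client("ec2")
    resp = ec2.create_vpc_peering_connection(**params)
    return resp["VpcPeeringConnection"]["VpcPeeringConnectionId"]
```

Keep in mind that peering is non-transitive and still requires acceptance on the peer side plus route-table entries in both VPCs, which is exactly why this approach stops scaling later in the story.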

Further Growth: Adding YellaPay

YellaShop grows, prompting the creation of YellaPay, a secure payment platform hosted in its own AWS accounts. Services in YellaTalk and YellaShop need to access payment APIs without relying on internet traffic. The team implements VPC endpoints in YellaPay to enable private connectivity.

As YellaPay services grow, they also require access to platform services hosted on YellaTalk. VPC endpoints are added to the YellaTalk account for services like identity and Kafka. Similarly, YellaPay ships logs and metrics to observability systems hosted in YellaTalk, adding even more endpoints.

The Complexity Begins

As more services and accounts are introduced, managing VPC endpoints and interconnectivity becomes increasingly complex. Questions arise:

  • Which team manages day-two operations of these endpoints?
  • How will you manage connectivity if YellaPay needs to be hosted in Saudi Arabia for data residency requirements?
  • What about providing connectivity to GCP or Azure for data analytics or AI use cases?

Hitting the Complexity Wall

As Yella grows, the once-simple cloud networking setup becomes a bottleneck. What began as a straightforward, developer-driven model struggles to scale with the increasing workload diversity and geographic expansion.

It’s crucial to recognize the symptoms of hitting the “complexity wall” early and address cloud connectivity challenges before they escalate. Once complexity reaches a certain threshold, it becomes difficult to justify the risks and costs associated with a major network redesign, which could involve downtime and operational risks.

Common Challenges in Scaling Cloud Networking

Scaling cloud networking isn’t just about infrastructure — it’s about people, processes, and the inevitable challenges that come with growth. Here are three key categories of challenges I’ve noticed when it comes to scaling cloud networking:

1. Fragmented and Manual Configurations

Point-to-point solutions don’t scale: Fragmented network setups, like multiple VPC peering connections or scattered VPC endpoints, can quickly become a bottleneck. Maintaining dozens (or even hundreds) of these isolated configurations is operationally challenging and unsustainable.

Lack of a clear process for network changes: Without a well-defined process, manual changes can introduce drift between your state files and the actual infrastructure. The same applies to inconsistencies across accounts, making troubleshooting and scaling a nightmare.

Inconsistent tagging practices: If you don’t have standardized tagging for your networking resources, things can spiral out of control. Startups often skip this step early on because they don’t have a dedicated networking team. But when you scale, this lack of standardization makes it almost impossible to maintain visibility into what resources are provisioned and why. Centralized visibility is critical to avoid this.
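A lightweight guardrail here is a scheduled audit that flags untagged networking resources. The sketch below assumes resources in the shape returned by boto3’s describe_vpcs (a “Tags” list of Key/Value dicts) and a hypothetical required-tag policy; adapt both to your own inventory and standards:

```python
# Example tag policy; a hypothetical set, replace with your organization's
# actual required tag keys.
REQUIRED_TAGS = {"owner", "environment", "cost-center"}

def missing_tags(resource: dict) -> set[str]:
    """Return the required tag keys a resource is missing.

    Expects the AWS-style shape: {"Tags": [{"Key": ..., "Value": ...}, ...]}.
    """
    present = {t["Key"] for t in resource.get("Tags", [])}
    return REQUIRED_TAGS - present

def untagged_report(vpcs: list[dict]) -> dict[str, set[str]]:
    """Map VPC IDs to their missing required tags; empty dict means compliant."""
    return {v["VpcId"]: m for v in vpcs if (m := missing_tags(v))}
```

Running a report like this on a schedule (and failing CI when new resources are non-compliant) is far cheaper than reconstructing ownership from tribal knowledge later.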

2. Day-2 Operations

Time-consuming and error-prone changes: Changing network configurations often becomes a manual, complex, and time-intensive process. If your team relies heavily on one or two key people for these changes, you’re in trouble if they’re unavailable — whether it’s a vacation or a sudden departure.

Unintended changes: If everyone on the Infra team has access to manage networking resources like VPC endpoints, load balancers, or DNS records, it’s only a matter of time before an unintended change breaks something. Without proper processes and permissions in place, this risk increases as the organization scales.

Lack of centralized visibility: Without proper tagging and centralized monitoring, your team is effectively flying blind. No one really understands what the underlying network looks like, and that makes it almost impossible to operate efficiently.

3. Security and Cost

Limited visibility into traffic patterns: If you don’t have insights into how traffic flows within your infrastructure or between your infrastructure and the internet, you’re opening the door to security risks. This lack of visibility also makes it harder to optimize costs.

Unexpected costs: Without careful monitoring, traffic patterns — like egress traffic, NAT gateway usage, or inter-AZ data transfer — can rack up significant costs. These issues tend to sneak up on teams, especially when there’s no cost-awareness strategy in place.

Why Does Cloud Networking Need to Be Agile?

Let’s talk about why agility is just as critical for cloud networking as it is for delivering applications. In today’s fast-paced world, businesses prioritize speed — rolling out features, scaling operations, and meeting user demands. But here’s the catch: if your networking layer isn’t as agile as the rest of your stack, it becomes a bottleneck.

Cloud networking must be adaptable, flexible, and reliable to support changes like these:

1. Business Growth and Expansion

As businesses grow, the network needs to grow with them:

  • Multi-Region Deployments: Expanding into new regions improves user experience and enables disaster recovery. Your network must handle this seamlessly.
  • Multi-Cloud Strategies: You might start in AWS, but as needs evolve — say for AI, analytics, or compliance — you might add Azure or GCP into the mix. Your network has to connect these clouds without missing a beat.

2. Acquisitions and Mergers

If your company acquires another business, you’re suddenly tasked with integrating two separate networks. That might mean:

  • Connecting cloud environments.
  • Resolving overlapping IP ranges.
  • Rebuilding parts of the network to make everything compatible.

Without an agile network, this can slow down business integration significantly.
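One concrete, automatable piece of this integration work is detecting overlapping IP ranges before connecting the two networks. Python’s standard ipaddress module handles the check directly:

```python
from ipaddress import ip_network
from itertools import combinations

def find_overlaps(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of network names whose CIDR ranges overlap.

    Run this across both companies' address plans before provisioning any
    peering or transit gateway attachments: overlapping ranges mean routing
    cannot distinguish the two networks and NAT or re-IP work is needed.
    """
    nets = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2)
            if nets[a].overlaps(nets[b])]
```

The names and CIDRs below are illustrative, but the pattern generalizes to any merged address plan.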

3. Cost Optimization

Networking costs can spiral out of control without proper planning. For example:

  • Data transfers between availability zones (cross-AZ).
  • Overuse of NAT gateways and VPC endpoints.

An agile network allows you to quickly adapt and optimize costs.

4. Security and Compliance

As companies mature, security and compliance requirements become stricter, and the network needs to adapt. For example:

  • Environment segregation: Isolating environments to meet security or regulatory standards.
  • Enhanced security: Adding service meshes or firewalls to protect east-west traffic within your network.
  • Compliance changes: Laws like GDPR or HIPAA may require data residency, stricter access controls, or auditability.

Solving Yella Corp’s Networking Challenges

Let’s take a step back and explore a potential solution to address Yella Corp’s growing networking complexity.

Introducing a Central Networking Account with a Transit Gateway

One effective approach is to implement a central networking account and leverage AWS Transit Gateway to manage network traffic globally. This creates a hub-and-spoke networking model, where the central account acts as the hub and connects to multiple spoke VPCs.

The hub-and-spoke model

The advantages of this approach include:

Simplified traffic management: The central hub handles both ingress (traffic entering the network) and egress (traffic leaving the network), instead of managing these flows at each spoke network.

Centralized security: A central firewall in the networking account can inspect north-south (external to internal) and east-west (internal-to-internal) traffic, providing a unified security layer.

Scalability: The transit gateway can extend connectivity across VPCs, regions, and even multiple cloud providers (for hybrid or multi-cloud setups).

For more details, AWS provides an excellent explanation of the hub-and-spoke model in their Well-Architected Framework documentation.
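To illustrate the hub-and-spoke routing pattern, here is a small sketch that generates per-spoke route entries pointing at the transit gateway. In practice you would often use a single supernet or default route instead of enumerating every spoke; this version spells them out for clarity, and the IDs are hypothetical:

```python
def spoke_routes(spokes: dict[str, str], tgw_id: str) -> dict[str, list[dict]]:
    """For each spoke VPC, generate route-table entries sending traffic for
    every *other* spoke's CIDR to the transit gateway (the hub).

    spokes: mapping of spoke name -> CIDR block.
    Returns AWS-style route dicts per spoke, suitable as input to whatever
    provisioning layer actually creates the routes.
    """
    routes: dict[str, list[dict]] = {}
    for name in spokes:
        routes[name] = [
            {"DestinationCidrBlock": cidr, "TransitGatewayId": tgw_id}
            for other, cidr in spokes.items() if other != name
        ]
    return routes
```

Generating routes from a single source of truth like this, rather than hand-editing each spoke’s route table, is what keeps the hub-and-spoke model manageable as spokes multiply.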

The Bigger Picture: Think Globally

I want to emphasize that my intention isn’t to suggest Transit Gateway as the one-size-fits-all solution. Instead, it’s about shifting the mindset to view networking globally rather than managing it piecemeal at the individual tenant or account level.

5 Lessons from Building Scalable Cloud Networking

Based on my experience working with large-scale enterprises like Careem and SAP, here are five key lessons for building a scalable cloud networking platform:

Lesson 1: Centralized Control for Network Management

Networking resources must be protected from unintended changes. A small mistake in the network layer can have a massive impact, which makes centralized control essential.

1. Separation of Concerns:

  • Keep networking provisioning separate from infrastructure provisioning.
  • This prevents unintentional changes during routine deployments.

2. Role-Based Access Control (RBAC):

  • Create specific roles for networking administrators and control-plane operators with limited, well-defined permissions.
  • Restrict direct modifications of networking resources through UI portals or management consoles.
  • Assign a dedicated Networking SRE team to manage critical components like VPCs, subnets, route tables, and gateways.
  • Limit access for other teams to avoid accidental changes.

3. Organizational Policies:

  • Set policies at the organizational level to block destructive actions, such as deleting critical resources like VPCs or Transit Gateway attachments.
  • For sensitive actions, enforce privilege escalation with multiple levels of approval.
  • Always test and document changes in non-production environments before applying them in production.
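As an illustration of such an organizational policy, here is a sketch of an AWS Service Control Policy, expressed as a Python dict, that denies destructive network actions to everyone except a hypothetical networking SRE role. The action list and role name are examples to adapt, not a vetted policy:

```python
import json

# Sketch of an SCP denying destructive network actions org-wide, with a
# carve-out for a dedicated networking role. The actions listed are
# illustrative; align them with your own resource inventory, and replace
# the role ARN pattern (hypothetical) with your networking SRE role.
DENY_NETWORK_DELETES = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "ProtectCoreNetwork",
        "Effect": "Deny",
        "Action": [
            "ec2:DeleteVpc",
            "ec2:DeleteSubnet",
            "ec2:DeleteRouteTable",
            "ec2:DeleteTransitGatewayVpcAttachment",
        ],
        "Resource": "*",
        "Condition": {
            "StringNotLike": {
                "aws:PrincipalArn": "arn:aws:iam::*:role/networking-sre"
            }
        },
    }],
}

print(json.dumps(DENY_NETWORK_DELETES, indent=2))
```

Attached at the organization or OU level, a policy like this turns “please don’t delete the transit gateway attachment” from a convention into an enforced guardrail.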

Lesson 2: Centralize Cloud Connectivity with a Unified Control Plane

Always maintain a global view of your networking by implementing a centralized control plane to configure data plane components like Transit Gateways.

1. Unified Control Plane Options:

  • A simple solution could be using a GitHub pipeline with Terraform to provision resources.
  • You could also use Pulumi for managing configurations or tools like Crossplane in the Kubernetes ecosystem to handle data plane components like Transit Gateways.
  • Alternatively, you can build your own control plane tailored to your organization’s needs.

2. Flexible Data Plane Options:

  • Start simple, such as with a Transit Gateway or a VPN Gateway.
  • As you scale, you can adapt and combine different technologies to meet evolving connectivity needs.
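Whatever tooling you choose, the heart of a control plane is a reconciliation step: compare desired state against actual state and derive the changes to apply. A minimal sketch, using transit gateway attachments as the example resource:

```python
def plan_attachments(desired: set[str], actual: set[str]) -> dict[str, set[str]]:
    """Reconcile desired vs. actual transit-gateway attachments.

    Returns the VPC IDs that need an attachment created and those whose
    attachment should be removed. Terraform, Pulumi, and Crossplane all do
    a richer version of exactly this diff internally; making it explicit
    helps when building your own control plane.
    """
    return {
        "create": desired - actual,
        "delete": actual - desired,
    }
```

Everything else in a control plane (inventory, approvals, rollout ordering) wraps around this diff; keeping it pure and testable pays off as the data plane grows.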

Here is a reference implementation of a control plane built with modular Terraform modules: Control Plane Implementation.

Simple Control Plane
Control Plane Spanning Multiple Clouds

Lesson 3: Modular and Composable Architecture

In application development, moving from monolithic to modular architectures has enabled faster innovation and reduced complexity. The same principle applies to networking: tightly coupled network components create bottlenecks and increase risk. A modular, composable architecture allows for testing, making changes, and isolating issues without impacting the wider network infrastructure.

Modular Infrastructure Provisioning Scripts

Think of infrastructure provisioning as APIs. Write modular APIs that handle specific tasks. For example:

def create_vpc(req: "VPC"):
    ...  # Logic to create a VPC

def create_tgw_attachment(req: "TGWAttachment"):
    ...  # Logic to attach a VPC to a Transit Gateway

def create_zone_attachment(req):
    ...  # Logic to attach a DNS zone

With this approach, you can test these scripts independently, ensuring reliability before integration. Reference modular design patterns like Terraform module creation.

The benefits of this approach include the ability to independently test smaller modules and stack them together to build larger components, enabling seamless interoperability testing.

Modular Network Design

Monolithic designs, such as a single, large route table managing all traffic or a single VPC housing all applications and environments, make it challenging to test configurations, isolate changes, or contain issues.

Instead, adopt a modular approach:

  • Split Monolithic Accounts or VPCs: Create separate accounts or VPCs for shared services, observability, and data platforms to reduce complexity, standardize configurations, and improve routing efficiency.

  • Isolate Network Components: Manage subnets, gateways, and route tables as independent, testable components to enable changes without affecting the entire network.

Lesson 4: Testing Your Network Changes

Just like in application development, testing infrastructure changes is critical for rolling out reliable network updates. Without proper testing, even minor changes can lead to downtime or performance issues. Here’s how to approach it:

1. Unit Testing

  • Start with your network provisioning scripts.
  • Validate input parameters to catch misconfigurations early.
  • Check for proper error handling in failure scenarios.
  • Ensure individual functions (e.g., creating a VPC or Transit Gateway attachment) work as intended.

You can implement policies on IaC plans to enforce validations and guardrails.

For example, I have used Conftest to block the deletion of resources. You can find more details here: Validating the Plan.
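If you prefer plain Python over Rego for this kind of guardrail, the same check can be sketched over the JSON output of `terraform show -json tfplan`, failing the pipeline when a protected resource type would be deleted. The field names follow Terraform’s documented plan representation:

```python
import json

def blocked_deletions(plan_json: str, protected_types: set[str]) -> list[str]:
    """Return the addresses of protected resources a Terraform plan would delete.

    plan_json: output of `terraform show -json tfplan`.
    protected_types: resource types that must never be deleted automatically,
    e.g. {"aws_vpc", "aws_ec2_transit_gateway"} (illustrative names).
    A CI step would fail the build whenever this returns a non-empty list.
    """
    plan = json.loads(plan_json)
    violations = []
    for rc in plan.get("resource_changes", []):
        if "delete" in rc["change"]["actions"] and rc["type"] in protected_types:
            violations.append(rc["address"])
    return violations
```

Note that a replace shows up as ["delete", "create"] in the actions list, so this check catches replacements of protected resources too.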

2. Integration Testing

  • Validate end-to-end connectivity between networks.
  • Test how network provisioning modules interact with each other (e.g., VPC-to-Transit Gateway, Transit Gateway-to-route tables).
  • Deploy changes in a staging environment to catch issues before production.

In the Traffic Platform proof of concept, I have implemented Terratest to test the Terraform modules. For more details, you can refer to the documentation here: Testing Terraform Modules.

3. Reachability and Synthetic Tests

  • Go beyond static testing with dynamic validation.
  • Auto-generate test cases based on the intended network design.
  • Test both success and failure scenarios for critical paths.
  • Run continuous synthetic probes to monitor reachability and performance.

Here is a reference implementation for auto-generating Reachability Tests:

Reachability Analyzer Validation.

You can run the analysis after every deployment to ensure nothing is broken.
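Synthetic probes can start as simply as a TCP connect check against critical endpoints, run on a schedule and after every deployment. A minimal sketch using only the standard library:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Synthetic probe: attempt a TCP connection and report success.

    Pointed at critical paths (app -> database, spoke -> shared service),
    a scheduled loop of probes like this catches broken routes or security
    group changes minutes after they happen, not when users complain.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A natural next step is to auto-generate the (host, port) pairs from the same source of truth that drives provisioning, so every intended path gets probed and nothing relies on a hand-maintained list.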

Lesson 5: Change Management Process

Managing changes to cloud networking requires a structured approach to ensure reliability and minimize disruption. Here’s how to handle it effectively:

1. Impact Analysis
Before making any changes, assess their potential effects on traffic, system performance, and dependent components. Document the risks and create a clear plan for both rollout and rollback procedures.

2. Pre-Change Testing and Validation
Always test changes in a non-production environment.

  • Validate the rollout and rollback process to ensure smooth recovery if needed.
  • Use tools like terraform plan or equivalent to preview and verify changes before deployment.

3. Change Windows
Schedule changes during low-traffic or non-business hours to reduce the impact on users. This minimizes the likelihood of disruptions during peak periods.

4. Break-the-Glass Protocol
For high-risk changes, such as deleting a Transit Gateway attachment or modifying route tables, implement a privilege escalation protocol to ensure proper oversight and accountability.

  • Require leadership-level approval for privilege escalation.
  • Thoroughly review and document all actions to maintain accountability and traceability.

In emergency situations, where urgent changes to critical resources are required, adopt a break-the-glass protocol. This protocol allows controlled, temporary access to management controls for resolving high-priority incidents. Ensure the process is well-documented and used only as a last resort.

Key Takeaways

When your organization reaches a certain level of maturity and transitions to a multi-account setup, it’s important to shift your perspective on networking. Avoid looking at it from a tenant-level view — zoom out and consider the network globally across multiple tenants.

Your network architecture will inevitably evolve, so it’s essential to design for extensibility from the start.

As your cloud networking design evolves, you will deploy numerous network configuration changes. Your network provisioning process must be agile to keep pace with the rest of your infrastructure and applications. At the same time, it must prioritize reliability, as network outages often have a far greater impact than typical application outages.

Adopting centralized control, a unified control plane, a modular and composable architecture, rigorous testing practices, and a robust change management process can enable you to scale cloud networking with both agility and reliability.
