As mobile device usage grows across the globe and regulatory frameworks keep evolving, ingesting and storing behavioral data at scale comes with a number of interesting challenges.
What is data residency?
Data residency describes where your data is stored at rest. Data residency requirements are typically dictated by local, national, and international governing bodies through laws and customs.
In practice, this means your data has to be stored in a location that complies with every regulation that applies to it.
Privacy regulations like GDPR in the European Union and CCPA in California are the major laws that usually come to mind for companies, but the landscape of data residency requirements is constantly changing. For example, countries like India and Brazil have comprehensive legislative frameworks that clearly state how private data needs to be stored and transferred.
Such legislation can, for instance, require companies to store a copy of sensitive or private data locally, to process it locally, and to obtain consent from either the individuals or the government before transferring data.
Enter the Control Plane / Data Plane Architecture
While developing the Moonsense Cloud, we chose a Control Plane / Data Plane architecture. In this architecture, much like in the separation of concerns principle, we’ve split responsibility between where the metadata is stored (control plane) and where the data itself is stored (regional data planes).
The control plane is responsible for the metadata around each recording session of behavioral data, from creating the session ID to storing attributes such as duration and configuration. As an aggregate metadata store, the control plane is the main API that the Moonsense Console uses to provide a single view into all the data that was captured.
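To make the control plane's role concrete, here is a minimal in-memory sketch of a metadata store. The class and field names (`SessionMetadata`, `ControlPlane`, `region`, `duration_ms`) are hypothetical and chosen for illustration; they are not the actual Moonsense API. The point is only that the control plane holds session IDs and metadata, never the behavioral payload itself.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SessionMetadata:
    """Metadata about one recording session (hypothetical fields)."""
    session_id: str
    region: str  # which regional data plane holds the payload
    config: Dict[str, str] = field(default_factory=dict)
    duration_ms: Optional[int] = None


class ControlPlane:
    """In-memory stand-in for a central, aggregate metadata store."""

    def __init__(self) -> None:
        self._sessions: Dict[str, SessionMetadata] = {}

    def create_session(
        self, region: str, config: Optional[Dict[str, str]] = None
    ) -> SessionMetadata:
        # The control plane mints the session ID; the payload stays regional.
        meta = SessionMetadata(
            session_id=uuid.uuid4().hex,
            region=region,
            config=dict(config or {}),
        )
        self._sessions[meta.session_id] = meta
        return meta

    def record_duration(self, session_id: str, duration_ms: int) -> None:
        self._sessions[session_id].duration_ms = duration_ms

    def list_sessions(self) -> List[SessionMetadata]:
        # Single view across all regions, as a console would need.
        return list(self._sessions.values())
```

A console-style client would call `list_sessions()` to get one view across regions and then use each entry's `region` to know which data plane to ask for the underlying data.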
Each regional data plane cluster is responsible for data ingestion and data storage. It interacts with the control plane via an API and is deployed regionally across the world. Since we use Google Cloud, we deploy these services in a number of the regions that GCP offers, and we route traffic to the closest region via a Geo IP mechanism.
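The article does not describe how the Geo IP mechanism works internally, so the following is only a sketch of one plausible approach: assume the client's IP has already been resolved to an approximate latitude and longitude, then pick the region with the smallest great-circle distance. The region list and its coordinates are rough, illustrative values, and `closest_region` is a hypothetical helper.

```python
import math
from typing import Dict, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees

# Approximate locations of a few GCP regions (illustrative values only).
REGIONS: Dict[str, Coord] = {
    "us-central1": (41.26, -95.86),          # Iowa
    "europe-west1": (50.45, 3.82),           # Belgium
    "asia-south1": (19.08, 72.88),           # Mumbai
    "southamerica-east1": (-23.55, -46.63),  # Sao Paulo
}


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points on Earth, in kilometers."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def closest_region(client: Coord, regions: Dict[str, Coord] = REGIONS) -> str:
    """Return the name of the region nearest to the client's location."""
    return min(regions, key=lambda name: haversine_km(client, regions[name]))
```

For example, a client geolocated to Paris, `closest_region((48.86, 2.35))`, would be routed to `europe-west1` under these assumed coordinates. Geographic distance is only a proxy for network latency; a production setup might rely on the load balancer's own routing instead.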
Advantages of regional data ingest and management
We’ve chosen to adopt this type of architecture for a number of reasons:
- Lower ingestion latency – Having regional data planes allows us to keep ingestion latency to a minimum, which makes behavioral data available as soon as possible for incorporation into our customers’ risk models.
- Regional storage & compute – In order to follow strict data residency requirements, we can spin up local regions in most geographies. This allows us to keep both storage and compute local without having to centralize data storage.
- For example, when it comes to local compute, we run local Dataflow jobs for certain data types that summarize the data in a privacy-aware fashion.
- Better scalability & increased reliability – We can size each regional data plane based on customer traffic and size the control plane separately, since the two have different workloads. Having multiple clusters deployed to different regions also means that during incidents we can still ingest data through a nearby region, so our customers don’t lose behavioral data.
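The failover behavior mentioned in the last point can be sketched as choosing the nearest region that is currently healthy. This is an assumption about how such a fallback could work, not a description of the actual Moonsense implementation; `pick_ingest_region` and its inputs are hypothetical.

```python
from typing import Dict, Optional, Set


def pick_ingest_region(
    distances_km: Dict[str, float], healthy: Set[str]
) -> Optional[str]:
    """Pick the nearest healthy region, or None if none is available.

    distances_km maps each candidate region to its distance from the client.
    The caller is assumed to include only regions that are acceptable for
    this client's data.
    """
    candidates = [region for region in distances_km if region in healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda region: distances_km[region])


# Example: the closest region is down, so ingestion falls back to the next one.
distances = {"europe-west1": 260.0, "europe-west3": 480.0, "us-central1": 7300.0}
fallback = pick_ingest_region(distances, healthy={"europe-west3", "us-central1"})
```

When residency rules apply, the candidate set would presumably be limited to regions permitted for that customer, so that failover never moves data outside an allowed geography.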
All in all, we found that this architecture serves our use case well – allowing flexibility while keeping complexity under control.
In a future article, we’ll discuss in more detail how we store data in the data planes and what type of database we use for the centralized control plane.