Modeling Data in CosmosDB: Transitioning Away from the SQL Paradigm

Nicolas DESCARTES

5.00/5 (1 vote)

Dec 11, 2023

CPOL

11 min read

5240

How to model data in Azure CosmosDB with an SQL background?

Introduction

Document-oriented databases are a type of NoSQL database that store and retrieve data in a format similar to JSON. In these databases, data is organized as documents, which are collections of key-value pairs or field-value pairs. Each document can contain nested structures, arrays, and other complex data types.

Unlike relational databases, document-oriented databases don't require a predefined schema, allowing for more flexibility in the data structure. This flexibility is particularly advantageous when dealing with dynamic or semi-structured data. Document-oriented databases are often used in scenarios where data can vary between different records.
Popular document-oriented databases include MongoDB, CouchDB, and Azure Cosmos DB. These databases are widely used in modern web applications, content management systems, and other scenarios where flexible and scalable data storage is essential.

In this article, we will delve into the core principles and demonstrate the shift between the relational paradigm and the document-oriented one. To illustrate the underlying concepts, we will model an ecommerce application, initially in a manner reminiscent of relational databases, and progressively migrate it towards a document-oriented approach.

We referred to the following book to elucidate certain concepts.

NoSQL Distilled (Sadalage, Fowler)

This article is an adaptation of a post originally published here.

What are the Conventional Approaches to Data Modeling?

Developers have long been accustomed to modeling data with relational databases. This paradigm is easy to understand and, in fact, allows us to implement many problems simply. Initiated by Codd in the seventies, it has since widely diffused, becoming a cornerstone technology, often unquestionably associated with SQL derivatives. However, since the late 2000s and the necessity to manage vast amounts of data efficiently, new concepts have emerged, introducing novel ways of modeling data. Shifting one's mindset to this new approach can be challenging, and it is our aim here to clarify all the notions involved.

Traditionally, data was modeled with a relational database and involved several key steps.

Identify the main entities in our system (objects or concepts about which we need to store information like customers, orders, products).
Define the relationships between entities and determine how entities are related to each other (e.g., a customer places an order, an order contains products, etc.).
Apply normalization techniques to eliminate redundancy and improve data integrity. Normalize tables ensure efficient storage and prevent unnecessary duplication.

We will now witness an example in action.

How to Model an Ecommerce Application in a Relational Database ?

We will start with a simple example that everyone is familiar with: an online store where people can place orders, view products, and write reviews. From a relational point of view, we first need to define the entities and the relationships between the different concepts. Let's briefly review them.

Products

Products are items sold on the online store. They are primarily characterized by their ID, name, description, unit price, image, URL and a category.

Feature	Type
`Id`	unique identifier
`Name`	string
`Description`	string
`UnitPrice`	decimal
`ImageUrl`	string
`Url`	string
`CategoryId`	unique identifier
...	...

Customers

Customers are individuals who purchase products on the online store. They are characterized by their ID, name, address, and other essential data.

Feature	Type
Id	unique identifier
Name	string
Country	string
...	...

Orders

Orders are placed by customers to purchase products. They are characterized by their reference and a list of order lines.

Feature	Type
`Reference`	unique identifier
`CustomerId`	unique identifier
`Lines`	List
...	...

Reviews

A review on a product is typically a written evaluation or assessment of a product by a customer. It often includes feedback and ratings regarding the product's features, performance, quality, and overall satisfaction.

Feature	Type
`ReviewId`	unique identifier
`CustomerId`	unique identifier
`ProductId`	unique identifier
`Comment`	string
`Rating`	integer
...	...

Once implemented in a relational database (for example, MySQL or SQL Server), the entities and relationships can look like the following:

Here are some fundamental aspects of this schema:

It comprises 8 tables. Typically, in SQL, it is a standard practice to create a table for each new concept or relationship between concepts and some databases can contain hundreds or thousands tables.
Each table adheres to a predefined and rigid schema, containing structured data. This enforces referential integrity.
SQL provides the capability to query a wide range of data. While some queries will execute incredibly quickly, others may be unreasonably slow.
SQL is a transactional store, allowing the datastore to maintain consistency, albeit at the cost of occasional performance trade-offs.
Scalability may pose a bottleneck.

Our objective is to transition this SQL schema to a document-oriented database schema. It's important to note that we are not passing judgment on whether SQL is superior or inferior to its document-oriented counterpart. Our aim is to illustrate how one can migrate from one model to the other, highlighting the key aspects of each.

What Is a Document-Oriented Database ?

A document-oriented database is a type of NoSQL database that is designed to store, retrieve, and manage semi-structured or unstructured data, typically in the form of documents. Unlike traditional relational databases that store data in tables with predefined schemas, document-oriented databases store data in flexible, schema-less documents.

What is Azure Cosmos DB?

Azure Cosmos DB is a multi-model, globally distributed database service provided by Microsoft. It is part of Azure database offerings and is designed to enable developers to build highly responsive and scalable applications with low-latency access to data worldwide. Azure Cosmos DB supports multiple data models, including document, key-value, graph, column-family, and wide-column store, making it a multi-model database. Here, we intend to utilize only the document-oriented functionality of this database.
Azure Cosmos DB supports various data models, it is built to scale both throughput and storage horizontally, allowing you to scale resources up or down based on demand, automatically indexes all the data, providing efficient and high-performance query capabilities.

This is the business subject presented by Microsoft (certainly, it is not unfounded, and we will explore later the significant importance of these features and why Microsoft places a strong emphasis on them). However, in this article, we will depend on the crucial concept of partitioning, which will empower us to build a highly scalable and efficient application.

What Is Partitioning?

In Azure Cosmos DB, partitioning is a fundamental concept that involves distributing our data across multiple physical partitions to achieve scalability and performance.

A physical partition is the underlying storage structure in the Cosmos DB service. Each physical partition is a unit of physical resource allocation (compute, storage, and throughput). The goal is to evenly distribute the provisioned throughput and data storage across these physical partitions. As our data and throughput requirements increase, Azure Cosmos DB will automatically add more physical partitions to distribute the load.
A logical Partition is a unit of data and a container for a set of items that share the same partition key value. The partition key is a property chosen by the user when creating a container. It is used to determine the logical partition to which an item belongs. The choice of partition key is critical as it affects the distribution of data across physical partitions and can impact query performance.

Very Important 1

All items within a logical partition are guaranteed to be collocated on the same physical partition.

Very Important 2

Queries that span multiple partitions (cross-partition queries) may result in higher latency and consume more throughput because they require coordination across multiple physical partitions.

Give Me an Example !

All of that is somewhat cumbersome, tedious, and verbose, so let's simplify with a concise example. Imagine a situation where we need to store information about car models.

Cars will be stored within a container in Azure Cosmos DB, and it is imperative that we proactively specify the partition key during the container creation process. One viable option is to capitalize on attributes such as the car model for this purpose.

One or more logical partitions can be mapped to a single physical partition. A single physical partition can accommodate one or more logical partitions, and conversely, one or more logical partitions can be associated with a single physical partition. However, each individual logical partition will consistently correspond to one and only one physical partition.

Query by Car Model

Consider a scenario where our objective is to retrieve a specific car model. The query is straightforward, and our model was crafted to efficiently support it. In this case, only one logical partition will be accessed, ensuring high efficiency.

Only one logical partition is accessed.

Query by Country

Now, envision a situation where we aim to retrieve all the cars associated with a particular country, such as Japan. Given that the country attribute is distributed across multiple partitions, this scenario exemplifies a cross-partition query.

Several partitions are accessed: cross-partition query

Cross-partition queries are not inherently detrimental, and there may be situations where resorting to them is unavoidable. However, their frequent execution or involvement of numerous accessed partitions can adversely impact performance. It's crucial to exercise caution, especially considering that Azure Cosmos DB operates on a pricing model tied to CPU utilization—increased usage of computational resources correlates with higher costs.

These considerations lead us to the following general guideline.

Very Important

When designing our Cosmos DB database, we should carefully choose the right partition key based on our access patterns and query requirements. A good partition key distributes data evenly across partitions and minimizes cross-partition queries. Partitioning is a key feature that enables Azure Cosmos DB to provide elastic scalability and high-performance access to our data, making it suitable for globally distributed and highly demanding applications.

As developers, our responsibility will ultimately entail meticulously dividing data into well-considered partitions, thereby necessitating thoughtful design of containers.

But How to Do That?

Information

Certainly, as a developer, it's not always feasible to predict every potential query in advance. The dynamic nature of user interactions and requirements can introduce unforeseen queries.

Indeed, while it's impossible to anticipate every conceivable query or unforeseen need, proactive consideration of the most significant and frequently occurring ones is a pragmatic approach. Acknowledging that some less frequent queries may not be as efficient, users often understand this trade-off. In engineering, such trade-offs are recognized as compromises, reflecting the balancing act between optimizing for common use cases and maintaining flexibility for less frequent scenarios.

Absolutely, we touch here upon a fundamental aspect of document-oriented databases, and the NoSQL paradigm in general. Unlike relational databases where the structure is more rigid and schema-focused, in NoSQL databases, especially document-oriented ones, there's a need to think about queries upfront. The flexibility offered by NoSQL databases comes with the responsibility of designing data structures and choosing partition keys that align with the anticipated queries and usage patterns. This proactive consideration allows for better performance and scalability in handling diverse data access needs.

In Summary

It's important to design our data model and queries with the partitioning strategy in mind to optimize performance.

And Concretely?

When contemplating the earlier-discussed scenarios, what should be our approach to partitioning? The decision hinges on identifying the most frequent query, which, in reality, is contingent on the specific application's requirements. Ultimately, the solution may involve creating two containers, each employing its own partition strategy—one for the car model and the other for the country. The application will be tasked with determining the appropriate container, streamlining the process without necessitating prolonged deliberation on our part.

Now, envision the scenario where we need to compile a list of all the dealers associated with each car model—such as identifying which dealer is selling a Renault Clio V or a Ford Mondeo. In this instance, it would be advantageous to utilize a container with a model-centric strategy. The good news is that we already have such a container in place! All that remains is to incorporate our dealers into this existing container.

Now, within the container partitioned by model, we find two types of entities: the specified cars and various dealers. However, the challenge lies in distinguishing between them. The solution is straightforward: by introducing a new property into each item.

Important

This is where the Microsoft commercial pitch becomes instrumental and completes the loop: enabling automatic indexing for ALL attributes facilitates swift querying within a specific partition (as in our case with the type property), while simultaneously offering the flexibility to store schemaless data in the same container, encompassing both dealers and cars.

Can We Do Better?

When incorporating dealers into our application, an alternative approach could involve directly associating each dealer with the corresponding car item.

This approach aligns with the document-oriented paradigm and can be considered a form of denormalization. While not inherently problematic, caution is warranted. If the number of dealers is substantial (which is likely to be the case in our example), adopting this method results in oversized items for each car, demanding significant bandwidth. Consequently, this approach is best suited for scenarios where the list of associated objects remains reasonably sized.

It's time to implement these concepts in practice: this point is discussed in detail here.

History

12^th December, 2023: Initial version