Yardi is a company that specializes in providing real estate investment and property management software solutions. They offer a range of products and services tailored to the real estate industry, catering to various sectors including residential, commercial, and multifamily property management, investment management, and real estate investment trusts (REITs).
For the past two decades, organizations have used the extract, transform, load (ETL) approach to provide business users with clean and ready-to-analyze data. These ETL solutions move data from application databases to staging areas where data is then profiled, cleansed, cataloged, transformed, aggregated, governed and curated before being loaded into data warehouses for access by end-users.
Cloud-based data warehouse architecture is relatively new when compared to legacy options. This data warehouse architecture means that the actual data warehouses are accessed through the cloud. There are several cloud based data warehouses options, each of which has different architectures for the same benefits of integrating, analyzing, and acting on data from different sources.
The difference between a cloud-based data warehouse approach compared to that of a traditional approach include:
Up-front costs: The different components required for traditional, on-premises data warehouses mandate pricey up-front expenses. Since the components of cloud architecture are accessed through the cloud, these expenses don’t apply.
Ongoing costs: While businesses with on-prem data warehouses must deal with upgrade and maintenance costs, the cloud offers a low, pay-as-you-go model.
Speed: Cloud-based data warehouse architecture is substantially speedier than on-premises options,
partly due to the use of ELT — which is an uncommon process for on-premises counterparts.
Flexibility: Cloud data warehouses are designed to account for the variety of formats and structures found in big data. Traditional relational options are designed simply to integrate similarly structured data.
Scale: The elastic resources of the cloud make it ideal for the scale required of big datasets. Additionally, cloud-based data warehousing options can also scale down as needed, which is difficult to do with other approaches.
Some of the more notable cloud data warehouses in the market include Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure SQL Data Warehouse.
Amazon Redshift architecture overview
Redshift leverages the massively parallel processing architecture in which nodes store data in slices via a columnar format. Each node has their own storage, RAM, and compute power. The main types of nodes are leader and compute nodes; the former intakes queries and assigns them to compute nodes to perform the queries.
Because compute nodes can process data in different slices at the same time, Redshift has robust query performance. Compute nodes return the results to leader nodes, which aggregate them for client-side applications. Users can connect directly to Redshift with an assortment of BI or analytics tools to query the data directly where it lives.
Google BigQuery architecture overview
Google BigQuery relies on a serverless architecture in which different machines are used by the provider to manage resources. Consequently, clients aren’t involved in the resource management process. BigQuery’s architecture supports both traditional data loading and data streaming, the latter of which is designed for ingesting data in real-time.
The main architectural component for this cloud data warehouse is Dremel, a massively parallel query engine capable of reading hundreds of millions of rows in seconds. With this approach, data is actually stored in a file management system called Colossus that puts the data in clusters made up of different nodes. The files are distributed in 64 megabyte amounts in a columnar format. Queries are issued from a tree architecture among the different machines the data is stored in, helping with the quick response times.
Snowflake data warehouse architecture
Snowflake’s architecture is similar to Redshift because it also uses a cluster and node approach. The main architectural difference with Snowflake is that the compute capabilities are separate from the storage, producing a few important advantages. The main benefit? The storage location changes depending on whether or not users require computing at the moment.
If the user doesn’t need computation, the data is tiered (meaning moved to) another storage area that is less costly, since that storage area is not used for data computation. Also, separating storage from compute enables Redshift’s architecture to easily scale up and down as needed, well beyond the capacity of on-premises data warehouses. Additionally, the components for data ingestion and analysis are integrated with the storage component.
Microsoft Azure SQL data warehouse architecture
Similar to Redshift’s architecture, the cloud based data warehouse architecture of Microsoft Azure is based on massively parallel processing. Data is stored in relational databases that, because of this architecture, run fast SQL queries of enormous complexity. Additional tools in the Azure cloud ecosystem allow users to create automated pipelines for transmitting, processing and storing data at petabyte scale.