Characteristics of a Modern Data Platform

Definition

A modern data platform is a set of best practices that allows you to get the most out of your data with the least amount of platform management.

Scalable

A modern data platform is scalable both with its audience and with its data.

It needs to support a variety of use cases, users, and usage patterns.

This means having the ability to run Extract-Transform-Load (ETL) jobs and transformations for data engineering, queries for data analysis and experimentation, and Machine Learning (ML) for data science.

A modern data platform also needs to be scalable with its data.

It must be able to ingest a high volume of data from a variety of data sources at high velocity.

A cloud model is necessary for a modern data platform to operate at the scale needed for current business use cases.

Self-Service (Data Democratization)

A modern data platform is self-service.

It supports the idea of data democratization, enabling everyone in an organization to comfortably work with data.

Data Democratization: A few key ideas lead to data democratization.

Data discoverability: easily finding the data you’re looking for.

Data federation: combining data from multiple places into one data model.

Data virtualization: retrieving and manipulating data without knowing its format or where it’s stored. One specific benefit of data virtualization is that it allows analytics at higher speed and lower cost.

These three practices democratize data, and data democratization enables Artificial Intelligence (AI) and Machine Learning (ML) analytics for users with minimal support from the IT team.
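To make federation and virtualization concrete, here is a minimal sketch using DuckDB as one possible tool; the file names and columns are hypothetical.

```python
# A minimal sketch of data federation/virtualization using DuckDB.
# The file paths and column names are hypothetical.
import duckdb

# Query a Parquet file and a CSV file together without loading either
# into a database first -- the storage format is abstracted away.
result = duckdb.sql("""
    SELECT o.customer_id, o.amount, c.region
    FROM 'orders.parquet' AS o
    JOIN 'customers.csv' AS c
      ON o.customer_id = c.customer_id
""").df()

print(result.head())
```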

It should also be noted that while data and analytics are accessible to more users across the business, a modern data platform maintains the same level of security and data governance as a traditional data platform.

IT can assign roles to users that determine who can see what data and how much control they have over editing and updating the data sets.
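As a toy illustration only (not any particular platform’s API), role-based access could be modeled like this:

```python
# A toy sketch of role-based access control; the roles, users, and
# permissions are hypothetical, not a real platform's API.
ROLE_PERMISSIONS = {
    "analyst":  {"read"},
    "engineer": {"read", "write"},
    "admin":    {"read", "write", "grant"},
}

USER_ROLES = {"ana": "analyst", "eve": "engineer"}

def can(user: str, action: str) -> bool:
    """Check whether a user's role permits an action on a data set."""
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())

print(can("ana", "read"))   # True
print(can("ana", "write"))  # False
```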

Agile Data Management

Agile data management means having high data availability, with most of the data stored in a data lake or data warehouse.

These structures separate storage from compute, making it relatively inexpensive to store a high volume of data.

Another part of agile data management is elastic, auto-scaling consumption: the ability to use only the resources that are necessary.

This allows your systems to handle a variable workload and is complemented by a pay-as-you-go model to reduce cost.
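A back-of-the-envelope comparison, with made-up prices and workload numbers, shows why paying only for what you use is cheaper for variable workloads:

```python
# Back-of-the-envelope cost comparison; the hourly rate and workload
# profile are made up purely for illustration.
hours_in_month = 730
peak_units, avg_units = 10, 2     # capacity needed at peak vs. on average
rate_per_unit_hour = 0.50         # hypothetical price per unit-hour

# Fixed provisioning must be sized for the peak all month long.
fixed_cost = peak_units * hours_in_month * rate_per_unit_hour

# Auto-scaling pays only for what is actually used.
elastic_cost = avg_units * hours_in_month * rate_per_unit_hour

print(f"fixed:   ${fixed_cost:,.2f}")    # $3,650.00
print(f"elastic: ${elastic_cost:,.2f}")  # $730.00
```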

Use of the cloud is key in agile data management.

The cloud model itself already reduces management by removing the responsibility of hardware maintenance and updates.

It also introduces the opportunity to use Platform, Infrastructure, and Software as a Service (PaaS, IaaS, and SaaS) to reduce management responsibilities.

 

Architecture of a Traditional Data Platform

Source

Where the data starts.

Staging Layer

In this architecture, data moves from its source to a staging layer, where it is streamlined and goes through quality checks.
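As a minimal sketch of what those staging-layer checks might look like, assuming pandas and hypothetical file and column names:

```python
# A minimal sketch of staging-layer quality checks using pandas.
# The source file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("staging/orders.csv")

# Streamline: normalize column names and drop exact duplicates.
df.columns = [c.strip().lower() for c in df.columns]
df = df.drop_duplicates()

# Quality checks: required fields present, no nulls in key columns,
# and values within an expected range.
assert {"order_id", "amount"}.issubset(df.columns), "missing required columns"
assert df["order_id"].notna().all(), "null order IDs in staging data"
assert (df["amount"] >= 0).all(), "negative order amounts"
```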

Data Warehouse

It then moves into a data warehouse, which is the central repository for integrated data.

Mart Layer

The data then goes to a mart layer, which stores a subset of data from the data warehouse curated for use by a specific business function or user group.

Multi-Dimensional “Cubes”

Then the data is modeled and put into multi-dimensional cubes.

Reporting

The multi-dimensional data models are then used for reporting.

Disadvantages of a Traditional Data Platform

Hosted on Premise

The data is hosted on premises, is limited to structured data, and has some connectivity limitations.

Limited Tooling

For analysis, there is SQL querying and limited Extract-Transform-Load (ETL) tooling.

Analytics Occur Offline

Most, if not all, of the analytics occur offline in tools like Statistical Package for the Social Sciences (SPSS) or Statistical Analysis System (SAS).

Limited Automation

The deployment processes are rigid and very manual.

Expensive to Manage

In terms of managing this architecture, there is significant lead time for infrastructure setup, a database administrator needs to manage the environment, and, of course, there is a cost associated with data storage.

Architecture of a Modern Data Platform

Source

It also starts with a data source layer, but in this version the data is hosted in the cloud.

The data can be structured or unstructured and this architecture can handle a variety of data sources.

Data Lake

One path the data can take is directly into the data lake, where it moves from a raw zone to a processed zone to an output zone as it is cleaned, transformed, and tested for quality.
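A simplified sketch of that raw-to-processed-to-output flow, with hypothetical zone paths, columns, and rules:

```python
# A simplified sketch of the raw -> processed -> output flow in a data lake.
# Zone paths, columns, and rules are hypothetical.
import pandas as pd

raw = pd.read_json("lake/raw/events.json", lines=True)

# Clean and transform: drop malformed rows, derive a revenue column.
processed = raw.dropna(subset=["user_id", "price", "quantity"]).copy()
processed["revenue"] = processed["price"] * processed["quantity"]
processed.to_parquet("lake/processed/events.parquet")

# Test for quality before promoting to the output zone.
assert (processed["revenue"] >= 0).all(), "negative revenue detected"
output = processed.groupby("user_id", as_index=False)["revenue"].sum()
output.to_parquet("lake/output/revenue_by_user.parquet")
```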

Automated Analytics and Machine Learning models

The other path the data can take from the source is directly into automated analytics and machine learning models. There is also an experimentation stage, or sandbox, to use before scaling up analysis.

This layer enables innovation and insights without incurring the cost of analytics at full scale.

Relationally Structured

The analyzed data can be fed back into the data lake or can go directly to the next layer, the relational layer, where relationships are formed to create a data model.

Data Visualization

The data model is then used for data visualization.

Sometimes the relational layer can be skipped, and the data goes from the data lake to the reporting tool.

Governance

Throughout this whole process, the data is governed using practices like DevOps, data cataloging, and infrastructure as code.
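As one illustration, a data catalog entry can be kept as structured metadata; the fields below are generic examples, not a specific catalog product’s schema:

```python
# A generic sketch of a data-catalog entry kept as structured metadata.
# Field names are illustrative, not a specific catalog product's schema.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    owner: str
    location: str          # where the data lives
    schema_version: str    # which version of the schema applies
    tags: list = field(default_factory=list)

entry = CatalogEntry(
    name="revenue_by_user",
    owner="data-engineering",
    location="lake/output/revenue_by_user.parquet",
    schema_version="1.2",
    tags=["finance", "pii:none"],
)
print(entry)
```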

 

Machine Learning Operations within the Architecture

Deployment and Testing Each Step

Each of these steps goes through testing and deployment.

We practice code management, automated testing, and continuous integration and deployment (CI/CD) to optimize the solution’s performance and accuracy.
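For example, an automated test for a single transformation step might look like this (pytest-style, with a hypothetical add_revenue transformation):

```python
# A pytest-style sketch of an automated test for a transformation step.
# The add_revenue transformation is a hypothetical example.
import pandas as pd

def add_revenue(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["revenue"] = out["price"] * out["quantity"]
    return out

def test_add_revenue():
    df = pd.DataFrame({"price": [10.0, 2.5], "quantity": [3, 4]})
    result = add_revenue(df)
    assert list(result["revenue"]) == [30.0, 10.0]
    # The original frame must not be mutated by the transformation.
    assert "revenue" not in df.columns
```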

Data Ingestion and Processing

Like the larger architecture, it starts with data ingestion and processing, where we bring in the data and prepare it for analysis.

 

Orchestration

We then move to the orchestration step.

A Machine Learning (ML) solution typically has more than one step, with multiple models making up one solution.

An orchestration tool therefore takes care of the multiple steps, passes data between them, and combines them for tracking.
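A bare-bones sketch of what an orchestrator does (run steps in order, pass data between them, and record the results for tracking), with hypothetical step names; real tools such as Airflow add scheduling, retries, and much more:

```python
# A bare-bones sketch of orchestration: run steps in order, pass data
# between them, and keep a log for tracking. Step names are hypothetical.
def ingest():
    return {"rows": 1000}

def train(data):
    return {"model": "v1", "trained_on": data["rows"]}

def score(model):
    return {"predictions": 1000, "model": model["model"]}

def run_pipeline(steps):
    result, log = None, []
    for step in steps:
        result = step(result) if result is not None else step()
        log.append((step.__name__, result))
    return result, log

final, log = run_pipeline([ingest, train, score])
for name, out in log:
    print(name, "->", out)
```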

Training

Next comes the training step.

We build the model based on historical data and update it periodically to take account of new data.

This ensures the model’s accuracy as behaviors change.
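A minimal training sketch using scikit-learn, with synthetic data and a placeholder model choice:

```python
# A minimal training sketch using scikit-learn; the data is synthetic
# and the model choice is a placeholder.
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_hist = rng.normal(size=(500, 3))             # historical features
y_hist = (X_hist.sum(axis=1) > 0).astype(int)  # historical labels

model = LogisticRegression().fit(X_hist, y_hist)
joblib.dump(model, "model_v1.joblib")          # saved for the scoring step

# Periodically, retrain on a window that includes newer data so the
# model keeps up as behavior changes.
```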

Scoring

Next, we score the model.

In this step we load the trained model and use it to make predictions about new and future cases.

This can be run for batches of new data, for individual examples as they come in, or for what-if scenarios.
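Continuing the training sketch above, scoring loads the saved model and applies it to new cases:

```python
# Continuing the training sketch: load the saved model and score new data.
import joblib
import numpy as np

model = joblib.load("model_v1.joblib")

# Batch scoring for a set of new records.
X_new = np.random.default_rng(1).normal(size=(10, 3))
batch_predictions = model.predict(X_new)

# Single-example scoring as records arrive, or a "what-if" scenario
# built by hand to explore a hypothetical case.
what_if = np.array([[0.5, -1.0, 2.0]])
print(model.predict_proba(what_if))
```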

Tracking, Monitoring, Logging

Then we have the tracking, monitoring, and logging stage, where we manage the model.

We ask questions like: Which models have been trained? Was the training successful? Which version of the data were they trained on? What is the estimated accuracy based on new data, and is the model maintaining that accuracy?
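Those questions map naturally onto logged metadata. A sketch with illustrative fields (tools like MLflow formalize this kind of tracking):

```python
# A sketch of the metadata a tracking/monitoring stage might log.
# The fields and values are illustrative.
import json, logging

logging.basicConfig(level=logging.INFO)

run_record = {
    "model": "churn_classifier",
    "version": "v1",
    "training_succeeded": True,
    "data_version": "2024-06-01",
    "estimated_accuracy": 0.91,    # accuracy at training time
    "accuracy_on_new_data": 0.88,  # monitored after deployment
}

logging.info("model run: %s", json.dumps(run_record))

# Alert if accuracy on new data drifts too far from the estimate.
if run_record["estimated_accuracy"] - run_record["accuracy_on_new_data"] > 0.05:
    logging.warning("model accuracy is drifting; consider retraining")
```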

Output and Presentation

Lastly, there is the output and presentation of the model.

Outputs need to be presented to users in a wider data context so they can see the whole picture and confidently make business decisions based on the best view of the data.

The 10 V’s of Big Data

Summary

A modern data platform facilitates and manages what are called the 10 V’s of big data.

Additionally, a modern data platform automates many of these processes, which increases the efficiency of your work and minimizes the amount of platform management you need to do.

The modern data platform is built to ingest whatever data you need, ensure its quality and security, and deliver powerful insights so that you can make data-driven decisions for your business.

Volume, Velocity, and Variety (Most Significant)

The three most significant of these are volume, velocity, and variety.

A cloud-based platform allows more data from a diverse set of data sources to be ingested at greater speed.

Variability, Veracity, and Validity (Credibility)

Then we have variability, veracity, and validity, which all refer to the credibility of the data.

The auto-scalability of a cloud platform easily handles variations in workload while reducing overall operating costs.

It can also integrate seamlessly with data governance tools that track data modification, and it has dedicated layers for cleaning the data through anomaly detection, data quality checks, or other data governance practices.

Vulnerability (Security)

As mentioned earlier, a modern data platform maintains the same level of security as traditional data platforms.

Volatility (Availability)

In terms of volatility, a modern data platform can establish rules for which data is available and make sure that the information can be retrieved quickly when needed.

Visualization (Reporting)

Next comes visualization. A modern data platform integrates effortlessly with data visualization tools like Power BI (Business Intelligence) and Tableau.

These tools have a range of visualizations that provide deeper insights than traditional graphs.

Value (Prediction)

A lot of the value derived from data comes from predictive analytics using Artificial Intelligence (AI) and Machine Learning (ML).

An established cloud-based platform is necessary for these analytics to be run.

Get in touch

If you’d like to learn more about how we can help you leverage the latest technologies to make timely, data-driven business decisions, we’d love to hear from you.