Workshop: SQL Server Big Data Clusters - Architecture

A Microsoft Course from the SQL Server team

About this Workshop
Business Applications of this Workshop
Technologies used in this Workshop
Before Taking this Workshop
Workshop Details
Related Workshops
Workshop Modules
Next Steps

About this Workshop

Welcome to this Microsoft solutions workshop on the architecture of SQL Server Big Data Clusters. In this workshop, you'll learn how SQL Server Big Data Clusters (BDC) implement large-scale data processing and machine learning, how to select and plan the proper architecture for training machine learning models with Python, R, Java, or SparkML, how to operationalize those models, and how to deploy your intelligent apps side-by-side with their data.

The focus of this workshop is to understand how to deploy an on-premises or local environment of a big data cluster, and understand the components of the big data solution architecture.

You'll start by understanding the concepts of big data analytics, and you'll get an overview of the technologies (such as containers, container orchestration, Spark and HDFS, machine learning, and other technologies) that you will use throughout the workshop. Next, you'll understand the architecture of a BDC. You'll learn how to create external tables over other data sources to unify your data, and how to use Spark to run big queries over your data in HDFS or do data preparation. You'll review a complete solution for an end-to-end scenario, with a focus on how to extrapolate what you have learned to create other solutions for your organization.
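As a rough illustration of the external-table idea mentioned above, the following Python sketch renders the kind of T-SQL `CREATE EXTERNAL TABLE` statement that exposes files in HDFS to SQL Server queries. The table, column, data source, and file format names are hypothetical placeholders, not names defined by this workshop. The sketch also assumes the data source and file format were already created with `CREATE EXTERNAL DATA SOURCE` and `CREATE EXTERNAL FILE FORMAT`.

```python
def external_table_ddl(table, columns, data_source, location, file_format):
    """Render a T-SQL CREATE EXTERNAL TABLE statement as a string.

    All arguments are illustrative; identifiers are not escaped or validated,
    so this is only suitable for trusted, hand-written input.
    """
    # One "[name] TYPE" line per column, comma separated.
    cols = ",\n".join(f"    [{name}] {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE [{table}] (\n{cols}\n)\n"
        f"WITH (\n"
        f"    DATA_SOURCE = {data_source},\n"
        f"    LOCATION = '{location}',\n"
        f"    FILE_FORMAT = {file_format}\n"
        f");"
    )


# Hypothetical example: CSV click-stream files stored under an HDFS directory.
ddl = external_table_ddl(
    table="web_clickstreams_hdfs",
    columns=[("wcs_click_date_sk", "BIGINT"), ("wcs_user_sk", "BIGINT")],
    data_source="SqlStoragePool",
    location="/clickstream_data",
    file_format="csv_file",
)
print(ddl)
```

Once such a table exists, it can be queried and joined with ordinary T-SQL like any other table, which is what lets external tables unify data from several sources.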

This README.MD file explains how the workshop is laid out, what you will learn, and the technologies you will use in this solution.

You can view all of the source files for this workshop, along with other workshops, on this GitHub site. Open this link in a new tab to find out more.

Learning Objectives

In this workshop you'll learn:

  • When to use Big Data technology
  • The components and technologies of Big Data processing
  • Abstractions such as Containers and Container Management as they relate to SQL Server and Big Data
  • Planning and architecting an on-premises, in-cloud, or hybrid big data solution with SQL Server
  • How to install SQL Server big data clusters on-premises and in the Azure Kubernetes Service (AKS)
  • How to work with Apache Spark
  • The Data Science Process to create an end-to-end solution
  • How to work with the tooling for BDC (Azure Data Studio)
  • Monitoring and managing the BDC
  • Security considerations

Starting in SQL Server 2019, big data clusters allow for large-scale, near real-time processing of data over the HDFS file system and other data sources. They also leverage the Apache Spark framework, which is integrated into a single environment for management, monitoring, and security. This means that organizations can implement everything from queries to analysis to Machine Learning and Artificial Intelligence within SQL Server, over large-scale, heterogeneous data. SQL Server big data clusters can be implemented fully on-premises, in the cloud using a Kubernetes service such as Azure's AKS, or in a hybrid fashion. This allows for full, partial, and mixed security and control as desired.

The goal of this workshop is to train the team tasked with architecting and implementing SQL Server big data clusters in the planning, creation, and delivery of a system designed to be used for large-scale data analytics. Since there are multiple technologies and concepts within this solution, the workshop uses multiple types of exercises to prepare the students for this implementation.

The concepts and skills taught in this workshop form the starting points for:

  • Data Professionals and DevOps teams, to implement and operate a SQL Server big data cluster system.
  • Solution Architects and Developers, to understand how to put together an end-to-end solution.
  • Data Scientists, to understand the environment used to analyze and solve specific predictive problems.

Business Applications of this Workshop

Businesses require near real-time insights from ever-larger sets of data from a variety of sources. Large-scale data ingestion requires scale-out storage and processing in ways that allow fast response times. In addition to simply querying this data, organizations want full analysis and even predictive capabilities over their data.

Some industry examples of big data processing are in Retail (Demand Prediction, Market-Basket Analysis), Finance (Fraud detection, customer segmentation), Healthcare (Fiscal control analytics, Disease Prevention prediction and classification, Clinical Trials optimization), Public Sector (Revenue prediction, Education effectiveness analysis), Manufacturing (Predictive Maintenance, Anomaly Detection) and Agriculture (Food Safety analysis, Crop forecasting) to name just a few.

Technologies used in this Workshop

The solution includes the following technologies - although you are not limited to these, they form the basis of the workshop. At the end of the workshop you will learn how to extrapolate these components into other solutions. You will cover these at an overview level, with references to much deeper training provided.

| Technology | Description |
|---|---|
| Linux | Operating system used in Containers and Container Orchestration |
| Containers | Encapsulation level for the SQL Server big data cluster architecture |
| Container Orchestration (such as Kubernetes) | Management, control plane for Containers |
| Microsoft Azure | Cloud environment for services |
| Azure Kubernetes Service (AKS) | Kubernetes as a Service |
| Apache HDFS | Scale-out storage subsystem |
| Apache Knox | The Knox Gateway provides a single access point for all REST interactions, used for security |
| Apache Livy | Job submission system for Apache Spark |
| Apache Spark | In-memory large-scale, scale-out data processing architecture used by SQL Server |
| Python, R, Java, SparkML | ML/AI programming languages used for Machine Learning and AI Model creation |
| Azure Data Studio | Tooling for SQL Server, HDFS, Big Data cluster management, T-SQL, R, Python, and SparkML languages |
| SQL Server Machine Learning Services | R, Python and Java extensions for SQL Server |
| Microsoft Team Data Science Process (TDSP) | Project, Development, Control and Management framework |
| Monitoring and Management | Dashboards, logs, APIs and other constructs to manage and monitor the solution |
| Security | RBAC, Keys, Secrets, VNETs and Compliance for the solution |

Condensed Lab: If you have already completed the pre-requisites for this course and are familiar with the technologies listed above, you can jump to a Jupyter Notebooks-based tutorial located here. Load these with Azure Data Studio, starting with bdc_tutorial_00.ipynb.

Before Taking this Workshop

You'll need a local system on which you are able to install software. The workshop demonstrations and examples all use Microsoft Windows as the operating system. Optionally, you can install the software on a Microsoft Azure Virtual Machine (VM) and work with the solution there.

You must have a Microsoft Azure account with the ability to create assets, specifically the Azure Kubernetes Service (AKS).

This workshop expects that you understand data structures and know how to work with SQL Server and computer networks. It does not expect you to have any prior data science knowledge, but a basic knowledge of statistics and data science is helpful in the Data Science sections. Knowledge of SQL Server, Azure Data and AI services, Python, and Jupyter Notebooks is recommended. AI techniques are implemented in Python packages. Solution templates are implemented using Azure services, development tools, and SDKs. You should have a basic understanding of working with the Microsoft Azure Platform.

If you are new to these, here are a few references you can complete prior to class:

Setup

A full prerequisites document is located here. These instructions should be completed before the workshop starts, since you will not have time to cover them in class. Remember to turn off any Virtual Machines from the Azure Portal when not taking the class so that you do not incur charges (shutting down the machine from inside the VM itself is not sufficient).
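A VM that is shut down from inside its operating system stays allocated and continues to accrue compute charges, so it must be deallocated instead. Besides the Azure Portal, one way to do this is the Azure CLI command `az vm deallocate`. The sketch below only builds that command in Python. The resource group and VM names are hypothetical, and actually running the command assumes the Azure CLI is installed and you have already signed in with `az login`.

```python
import shlex


def deallocate_command(resource_group, vm_name):
    """Build the Azure CLI call that stops and deallocates a VM."""
    return [
        "az", "vm", "deallocate",
        "--resource-group", resource_group,
        "--name", vm_name,
    ]


# Hypothetical names; substitute the ones you chose during setup.
cmd = deallocate_command("bdc-workshop-rg", "bdc-workshop-vm")
print(shlex.join(cmd))

# To execute it for real (assumes the Azure CLI and a prior `az login`):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string keeps it safe to pass to `subprocess.run` without shell quoting issues.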

Workshop Details

This workshop uses Azure Data Studio, Microsoft Azure AKS, and SQL Server (2019 and higher) with a focus on architecture and implementation.

Primary Audience: System Architects and Data Professionals tasked with implementing Big Data, Machine Learning and AI solutions
Secondary Audience: Security Architects, Developers, and Data Scientists
Level: 300
Type: In-Person
Length: 8-9 hours

Workshop Modules

This is a modular workshop, and in each section, you'll learn concepts, technologies and processes to help you complete the solution.

| Module | Topics |
|---|---|
| 01 - The Big Data Landscape | Overview of the workshop, problem space, solution options and architectures |
| 02 - SQL Server BDC Components | Abstraction levels, frameworks, architectures and components within SQL Server big data clusters |
| 03 - Planning, Installation and Configuration | Mapping the requirements to the architecture design, constraints, and diagrams |
| 04 - Operationalization | Connecting applications to the solution; DDL, DML, DCL |
| 05 - Management and Monitoring | Tools and processes to manage the big data cluster |
| 06 - Security | Access and Authentication to the various levels of the solution |

Next Steps

Next, continue to the prerequisites.