Commit f75280b
GITBOOK-2: No subject
remigu authored and gitbook-bot committed Jul 11, 2024
1 parent 530325c

Showing 32 changed files with 1,291 additions and 0 deletions.
Binary file added .gitbook/assets/cluster attach.png
Binary file added .gitbook/assets/compute page.png
Binary file added .gitbook/assets/data icon.png
Binary file added .gitbook/assets/download (2).png
Binary file added .gitbook/assets/lakehouse monitoring overview.png
Binary file added .gitbook/assets/lakehouse monitoring.png
Binary file added .gitbook/assets/marketplace home.png
Binary file added .gitbook/assets/marketplace.png
Binary file added .gitbook/assets/mlruntime dbr dropdown.png
Binary file added .gitbook/assets/object model.png
1 change: 1 addition & 0 deletions .gitbook/assets/workspace
@@ -0,0 +1 @@
<!doctype html><html lang="en"><head><meta charset="UTF-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width,initial-scale=1"/><style>body{margin:0;font-size:13px;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,'Helvetica Neue',Arial,'Noto Sans',sans-serif,'Apple Color Emoji','Segoe UI Emoji','Segoe UI Symbol','Noto Color Emoji';font-variant:tabular-nums;line-height:1.5715}</style><title>Databricks REST API reference</title><base href="/api/"/><script>window.dataLayer=window.dataLayer||[]</script><script>!function(e,t,a,n,g){e[n]=e[n]||[],e[n].push({"gtm.start":(new Date).getTime(),event:"gtm.js"});var m=t.getElementsByTagName(a)[0],r=t.createElement(a);r.async=!0,r.src="https://www.googletagmanager.com/gtm.js?id=GTM-T85FQ33",m.parentNode.insertBefore(r,m)}(window,document,"script","dataLayer")</script><script>!function(e,t,a,n,g){e[n]=e[n]||[],e[n].push({"gtm.start":(new Date).getTime(),event:"gtm.js"});var m=t.getElementsByTagName(a)[0],r=t.createElement(a);r.async=!0,r.src="https://www.googletagmanager.com/gtm.js?id=GTM-TWTKQQ",m.parentNode.insertBefore(r,m)}(window,document,"script","dataLayer")</script><link rel="icon" href="favicon.ico"><script defer="defer" src="static/js/main.54091a17.js"></script><link href="static/css/main.b1386010.css" rel="stylesheet"></head><body><div style="height:100vh" id="root"></div></body></html>
56 changes: 56 additions & 0 deletions README.md
@@ -0,0 +1,56 @@
---
cover: .gitbook/assets/download (2).png
coverY: 0
layout:
cover:
visible: true
size: hero
title:
visible: true
description:
visible: true
tableOfContents:
visible: true
outline:
visible: true
pagination:
visible: true
---

# Databricks on AWS

Databricks documentation provides how-to guidance and reference information for data analysts, data scientists, and data engineers solving problems in analytics and AI. The Databricks Data Intelligence Platform enables data teams to collaborate on data stored in the lakehouse. See [What is a data lakehouse?](https://docs.databricks.com/en/lakehouse/index.html)



### Try Databricks

* [Get a free trial & set up](https://docs.databricks.com/en/getting-started/index.html)
* [Query and visualize data from a notebook](https://docs.databricks.com/en/getting-started/quick-start.html)
* [Import and visualize CSV data from a notebook](https://docs.databricks.com/en/getting-started/import-visualize-data.html)
* [Build a basic ETL pipeline](https://docs.databricks.com/en/getting-started/etl-quick-start.html)
* [Build a simple lakehouse analytics pipeline](https://docs.databricks.com/en/getting-started/lakehouse-e2e.html)
* [Free training](https://docs.databricks.com/en/getting-started/free-training.html)

### What do you want to do?

* [Data science & engineering](https://docs.databricks.com/en/workspace-index.html)
* [Machine learning](https://docs.databricks.com/en/machine-learning/index.html)
* [SQL queries & visualizations](https://docs.databricks.com/en/sql/index.html)

### Manage Databricks

* [Account & workspace administration](https://docs.databricks.com/en/admin/index.html)
* [Security & compliance](https://docs.databricks.com/en/security/index.html)
* [Data governance](https://docs.databricks.com/en/data-governance/index.html)

### Reference guides

* [API reference](https://docs.databricks.com/en/reference/api.html)
* [SQL language reference](https://docs.databricks.com/en/sql/language-manual/index.html)
* [Error handling and error messages](https://docs.databricks.com/en/error-messages/index.html)

### Resources

* [Release notes](https://docs.databricks.com/en/release-notes/index.html)
* [Other resources](https://docs.databricks.com/en/resources/index.html)
32 changes: 32 additions & 0 deletions SUMMARY.md
@@ -0,0 +1,32 @@
# Table of contents

## Get started

* [Databricks on AWS](README.md)
* [What is Databricks?](get-started/what-is-databricks.md)
* [DatabricksIQ-powered features](get-started/databricksiq-powered-features.md)
* [Databricks release notes](get-started/databricks-release-notes.md)

## Advanced

* [Connect to data sources](advanced/connect-to-data-sources.md)
* [Databricks documentation](advanced/databricks-documentation.md)
* [Database objects in Databricks](advanced/database-objects-in-databricks.md)
* [Get started: Account and workspace setup](advanced/get-started-account-and-workspace-setup.md)
* [Compute](advanced/compute.md)

## Learn more

* [Discover data](learn-more/discover-data.md)

***

* [Ingest data into a Databricks lakehouse](ingest-data-into-a-databricks-lakehouse.md)
* [Query data](query-data.md)
* [Transform data](transform-data.md)
* [Introduction to Databricks Lakehouse Monitoring](introduction-to-databricks-lakehouse-monitoring.md)
* [Databricks data engineering](databricks-data-engineering.md)
* [What is Databricks Marketplace?](what-is-databricks-marketplace.md)
* [Share data and AI assets securely using Delta Sharing](share-data-and-ai-assets-securely-using-delta-sharing.md)
* [Generative AI and large language models (LLMs) on Databricks](generative-ai-and-large-language-models-llms-on-databricks.md)
* [AI and Machine Learning on Databricks](ai-and-machine-learning-on-databricks.md)
63 changes: 63 additions & 0 deletions advanced/compute.md
@@ -0,0 +1,63 @@
# Compute

Databricks compute refers to the selection of computing resources available in the Databricks workspace. Users need access to compute to run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.

Users can either connect to existing compute or create new compute if they have the proper permissions.

You can view the compute you have access to using the **Compute** section of the workspace:

![All-purpose compute page in Databricks workspace](<../.gitbook/assets/compute page.png>)

### Types of compute

These are the types of compute available in Databricks:

* **Serverless compute for notebooks (Public Preview)**: On-demand, scalable compute used to execute SQL and Python code in notebooks.
* **Serverless compute for workflows (Public Preview)**: On-demand, scalable compute used to run your Databricks jobs without configuring and deploying infrastructure.
* **All-Purpose compute**: Provisioned compute used to analyze data in notebooks. You can create, terminate, and restart this compute using the UI, CLI, or REST API.
* **Job compute**: Provisioned compute used to run automated jobs. The Databricks job scheduler automatically creates job compute whenever a job is configured to run on new compute. The compute terminates when the job is complete. You _cannot_ restart job compute. See Use Databricks compute with your jobs.
* **Instance pools**: Compute with idle, ready-to-use instances, used to reduce start and autoscaling times. You can create this compute using the UI, CLI, or REST API.
* **Serverless SQL warehouses**: On-demand elastic compute used to run SQL commands on data objects in the SQL editor or interactive notebooks. You can create SQL warehouses using the UI, CLI, or REST API.
* **Classic SQL warehouses**: Used to run SQL commands on data objects in the SQL editor or interactive notebooks. You can create SQL warehouses using the UI, CLI, or REST API.

The articles in this section describe how to work with compute resources using the Databricks UI. For other methods, see What is the Databricks CLI? and the [Databricks REST API reference](../.gitbook/assets/workspace).
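
For instance, here is a minimal sketch of creating all-purpose compute through the Clusters REST API in Python. It assumes `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables hold your workspace URL and a personal access token; the cluster name, runtime version, and node type are illustrative placeholders to adjust for your account.

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

resp = requests.post(
    f"{host}/api/2.1/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_name": "analysis-cluster",    # hypothetical name
        "spark_version": "14.3.x-scala2.12",   # a Databricks Runtime version string
        "node_type_id": "i3.xlarge",           # AWS instance type; adjust as needed
        "num_workers": 2,
        "autotermination_minutes": 60,         # terminate when idle to control cost
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])               # identifier of the new compute
```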

### Databricks Runtime

Databricks Runtime is the set of core components that run on your compute. Databricks Runtime is a configurable setting in all-purpose and jobs compute, but it is selected automatically in SQL warehouses.

Each Databricks Runtime version includes updates that improve the usability, performance, and security of big data analytics. The Databricks Runtime on your compute adds many features, including:

* Delta Lake, a next-generation storage layer built on top of Apache Spark that provides ACID transactions, optimized layouts and indexes, and execution engine improvements for building data pipelines. See What is Delta Lake? A short example follows after this list.
* Installed Java, Scala, Python, and R libraries.
* Ubuntu and its accompanying system libraries.
* GPU libraries for GPU-enabled clusters.
* Databricks services that integrate with other components of the platform, such as notebooks, jobs, and cluster management.

For information about the contents of each runtime version, see the release notes.
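
As a quick illustration of the Delta Lake support listed above, this sketch writes and reads a managed Delta table from a Databricks notebook, where `spark` is predefined. The catalog, schema, and table names are hypothetical.

```python
# A minimal Delta Lake sketch; runs in a Databricks notebook where `spark`
# is predefined. The table name below is hypothetical.
df = spark.createDataFrame(
    [(1, "pipeline-a"), (2, "pipeline-b")],
    schema="id INT, name STRING",
)

# Writing as a managed Delta table is an ACID transaction.
df.write.format("delta").mode("overwrite").saveAsTable("main.default.demo_pipelines")

# Read it back; schema enforcement and time travel come with the format.
spark.table("main.default.demo_pipelines").show()
```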

#### Runtime versioning

Databricks Runtime versions are released on a regular basis:

* **Long Term Support** versions are represented by an **LTS** qualifier (for example, **3.5 LTS**). For each major release, we declare a “canonical” feature version, for which we provide three full years of support. See Databricks support lifecycles for more information.
* **Major** versions are represented by an increment to the version number that precedes the decimal point (the jump from 3.5 to 4.0, for example). They are released when there are major changes, some of which may not be backwards-compatible.
* **Feature** versions are represented by an increment to the version number that follows the decimal point (the jump from 3.4 to 3.5, for example). Each major release includes multiple feature releases. Feature releases are always backward compatible with previous releases within their major release.
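
As a hedged sketch, the listing below queries the Clusters API for the runtime versions available to a workspace; the returned version strings follow the scheme described above. It assumes the same environment variables as the earlier example.

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.get(
    f"{host}/api/2.0/clusters/spark-versions",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
for v in resp.json()["versions"]:
    # e.g. key "14.3.x-scala2.12", name "14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)"
    print(v["key"], "-", v["name"])
```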

### What is serverless compute?

Serverless compute enhances productivity, cost efficiency, and reliability in the following ways:

* **Productivity**: Cloud resources are managed by Databricks, reducing management overhead and providing instant compute to enhance user productivity.
* **Efficiency**: Serverless compute offers rapid start-up and scaling times, minimizing idle time and ensuring you only pay for the compute you use.
* **Reliability**: With serverless compute, capacity handling, security, patching, and upgrades are managed automatically, alleviating concerns about security policies and capacity shortages.

### What are serverless SQL warehouses?

Databricks SQL delivers optimal price and performance with serverless SQL warehouses. Key advantages of serverless warehouses over pro and classic models include:

* **Instant and elastic compute**: Eliminates waiting for infrastructure resources and avoids resource over-provisioning during usage spikes. Intelligent workload management dynamically handles scaling. See SQL warehouse types for more information on intelligent workload management and other serverless features.
* **Minimal management overhead**: Capacity management, patching, upgrades, and performance optimization are all handled by Databricks, simplifying operations and leading to predictable pricing.
* **Lower total cost of ownership (TCO)**: Automatic provisioning and scaling of resources as needed helps avoid over-provisioning and reduces idle times, thus lowering TCO.
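
The sketch below creates a serverless SQL warehouse through the SQL Warehouses REST API, assuming serverless compute is enabled for the workspace and the same environment variables as the earlier examples; the warehouse name and sizing are illustrative.

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/api/2.0/sql/warehouses",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "serverless-demo",           # hypothetical name
        "cluster_size": "2X-Small",
        "warehouse_type": "PRO",             # serverless warehouses use the PRO type
        "enable_serverless_compute": True,
        "auto_stop_mins": 10,                # stop quickly when idle
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["id"])                     # identifier of the new warehouse
```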

***
42 changes: 42 additions & 0 deletions advanced/connect-to-data-sources.md
@@ -0,0 +1,42 @@
# Connect to data sources

This article provides opinionated recommendations for how administrators and other power users can configure connections between Databricks and data sources. If you are trying to determine whether you have access to read data from an external system, start by reviewing the data that you have access to in your workspace. See [Discover data](../learn-more/discover-data.md).

You can connect your Databricks account to data sources such as cloud object storage, relational database management systems, streaming data services, and enterprise platforms such as CRMs. The specific privileges required to configure connections depend on the data source, how permissions in your Databricks workspace are configured, the required permissions for interacting with data in the source, your data governance model, and your preferred method for connecting.

Most methods require elevated privileges on both the data source and the Databricks workspace to configure the necessary permissions to integrate systems. Users without these permissions should request help. See [Request access to data sources](#request-access-to-data-sources).

### Configure connections to external data systems

Databricks recommends several options for configuring connections to external data systems depending on your needs. The following table provides a high-level overview of these options:

| Option | Description |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Lakehouse Federation | Provides read-only access to data in enterprise data systems. Connections are configured through Unity Catalog at the catalog or schema level, syncing multiple tables with a single configuration. See What is Lakehouse Federation? |
| Partner Connect | Leverages technology partner solutions to connect to external data sources and automate ingesting data to the lakehouse. Some solutions also include reverse ETL and direct access to lakehouse data from external systems. See What is Databricks Partner Connect? |
| Drivers | Databricks includes drivers for external data systems in each Databricks Runtime. You can optionally install third-party drivers to access data in other systems. You must configure connections for each table. Some drivers include write access. See Connect to external systems. |
| JDBC | Several included drivers for external systems build upon native JDBC support, and the JDBC option provides extensible options for configuring connections to other systems. You must configure connections for each table. See Query databases using JDBC, and the example below. |
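
As an illustration of the JDBC option in the table above, the following hedged sketch reads a table from a PostgreSQL database in a Databricks notebook, where `spark` and `dbutils` are predefined. The host, table, and secret scope names are hypothetical.

```python
# Hypothetical connection details; credentials live in a Databricks secret scope.
jdbc_url = "jdbc:postgresql://db.example.com:5432/analytics"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.orders")      # hypothetical table
    .option("user", dbutils.secrets.get(scope="jdbc", key="username"))
    .option("password", dbutils.secrets.get(scope="jdbc", key="password"))
    .load()
)
display(df)
```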

### Connect to streaming data sources

Databricks provides optimized connectors for many streaming data systems.

For all streaming data sources, you must generate credentials that provide access and load these credentials into Databricks. Databricks recommends storing credentials using secrets, because you can use secrets for all configuration options and in all access modes.

All data connectors for streaming sources support passing credentials using options when you define streaming queries. See Configure streaming data sources.
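
For example, the hedged sketch below defines a streaming read from Apache Kafka in a notebook, passing the broker address from a secret as recommended above. The secret scope, keys, and topic name are hypothetical.

```python
# Broker address is stored as a Databricks secret; the scope, key, and
# topic names below are hypothetical.
bootstrap = dbutils.secrets.get(scope="kafka", key="bootstrap-servers")

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", bootstrap)
    .option("subscribe", "events")           # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka payloads arrive as binary; cast them for downstream processing.
decoded = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
display(decoded)
```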

### Request access to data sources

In many organizations, most users do not have sufficient privileges on either Databricks or external data sources to configure data connections.

Your organization might have already configured access to a data source using one of the patterns described in the articles linked from this page. If your organization has a well-defined process for requesting access to data, Databricks recommends following that process.

If you’re uncertain how to gain access to a data source, this procedure might help you:

1. Use Catalog Explorer to view the tables and volumes that you can access. See What is Catalog Explorer?.
2. Ask your teammates or managers about the data sources that they can access.
* Most organizations use groups synced from their identity provider (for example, Okta or Microsoft Entra ID, formerly Azure Active Directory) to manage permissions for workspace users. If other members of your team can access data sources that you need access to, have a workspace admin add you to the correct group to grant you access.
* If a particular table, volume, or data source was configured by a co-worker, that individual should have permissions to grant you access to the data.
3. Some organizations configure data access permissions through settings on compute clusters and SQL warehouses.
* Access to data sources can vary by compute.
* You can view the compute creator on the **Compute** tab. Reach out to the creator to ask about data sources that should be accessible.