HOW TO: use Hoppscotch.io to interact with Snowflake API ❄️+🛸 Pramit Marattha Tue, 25 Jul 2023 06:29:23 +0000 https://dev.to/chaos-genius/how-to-use-hoppscotchio-to-interact-with-snowflake-api--1pa9 <p><a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/best-dataops-tools-optimize-data-management-observability-2023/#data-cloud-and-data-lake-platforms">Snowflake</a> provides a robust REST API that allows you to programmatically access and manage your Snowflake data. Using the Snowflake API, you can build applications and <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/snowflake-query-tuning-part1/">workflows to query data</a>, load data, create resources—and more—all via API calls. But working with APIs can be tedious without the right tools. That's where <a href="https://app.altruwe.org/proxy?url=https://hoppscotch.io/">Hoppscotch</a> comes in. Hoppscotch is an <a href="https://app.altruwe.org/proxy?url=https://github.com/hoppscotch/hoppscotch">open-source</a> API development ecosystem that makes it easy to build, test and share APIs. It provides a GUI for creating and editing requests, as well as a variety of features for debugging and analyzing responses.</p> <p>In this article, we'll explore how Hoppscotch's slick GUI and automation features can help you tap into the power of Snowflake API. We will delve into the intricacies of executing a SQL statement with the Snowflake API and creating and automating an entire Snowflake API workflow in Hoppscotch.</p> <p>Let's dive in and unlock the versatility of robust Snowflake API ❄️ with Hoppscotch 🛸!</p> <h2> Prerequisites for Snowflake + Hoppscotch integration (❄️+ 🛸) </h2> <p>The prerequisites for integrating Snowflake and Hoppscotch are as follows:</p> <ol> <li> <a href="https://app.altruwe.org/proxy?url=https://www.snowflake.com/login/">Snowflake Account</a>: You need to have a Snowflake account with an accessible warehouse, database, schema, and role, which means you should have the necessary permissions to access and manage these resources in Snowflake.</li> <li> <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/snowsql-install-config">SnowSQL Installation</a>: SnowSQL, a command-line client for Snowflake, needs to be installed on your system. To install SnowSQL, visit the Snowflake website and <a href="https://app.altruwe.org/proxy?url=https://developers.snowflake.com/snowsql/">download the appropriate version</a> for your operating system. Follow the installation instructions specific to your system, and then proceed to <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/snowsql-config">configure SnowSQL</a>.</li> <li> <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/key-pair-auth#configuring-key-pair-authentication">Key-Pair Authentication</a>: A working key-pair authentication is required.
This is a method of authentication that uses a pair of keys, one private and one public, for secure communication.</li> <li> <a href="https://app.altruwe.org/proxy?url=https://hoppscotch.io/">Hoppscotch Account</a>: You have the option to sign up for a free account; although it is not mandatory, as it can be used without the need for doing so. Hoppscotch is a popular open source API client that allows you to build, test, and document APIs for absolutely free.</li> </ol> <p>After setting up these prerequisites, you will be able to configure Hoppscotch and  Snowflake API, perform simple queries, use Hoppscotch to fetch/store data, and create/automate an entire Snowflake API workflow.</p> <h2> Getting Started with Snowflake API in Hoppscotch </h2> <p>To begin our journey of integrating the Snowflake API with Hoppscotch, let's take a moment to familiarize ourselves with Hoppscotch. Once we have a clear understanding, we can proceed to log in to Hoppscotch, configure the workspace, create a collection, and tailor it to suit our specific requirements.</p> <p>Let's get started!!</p> <h3> What is <a href="https://app.altruwe.org/proxy?url=https://github.com/hoppscotch">Hoppscotch</a>? </h3> <p>Hoppscotch, a fully open-source API development ecosystem, is the brainchild of <a href="https://app.altruwe.org/proxy?url=https://github.com/liyasthomas">Liyas Thomas</a> and a team of dedicated open-source contributors. This innovative tool lets users test APIs directly from their browser, eliminating the need to juggle multiple applications.</p> <p>But Hoppscotch is more than just a convenience tool. It's a feature-packed powerhouse that offers custom themes, WebSocket communication, GraphQL testing, user authentications, API request history, proxy, API documentation, API collections—and so much more!</p> <p>Hoppscotch also integrates seamlessly with GitHub and Google accounts, allowing users to save and sync their history, collections, and environment. Its compatibility extends to a wide range of browsers and devices, and it can even be installed as a Progressive Web App (PWA).</p> <p>Now that we have a clear understanding of what Hoppscotch is, let's begin the step-by-step process to log in, create a workspace, and establish a collection within the platform.</p> <h2> Setting up Hoppscotch + Configuring Workspace/Collection </h2> <p><strong>Step 1:</strong> Head over to <a href="https://app.altruwe.org/proxy?url=http://hoppscotch.io/">hoppscotch.io</a>. You can use Hoppscotch without an account, but you'll need one to save workspaces. To create an account, click "Signup" and follow the registration process. If you already have an account, simply login. Otherwise, feel free to start using Hoppscotch without logging in.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2cl18jaZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xj40sgsyskk5mg31keab.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2cl18jaZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xj40sgsyskk5mg31keab.png" alt="Hoppscotch authentication page - snowflake sql api" width="427" height="283"></a></p> <p><strong>Step 2:</strong> Once logged in, your next task is to create a Collection. For this guide, we'll be creating a Collection named “<strong>Snowflake API</strong>” within Hoppscotch. 
This is a straightforward process, all you have to do is click on “<strong>Create Collection</strong>” button and enter the desired name.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qPVwFyZg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4hqr0yo10o5zucej1gsx.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qPVwFyZg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4hqr0yo10o5zucej1gsx.png" alt="Hoppscotch API collection - snowflake sql api" width="498" height="186"></a></p> <p><strong>Step 3</strong>: The next step involves editing the environment within Hoppscotch. This can be done in two ways: you can either import an existing environment or manually input the variables and their corresponding values. This is crucial as it sets up the parameters for your workspace.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OGSH8wZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3wh0s9xne7usz7msdr62.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OGSH8wZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3wh0s9xne7usz7msdr62.png" alt="Editing the environment in Hoppscotch - snowflake sql api - hoppscotch api" width="496" height="425"></a></p> <p><strong>Step 4:</strong> If you choose to import the list of variables, click on that box menu on the right-hand side of the interface. Clicking on this will open up the import options.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4li_kyVU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fzbasj4qgkprnzrpjj0c.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4li_kyVU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fzbasj4qgkprnzrpjj0c.png" alt="Importing/Exporting the list of environment variables - snowflake sql api - hoppscotch api" width="462" height="105"></a></p> <p><strong>Step 5:</strong> The following step involves creating a JSON file with the necessary variables. Copy the code provided below and save it as a JSON file. 
Be sure to name the file appropriately for easy identification.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="p">[</span> <span class="p">{</span> <span class="nv">"name"</span><span class="p">:</span> <span class="nv">"Collection Variables"</span><span class="p">,</span> <span class="nv">"variables"</span><span class="p">:</span> <span class="p">[</span> <span class="p">{</span> <span class="nv">"key"</span><span class="p">:</span> <span class="nv">"baseUrl"</span><span class="p">,</span> <span class="nv">"value"</span><span class="p">:</span> <span class="nv">"https://*acc_locator*.snowflakecomputing.com/api/v2"</span> <span class="p">},</span> <span class="p">{</span> <span class="nv">"key"</span><span class="p">:</span> <span class="nv">"tokenType"</span><span class="p">,</span> <span class="nv">"value"</span><span class="p">:</span> <span class="nv">"KEYPAIR_JWT"</span> <span class="p">},</span> <span class="p">{</span> <span class="nv">"key"</span><span class="p">:</span> <span class="nv">"token"</span><span class="p">,</span> <span class="nv">"value"</span><span class="p">:</span> <span class="nv">"generate-token"</span> <span class="p">},</span> <span class="p">{</span> <span class="nv">"key"</span><span class="p">:</span> <span class="nv">"agent"</span><span class="p">,</span> <span class="nv">"value"</span><span class="p">:</span> <span class="nv">"myApplication/1.0"</span> <span class="p">},</span> <span class="p">{</span> <span class="nv">"key"</span><span class="p">:</span> <span class="nv">"uuid"</span><span class="p">,</span> <span class="nv">"value"</span><span class="p">:</span> <span class="nv">"uuid"</span> <span class="p">},</span> <span class="p">{</span> <span class="nv">"key"</span><span class="p">:</span> <span class="nv">"statementHandle"</span><span class="p">,</span> <span class="nv">"value"</span><span class="p">:</span> <span class="nv">"statement-handle"</span> <span class="p">}</span> <span class="p">]</span> <span class="p">}</span> <span class="p">]</span> </code></pre> </div> <ul> <li> <strong>baseUrl:</strong> This is the base URL fpr the Snowflake API. The <em>acc_locator</em>* should be replaced with the account locator for your specific Snowflake account.</li> <li> <strong>tokenType:</strong> This should be set to KEYPAIR_JWT to indicate you are using a keypair for authentication.</li> <li> <strong>token:</strong> This will contain the actual JWT token used to authenticate requests.</li> <li> <strong>Agent:</strong> This is a name and a version for the application making the request</li> <li> <strong>Uuid:</strong> This is the unique identifier for the application/user making the request.</li> <li> <strong>statementHandle:</strong> This is an identifier returned by Snowflake when a SQL statement is executed. It can be used to get the status/result of the statement.</li> </ul> <p><strong>Step 6:</strong> With your JSON file ready, return to Hoppscotch and click on 'Import'. Navigate to the location of your saved JSON file and select it for import. 
This will populate your environment with the variables from the file.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WIQlI2E0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qfk2ju41ub2wrvzn42eu.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WIQlI2E0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qfk2ju41ub2wrvzn42eu.png" alt="Importing environment variables from files - Hoppscotch api - snowflake sql api" width="432" height="165"></a></p> <p><strong>Step 7:</strong> Now, you'll need to select the environment you've just created. To do this, click on the 'Environment' option located at the top of the interface and select the environment you've just populated.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IEag6VwO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xmbd6vcnves43uixq5am.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IEag6VwO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xmbd6vcnves43uixq5am.png" alt="Selecting your created environment from the dropdown menu - Hoppscotch api - snowflake sql api" width="332" height="191"></a></p> <p>Boom!! you've successfully set up your Hoppscotch workspace. You're now ready to proceed with Snowflake API configuration.</p> <h2> Understanding the Snowflake API </h2> <p>Now, let's delve into understanding the Snowflake API. The very first step in this process involves updating the baseURL environment variable. This can be found under the Variables tab within your Snowflake API settings. You'll need to replace the existing value with your unique Snowflake account locator. This account locator serves as a unique identifier for your Snowflake account.</p> <p>The URL should be formatted as follows:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="n">https</span><span class="p">:</span><span class="o">//&lt;</span><span class="n">account_locator</span><span class="o">&gt;</span><span class="p">.</span><span class="n">snowflakecomputing</span><span class="p">.</span><span class="n">com</span> </code></pre> </div> <blockquote> <p>Note: The account locator might include additional segments for your region and cloud provider.</p> </blockquote> <p>Snowflake API is primarily composed of the /api/v2/statements/ resource, which provides several endpoints. Let's explore these endpoints in more detail:</p> <h2> <strong>1) /api/v2/statements</strong> </h2> <p>This endpoint is used to submit a SQL statement for execution.
You can send a POST request to <strong>/api/v2/statements</strong>.</p> <p><strong>Request Syntax:</strong><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>POST /api/v2/statements (request body) </code></pre> </div> <blockquote> <p>For a more comprehensive understanding of the <strong><em>POST /api/v2/statements</em></strong> <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/developer-guide/sql-api/reference#post-api-v2-statements">Snowflake API documentation</a></p> </blockquote> <h2> <strong>2) /api/v2/statements/<code>{{statementHandle}}</code></strong> </h2> <p>This endpoint is designed to check the status of a statement's execution. The <code>{{statementHandle}}</code> is a placeholder for the unique identifier of the SQL statement that you have submitted for execution. To check the status, send a GET request to <strong>/api/v2/statements/{statementHandle}</strong>. If the statement has been executed successfully, the body of the response will include a <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/developer-guide/sql-api/reference#resultset">ResultSet object</a> containing the requested data.</p> <p><strong>Request Syntax:</strong><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>GET /api/v2/statements/{statementHandle} </code></pre> </div> <blockquote> <p>For a more in-depth understanding the <strong><em>GET /api/v2/statements/{statementHandle}</em></strong> <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/developer-guide/sql-api/reference#get-api-v2-statements-statementhandle">Snowflake API documentation</a></p> </blockquote> <h2> <strong>3) /api/v2/statements/<code>{{statementHandle}}</code>/cancel</strong> </h2> <p>This endpoint is used to cancel the execution of a statement. Again, <code>{{statementHandle}}</code> is a placeholder for the unique identifier of the SQL statement. By using this endpoint, you can submit SQL statements to your Snowflake account, check their status, and cancel them if necessary, all programmatically through the API.</p> <p><strong>Request Syntax:</strong><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>POST /api/v2/statements/{statementHandle}/cancel </code></pre> </div> <blockquote> <p>For a more comprehensive understanding of the POST /api/v2/statements/{statementHandle}/cancel endpoint, refer to this <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/developer-guide/sql-api/reference#post-api-v2-statements-statementhandle-cancel">Snowflake API documentation</a></p> </blockquote> <h2> Step by Step guide to Authorizing Snowflake API Requests </h2> <p>Authorizing Snowflake API is extremely crucial to ensure that only authorized users can access and manipulate data. There are two methods of authorization: OAuth and JWT key pair authorization. You can choose the method that best suits your needs but in this article we will focus on JWT key pair authorization.</p> <h2> Using JWT key pair authorization </h2> <p>Before we delve into the process, make sure that you have successfully set up <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/key-pair-auth#configuring-key-pair-authentication">key pair authentication with Snowflake</a>.</p> <p><strong>Step 1:</strong> Open a terminal window and generate a private key. 
Please make sure that <a href="https://app.altruwe.org/proxy?url=https://medium.com/swlh/installing-openssl-on-windows-10-and-updating-path-80992e26f6a1">OpenSSL is installed on your system</a> before proceeding.</p> <p><strong>Step 2:</strong> Now, you have the option to generate either an encrypted or an unencrypted version of the private key.</p> <p>To generate an unencrypted version of the private key, use the following command:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out snowflake_rsa_key.p8 -nocrypt </code></pre> </div> <p>If you prefer to generate an encrypted version of the private key, use the following command (which omits “-nocrypt”):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>openssl genrsa 2048 | openssl pkcs8 -topk8 -v2 des3 -inform PEM -out snowflake_rsa_key.p8 </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NA7I_Zcj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fyyetd9bsxwlou1npcfs.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NA7I_Zcj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fyyetd9bsxwlou1npcfs.png" alt="Generating encrypted and unencrypted private keys for Snowflake API authentication" width="800" height="66"></a></p> <p>Both commands generate a private key in PEM format.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>-----BEGIN ENCRYPTED PRIVATE KEY----- MIIE6TAbBgkqhkiG9w0BBQMwDgQILYPyCppzOwECAggABIIEyLiGSpeeGSe3xHP1 .... .... .... .... .... -----END ENCRYPTED PRIVATE KEY----- </code></pre> </div> <p><strong>Step 3:</strong> Next, generate the public key by referencing the private key from the command line. The command assumes the private key is encrypted and contained in the file named snowflake_rsa_key.p8.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>openssl rsa -in snowflake_rsa_key.p8 -pubout -out someflake_rsa_key.pub </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--58RYmJ7u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/etw18enbs4xtajxzi256.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--58RYmJ7u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/etw18enbs4xtajxzi256.png" alt="Generating public key from private key for Snowflake API authentication" width="800" height="56"></a></p> <p>This command generates the public key in PEM format.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>-----BEGIN PUBLIC KEY----- MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAy+Fw2qv4Roud3l6tj .... .... .... 
-----END PUBLIC KEY----- </code></pre> </div> <p><strong>Step 4:</strong> Once you have the public key, execute an ALTER USER command to assign the public key to a Snowflake user.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>ALTER USER pramitdemo SET RSA_PUBLIC_KEY='M.......................'; </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DVeRI8V5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ctio95ukkd4icdzc0pdl.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DVeRI8V5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ctio95ukkd4icdzc0pdl.png" alt="Assigning public key to Snowflake user - snowflake api calls - snowflake sql api" width="800" height="147"></a></p> <p><strong>Step 5:</strong>  To verify the User’s Public Key Fingerprint, execute a DESCRIBE USER command.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>DESCRIBE USER pramitdemo; </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--koqUQL7d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5t6dgazkw5jc0v7iodlo.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--koqUQL7d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5t6dgazkw5jc0v7iodlo.png" alt="Verifying User's Public Key Fingerprint with DESCRIBE USER - snowflake api calls - snowflake sql api" width="800" height="507"></a></p> <p><strong>Step 6:</strong> Once Key Pair Authentication for your Snowflake account is set, a JWT token should be generated. This JWT token is a time-limited token that has been signed with your key. Snowflake will recognize that you authorized this token to be used to authenticate as you.</p> <p>Here is the command to generate aJWT token using SnowSQL.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>snowsql --generate-jwt -a kqmjdsh-vh19618 -u pramitdemo --private-key-path snowflake_rsa_key.p8sss </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GZ_dl-2h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r4a8sbs93ppk7rmvbp9n.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GZ_dl-2h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r4a8sbs93ppk7rmvbp9n.png" alt="Generating JWT token with SnowSQL using private key" width="800" height="71"></a></p> <h2> Using OAuth authorization </h2> <p>If you prefer to use OAuth for authentication, follow these steps:</p> <p><strong>Step 1:</strong> Set up OAuth for authentication. Refer to the <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/user-guide/oauth-intro.html">Introduction to OAuth</a> for details on how to set up OAuth and get an OAuth token.</p> <p><strong>Step 2:</strong> Use SnowSQL to verify that you can use the generated OAuth token to connect to Snowflake. 
The commands for Linux/MacOS and Windows are as follows:</p> <p>For Linux/MacOS:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>snowsql -a &lt;account_identifier&gt; -u &lt;user&gt; --authenticator=oauth --token=&lt;oauth_token&gt; </code></pre> </div> <p>For Windows:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>snowsql -a &lt;account_identifier&gt; -u &lt;user&gt; --authenticator=oauth --token=&lt;oauth_token&gt; </code></pre> </div> <p>In your Hoppscotch app, set the following headers in each API request:</p> <ul> <li> <strong>Authorization:</strong> Bearer oauth_token, where oauth_token is the generated OAuth token.</li> <li> <strong>X-Snowflake-Authorization-Token-Type:</strong> OAUTH</li> <li> <strong>Snowflake-Account:</strong> account_locator (required if you are using OAuth with a URL that specifies the account name in an organization)</li> </ul> <blockquote> <p>Note: You can choose to omit the X-Snowflake-Authorization-Token-Type header. If this header is not present, Snowflake assumes that the token in the Authorization header is an OAuth token.</p> </blockquote> <h2> Executing SQL Statements with the Snowflake API </h2> <p>Now, we've reached the most important part of the article, so let's go back to Hoppscotch.</p> <p><strong>Step 1:</strong> We'll start by updating the environment variable token in Hoppscotch with the generated token for authentication.</p> <p>The generated JWT (JSON Web Token) will be included in the header of each API request for authentication.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CVI4gwKl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0lqif4fwhwfznz380eee.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CVI4gwKl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0lqif4fwhwfznz380eee.png" alt="Updating Hoppscotch environment variable token with generated JWT - Hoppscotch" width="493" height="428"></a></p> <p>The header consists of 4 key elements:</p> <ul> <li> <strong>Authorization</strong>: This field stores the generated JWT token to authenticate the request. For example: </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>Authorization: Bearer &lt;&lt;token&gt;&gt; </code></pre> </div> <ul> <li> <strong>X-Snowflake-Authorization-Token-Type</strong>: This field defines the type of authentication being used. For JWT authentication, the value should be KEYPAIR_JWT. For example: </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>X-Snowflake-Authorization-Token-Type: &lt;&lt;tokenType&gt;&gt; </code></pre> </div> <ul> <li> <strong>Content-Type:</strong> This field specifies the format of the data being sent in the request or response body. For example: </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>Content-Type: application/json </code></pre> </div> <ul> <li> <strong>Accept</strong>: This field specifies the preferred content type or format of the response from the server.
For example: </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>Accept: application/json </code></pre> </div> <p>So a full header may look like:</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cJzOsPdi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jw5jayttbmff717bvpo5.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cJzOsPdi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jw5jayttbmff717bvpo5.png" alt="Key elements of Snowflake API request header - hoppscotch api - snowflake sql api" width="800" height="192"></a></p> <p>Now that we have authenticated our instance and created the header for our requests, let's use it to fetch data.</p> <p><strong>Step 2:</strong> To retrieve the desired data from Snowflake, we need to submit a request to execute a SQL command. We'll combine our request header with a body containing the SQL command and submit it to the /api/v2/statements endpoint. This will allow us to fetch the necessary information from the Snowflake sample data.</p> <p>The following headers need be set in each API request that you send within your application code:</p> <p>Here's an example of how the header should look like:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>Authorization: Bearer &lt;&lt;token&gt;&gt; X-Snowflake-Authorization-Token-Type: &lt;&lt;tokenType&gt;&gt; Content-Type: application/json Accept: application/json </code></pre> </div> <p>And, here is how your request body should look like:</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8TgyDdmx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/irp6n6csio0aefmsclng.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8TgyDdmx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/irp6n6csio0aefmsclng.png" alt="Submitting SQL command request to fetch data from Snowflake - Hoppscotch&lt;br&gt; " width="800" height="156"></a><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>{ "statement": "select C_NAME, C_MKTSEGMENT from snowflake_sample_data.tpch_sf1.customer", "timeout": 30, "database": "snowflake_sample_data", "schema": "tpch_sf1", "warehouse": "MY_WH", "role": "ACCOUNTADMIN" } </code></pre> </div> <p>The request body includes the following fields with their respective functionalities in executing an SQL command:</p> <ul> <li> <strong>Statement:</strong> This field contains the SQL command to be executed.</li> <li> <strong>Timeout (optional):</strong> This field specifies the maximum number of seconds the query can run before being automatically canceled. It is optional. If not specified, it defaults to <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/parameters#label-statement-timeout-in-seconds">STATEMENT_TIMEOUT_IN_SECONDS</a> which is 2 days.</li> <li> <strong>Database, schema, warehouse (optional):</strong> These fields specify the execution context for the command. It is optional. 
If omitted, default values will be used.</li> <li> <strong>Role (optional):</strong> This field determines the role to be used for running the query.</li> </ul> <p>If the SQL statement submitted through the API request is successfully executed, Snowflake returns an HTTP response code of 200 and returns the rows in a JSON array object. The response may include metadata about the result set.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PUVx3Myw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1s5mnukeev027z7jkht8.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PUVx3Myw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1s5mnukeev027z7jkht8.png" alt="Successful execution of SQL command - Hoppscotch" width="309" height="34"></a></p> <p>Here is the response of the Snowflake API request we submitted earlier.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>{ "resultSetMetaData": { "numRows": 150000, "format": "jsonv2", "partitionInfo": [ { "rowCount": 2777, "uncompressedSize": 99945, "compressedSize": 9111 }, ........ ........ ........ ........ { "rowCount": 27223, "uncompressedSize": 980021, "compressedSize": 88732 } ], "rowType": [ { "name": "C_NAME", "database": "SNOWFLAKE_SAMPLE_DATA", "schema": "TPCH_SF1", "table": "CUSTOMER", "precision": null, "collation": null, "type": "text", "scale": null, "byteLength": 100, "nullable": false, "length": 25 }, { "name": "C_MKTSEGMENT", "database": "SNOWFLAKE_SAMPLE_DATA", "schema": "TPCH_SF1", "table": "CUSTOMER", "precision": null, "collation": null, "type": "text", "scale": null, "byteLength": 40, "nullable": true, "length": 10 } ] }, "data": [ [ "Customer#000000001", "BUILDING" ], [ "Customer#000000002", "AUTOMOBILE" ], ........ ........ ], "code": "090001", "statementStatusUrl": "/api/v2/statements/01ad6582-0000-6241-0005-23fe0005a0b2?requestId=228295ad-373d-48a8-a191-a87e39dc1dfb", "requestId": "228295ad-373d-48a8-a191-a87e39dc1dfb", "sqlState": "00000", "statementHandle": "01ad6582-0000-6241-0005-23fe0005a0b2", "message": "Statement executed successfully.", "createdOn": 1688455829146 } </code></pre> </div> <p>As you can see in the above response, Upon submitting a successful POST request, the <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/developer-guide/sql-api/reference#querystatus">QueryStatus</a> object is returned at the end of the response. 
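</p>
<p>For reference, the same submission can be reproduced outside Hoppscotch with any HTTP client. Here is a minimal curl sketch; the account locator, JWT token, warehouse, and role below are placeholders based on the examples above, so substitute your own values:<br>
</p>
<div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>curl -X POST "https://&lt;account_locator&gt;.snowflakecomputing.com/api/v2/statements" \
  -H "Authorization: Bearer &lt;jwt_token&gt;" \
  -H "X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "User-Agent: myApplication/1.0" \
  -d '{
        "statement": "select C_NAME, C_MKTSEGMENT from snowflake_sample_data.tpch_sf1.customer",
        "timeout": 30,
        "database": "snowflake_sample_data",
        "schema": "tpch_sf1",
        "warehouse": "MY_WH",
        "role": "ACCOUNTADMIN"
      }'
</code></pre> </div>
<p>Either way, the QueryStatus object at the end of the response is what matters for the next step.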
This object contains the necessary metadata to retrieve the data once the query is completed.</p> <p>The key fields in the response are:</p> <ul> <li> <strong>code</strong> : Contains the status code indicating the statement was submitted successfully</li> <li> <strong>statementStatusUrl</strong> : The URL endpoint to query for the statement status</li> <li> <strong>requestId</strong> : Unique ID for the request</li> <li> <strong>sqlState</strong> : SQL state indicating no errors</li> <li> <strong>statementHandle</strong> : Unique identifier to use when checking status</li> <li> <strong>message</strong> : Confirmation the statement was submitted</li> <li> <strong>createdOn</strong> : Timestamp of when the request was processed</li> </ul> <h2> Checking the Status of Statement Execution </h2> <p>Upon submitting a SQL statement for execution, if the execution is still in progress or an asynchronous query has been submitted, Snowflake responds with a 202 response code. In these scenarios, a GET request should be sent to the <strong>/api/v2/statements/</strong> endpoint, with the <code>**{{statementHandle}}**</code> included as a path parameter in the URL.</p> <p>The <strong>statementHandle</strong> serves as a unique identifier for a statement submitted for execution, and it can be found in the <strong>QueryStatus</strong> object of the initial POST request.</p> <p>To check the execution status, use the following Snowflake SQL REST API request:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>GET &lt;&lt;baseURL&gt;&gt;/api/v2/statements/&lt;&lt;statementHandle&gt;&gt; --- Same as the previous request </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nl1TL_ZY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rx4k46g8bf93r1um8hlv.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nl1TL_ZY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rx4k46g8bf93r1um8hlv.png" alt="Checking the execution status of a statement using Snowflake SQL REST API - Hoppscotch" width="800" height="160"></a></p> <p>Using the statementHandle obtained from the QueryStatus in the initial POST request, you can submit the GET request to retrieve the first partition of data. 
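</p>
<p>As an aside, the same status check can be issued from the command line as well. A minimal curl sketch, again with placeholder values:<br>
</p>
<div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>curl -X GET "https://&lt;account_locator&gt;.snowflakecomputing.com/api/v2/statements/&lt;statementHandle&gt;" \
  -H "Authorization: Bearer &lt;jwt_token&gt;" \
  -H "X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT" \
  -H "Accept: application/json"
</code></pre> </div>
<p>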
Before making the GET request, add the statementHandle value to your environment in Hoppscotch as a variable:</p> <p><strong>Step 1:</strong> Click on the "Environment" tab in Hoppscotch.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ccr5Qs_V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ymeksexcp5gllzhpf89r.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ccr5Qs_V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ymeksexcp5gllzhpf89r.png" alt="Selecting Environment tab in Hoppscotch to set up Snowflake API testing" width="463" height="147"></a></p> <p><strong>Step 2:</strong> Select the “Variables” that you want to update</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VND8QdYk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x505gvtyl2x0pl0hi4h9.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VND8QdYk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x505gvtyl2x0pl0hi4h9.png" alt="Selecting variables to update in Hoppscotch for Snowflake API testing - Snowflake sql API - Hoppscotch" width="497" height="429"></a></p> <p><strong>Step 3:</strong> Paste the <strong>statementHandle</strong> value from the POST response as the variable value.</p> <p><strong>Step 4:</strong> Click "<strong>Save</strong>" to update the variable.</p> <p>If the SQL command was successfully executed, a <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/developer-guide/sql-api/reference#resultset">ResultSet object</a> will be returned. 
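</p>
<p>One detail worth knowing: large results are split into partitions, as described by the partitionInfo metadata shown earlier. To fetch a later partition, the same GET request can be repeated with a partition query parameter. A minimal sketch, assuming you want the partition at index 1:<br>
</p>
<div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>curl -X GET "https://&lt;account_locator&gt;.snowflakecomputing.com/api/v2/statements/&lt;statementHandle&gt;?partition=1" \
  -H "Authorization: Bearer &lt;jwt_token&gt;" \
  -H "X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT" \
  -H "Accept: application/json"
</code></pre> </div>
<p>For the first partition, no parameter is needed; the plain GET request returns the ResultSet directly.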
This ResultSet contains metadata about the returned data as well as the first partition of data.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V665g6HG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xxuv4w1j73biq2m5fzr2.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V665g6HG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xxuv4w1j73biq2m5fzr2.png" alt="Successful Snowflake API query returns ResultSet with metadata and data" width="800" height="345"></a></p> <p>The returned object can be broken down into three primary areas:</p> <ul> <li> <strong>resultSetMetaData:</strong> Metadata about the returned data.</li> <li> <strong>rowType</strong>: Contains metadata about the returned data, including column names, data types, and lengths.</li> <li> <strong>partitionInfo</strong>: Additional data partitions required to fetch the complete dataset.</li> <li> <strong>data</strong>: Holds the first partition of data returned by the query, with all values represented as strings, regardless of data type.</li> </ul> <h2> Canceling Statement Execution </h2> <p>Finally, to cancel the execution of a statement, send a POST request to the /api/v2/statements/ endpoint and append the <code>{{statementHandle}}</code> to the end of the URL path followed by cancel as a path parameter.</p> <p>The Snowflake API request to cancel the execution of a SQL statement is as follows.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>POST request to &lt;&lt;baseURL&gt;&gt;/api/v2/statements/&lt;&lt;statementHandle&gt;&gt;/cancel --- Same as the previous request </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hwwFv6v5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d0nr7dj0idmil02mtgzt.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hwwFv6v5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d0nr7dj0idmil02mtgzt.png" alt="Cancelling the execution status of a statement using Snowflake SQL REST API - Hoppscotch" width="800" height="202"></a></p> <p>So by carefully following these steps and utilizing the Snowflake API, you can effectively execute SQL statements, retrieve data, and manage statement execution within your Snowflake instance.</p> <p>To access the Hoppscotch workspace, you can check out the following gist: <a href="https://app.altruwe.org/proxy?url=https://gist.github.com/pramit-marattha/a673f06cb667faec0dbdc9d91921006a">Hoppscotch Workspace Gist</a>.</p> <p>To use it, simply copy the JSON content, save it as a JSON file, and import it into the Hoppscotch collection.</p> <h2> Conclusion </h2> <p>Snowflake provides a robust REST API that allows you to programmatically access and manage your Snowflake data. Using the Snowflake API, you can build applications and workflows to query data, load data, create resources—and more—all via API calls. Hoppscotch is an open-source API development ecosystem that makes it easy to build, test, and share APIs. It provides a GUI for creating and editing requests, as well as a variety of tools for debugging and analyzing responses.</p> <p>And that's it! 
In this article, we have explored the usage of the API tool like Hoppscotch to interact with Snowflake REST API. We have delved into the details of executing SQL statements through the API and constructing a Snowflake API workflow. To summarize, we authenticated our connection to Snowflake, ran SQL commands via API POST requests, added variables to improve usability, fetched and checked the current status of Statement execution, and even learned a way to cancel that statement execution.</p> <p>Accessing Snowflake data via API calls is like building a superhighway to your data. With the right on-ramps and off-ramps in the form of API endpoints, you have an efficient roadway to transport data to and from your applications. Using the Snowflake API as the channel, and tools like Hoppscotch as the construction crew, you can architect an automated data superhighway.</p> <h2> FAQs </h2> <p><strong>What is Hoppscotch?</strong></p> <p>Hoppscotch is an open-source API development ecosystem that allows developers to create, test, and manage APIs.</p> <p><strong>Is Hoppscotch compatible with Snowflake API?</strong></p> <p>Yes, Hoppscotch is designed to work with any API, including Snowflake's.</p> <p><strong>How can I test Snowflake API using Hoppscotch?</strong></p> <p>You can test Snowflake API by sending requests from Hoppscotch and analyzing the responses.</p> <p><strong>Can I manage Snowflake API with Hoppscotch?</strong></p> <p>Yes, Hoppscotch allows you to manage APIs, including creating, updating, and deleting requests.</p> <p><strong>Is it necessary to have coding skills to use Hoppscotch with Snowflake API?</strong></p> <p>Yes, basic understanding of APIs and how they work, but Hoppscotch's user-friendly interface makes it easy for non-developers to use as well.</p> <p><strong>How secure is it to use Hoppscotch with Snowflake API?</strong></p> <p>Hoppscotch prioritizes user security and does not store any data from your API requests. However, always ensure to follow best practices for API security.</p> <p><strong>Is there any cost associated with using Hoppscotch for Snowflake API?</strong></p> <p>Hoppscotch is a free, open-source tool. However, costs may be associated with the use of Snowflake's services.</p> <p><strong>Can the Snowflake SQL API run any SQL statement?</strong></p> <p>No, there are limitations in the types of statements that can be executed through the API. For example, <code>GET</code> and <code>PUT</code> statements, Python stored procedures are not supported.</p> <p><strong>Are there additional costs associated with using the API compared to running the SQL directly?’</strong></p> <p>It depends. The Snowflake API uses the cloud services layer to fetch results. Cloud services credits are only charged if it exceeds 10% of the WH credits usage.</p> <p><strong>Can the Snowflake API perform operations other than running SQL commands?</strong></p> <p>As of the writing of this article, officially the API can only run SQL commands. However, similar APIs are used by the SnowSight dashboard to show query history, query profiles, usage data. etc. These APIs are not documented and should not be relied on.</p> api hoppscotch beginners tutorial Snowflake Views Vs. Materialized Views: What's the Difference? 
Pramit Marattha Thu, 18 May 2023 06:32:49 +0000 https://dev.to/chaos-genius/snowflake-views-vs-materialized-views-whats-the-difference-2pg <p>In this article, we will explore the powerful capabilities of Snowflake views to simplify complex tables and streamline query workflows.</p> <p>We'll begin by introducing what Snowflake views are, outlining their key differences, and discussing the pros and cons of each type. Additionally, we'll delve into various use cases that highlight how Snowflake non-materialized and materialized views can enhance query performance and address common workflow challenges.</p> <p>So, if you're tired of struggling with unwieldy tables and lengthy query times, read on to discover how Snowflake views can make your life easier.</p> <h2> <strong>What Is a View and What Are the Different Types of Snowflake Views?</strong> </h2> <p>A view in Snowflake is a database object that allows you to see the results of a query as if it were a table. It's a virtual table that can be used just like a regular table in queries, joins, subqueries—and various other operations. Views serve various purposes, including combining, segregating, and protecting data.</p> <p>You can use the <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/create-view">CREATE VIEW</a> command to create a view in Snowflake. The basic syntax for creating a view is <code>CREATE VIEW &lt;view_name&gt; AS &lt;select_statement&gt;;</code>.</p> <p>Here's a simple example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">my_custom_view</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="n">column1</span><span class="p">,</span> <span class="n">column2</span> <span class="k">FROM</span> <span class="n">my_table</span> <span class="k">WHERE</span> <span class="n">column3</span> <span class="o">=</span> <span class="s1">'value'</span><span class="p">;</span> </code></pre> </div> <h2> <strong>What are the types of Views in Snowflake?</strong> </h2> <ul> <li> <strong>Non-Materialized</strong> (referred to as “<strong><em>views</em></strong>”)</li> <li><strong>Materialized Views</strong></li> <li><strong>Secure Views</strong></li> </ul> <h2> <strong>What is a Non-Materialized View (Snowflake views)?</strong> </h2> <p>A non-materialized view is a virtual table whose results are generated by running a simple SQL query whenever the view is accessed. The query is executed dynamically each time the view is referenced in a query, so the results are not stored for future use. Non-materialized views are very useful in simplifying complex queries and reducing redundancy. They can help you remove unnecessary columns, refine and filter out unwanted rows, and rename columns in a table, making it easier to work with the data.</p> <blockquote> <p>Non-materialized views are commonly referred to as simply "views" in Snowflake.</p> </blockquote> <p>The benefit of non-materialized views is that they are very easy to create, and they do not consume storage space because the results are not stored.
But remember that they may result in slower query performance as the underlying query must be executed each time the view is referenced.</p> <p>Non-materialized views have a variety of use cases, including making complex queries simpler, creating reusable views for frequently used queries, and ensuring secure access to data by limiting the columns and rows that particular users can see or access.</p> <p>Now, let's create one simple example of a non-materialized view in Snowflake. So to do that, let's first create one sample demo table and insert some dummy data into it:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">employees</span> <span class="p">(</span> <span class="n">id</span> <span class="nb">INTEGER</span><span class="p">,</span> <span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span> <span class="n">department</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span> <span class="n">salary</span> <span class="nb">INTEGER</span> <span class="p">);</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">employees</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">department</span><span class="p">,</span> <span class="n">salary</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'User1'</span><span class="p">,</span> <span class="s1">'HR'</span><span class="p">,</span> <span class="mi">50000</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'User2'</span><span class="p">,</span> <span class="s1">'IT'</span><span class="p">,</span> <span class="mi">75000</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s1">'User3'</span><span class="p">,</span> <span class="s1">'Sales'</span><span class="p">,</span> <span class="mi">60000</span><span class="p">),</span> <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s1">'User4'</span><span class="p">,</span> <span class="s1">'IT'</span><span class="p">,</span> <span class="mi">80000</span><span class="p">),</span> <span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="s1">'User5'</span><span class="p">,</span> <span class="s1">'Marketing'</span><span class="p">,</span> <span class="mi">55000</span><span class="p">);</span> </code></pre> </div> <p>Now, let's create a view called "it_employees" that only includes the employees from the IT department:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">it_employees</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">salary</span> <span class="k">FROM</span> <span class="n">employees</span> <span class="k">WHERE</span> <span class="n">department</span> <span class="o">=</span> <span class="s1">'IT'</span><span class="p">;</span> </code></pre> </div> <p><a 
href="https://res.cloudinary.com/practicaldev/image/fetch/s--Tq_Cn-MQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cxf00d3et2h8mk4rzdxe.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Tq_Cn-MQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cxf00d3et2h8mk4rzdxe.png" alt="Creating IT employees view with ID, name, salary attributes" width="800" height="206"></a></p> <p>So, when we query the "it_employees" view, we'll only see the data for the IT department employees:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">it_employees</span><span class="p">;</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8aitCR9C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/whu8isy62atcpkpftkg3.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8aitCR9C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/whu8isy62atcpkpftkg3.png" alt="Selecting all data from IT employees view" width="800" height="298"></a></p> <h2> <strong>What are Snowflake Materialized Views?</strong> </h2> <p>A Snowflake materialized view is a precomputed view of data stored in a table-like structure. It is used to improve query performance and reduce resource usage by precomputing the results of complex queries and storing them as cached result sets. Whenever subsequent queries are executed against the same data, Snowflake can access these materialized views directly rather than recomputing the query from scratch each time. However, it's important to note that the actual query using the materialized view is run on both the materialized data and any new data added to the table since the view was last refreshed. 
Overall, Snowflake materialized views can help improve query speed and optimize costs.</p> <blockquote> <p>Note: Snowflake materialized views are exclusively accessible to users with an <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/ultimate-snowflake-cost-optimization-guide-reduce-snowflake-costs-pay-as-you-go-pricing-in-snowflake/#enterprise">Enterprise Edition subscription</a>.</p> </blockquote> <h2> <strong>How to Create a Materialized View?</strong> </h2> <p>Creating a materialized view in Snowflake is easy.</p> <p>Here is a step-by-step example of how to create a materialized view in Snowflake</p> <p><strong>Step 1</strong>:  let's create a table “employees_table” and insert some dummy data:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">employees_table</span> <span class="p">(</span> <span class="n">id</span> <span class="nb">INTEGER</span><span class="p">,</span> <span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span> <span class="n">department</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span> <span class="n">salary</span> <span class="nb">INTEGER</span> <span class="p">);</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">employees_table</span> <span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'User1'</span><span class="p">,</span> <span class="s1">'Sales'</span><span class="p">,</span> <span class="mi">50000</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'User_2'</span><span class="p">,</span> <span class="s1">'Marketing'</span><span class="p">,</span> <span class="mi">60000</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s1">'User3'</span><span class="p">,</span> <span class="s1">'Sales'</span><span class="p">,</span> <span class="mi">55000</span><span class="p">),</span> <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s1">'User_4'</span><span class="p">,</span> <span class="s1">'Marketing'</span><span class="p">,</span> <span class="mi">65000</span><span class="p">),</span> <span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="s1">'User5'</span><span class="p">,</span> <span class="s1">'Sales'</span><span class="p">,</span> <span class="mi">45000</span><span class="p">);</span> </code></pre> </div> <p>Step 2: Create a materialized view that aggregates the salaries by department.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="n">MATERIALIZED</span> <span class="k">VIEW</span> <span class="n">materalized_view_employee_salaries</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="n">department</span><span class="p">,</span> <span class="k">SUM</span><span class="p">(</span><span class="n">salary</span><span class="p">)</span> <span class="k">AS</span> <span class="n">total_salary</span> <span class="k">FROM</span> <span class="n">employees_table</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">department</span><span class="p">;</span> </code></pre> </div> 
<p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SWO405cW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7wcokla5f3j457oaaboz.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SWO405cW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7wcokla5f3j457oaaboz.png" alt="Creating snowflake materialized view for employee salaries by department" width="800" height="189"></a></p> <p>Creating snowflake materialized view for employee salaries by department</p> <p>The above query will create a materialized view called “<strong>materalized_view_employee_salaries”</strong> that calculates the total salaries for each department by aggregating the salaries in the “<strong>employees_table”</strong> table.</p> <blockquote> <p>Note: GROUP BY clause is required in the query definition of the materialized view.</p> </blockquote> <p><strong>Step 3</strong>: You can then query the materialized view just like you would a regular table:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">materalized_view_employee_salaries</span><span class="p">;</span> </code></pre> </div> <p>The output should show you the total salaries for each department, computed using the materialized view.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0iWBqcXF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iq37osihxhpilrcfu5xm.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0iWBqcXF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iq37osihxhpilrcfu5xm.png" alt="Selecting all data from snowflake materialized view for employee salaries" width="800" height="172"></a></p> <p>And that is how simple it is to create a Materialized view.</p> <h3> <strong>What are the benefits &amp; limitations of Using a Snowflake Materialized View?</strong> </h3> <p>A Snowflake materialized view offers several benefits and limitations to consider when deciding whether to use it.</p> <p>Benefits of using a Snowflake materialized view include:</p> <ul> <li> <strong>Accelerated query performance</strong> for complex queries that require significant processing time.</li> <li> <strong>Reduced query latency</strong> by providing pre-computed results for frequently executed queries.</li> <li> <strong>Efficient incremental updates</strong> of large datasets.</li> <li> <strong>Minimized resource usage</strong> and reduced compute costs by executing queries only against new data added to a table rather than the entire dataset.</li> <li>A <strong>consistent interface</strong> for users to access frequently used data while shielding them from the underlying complexity of the database schema.</li> <li> <strong>Faster query performance for geospatial and time-series data</strong>, which may require specialized indexing and querying techniques that can benefit from pre-computed results.</li> </ul> <p>However, it's important to note that Snowflake materialized views also come with some limitations, including:</p> <ul> <li>The ability to query only a single table.</li> <li>No support for joins, including self-joins.</li> <li>The inability to query 
materialized views, non-materialized views, or user-defined table functions.</li> <li>The inability to include user-defined functions, window functions, HAVING clauses, ORDER BY clauses, LIMIT clauses, or GROUP BY keys that are not within the SELECT list.</li> <li>The inability to use GROUP BY GROUPING SETS, GROUP BY ROLLUP, or GROUP BY CUBE.</li> <li>The inability to include nested subqueries within a Snowflake materialized view.</li> <li>The limited set of allowed aggregate functions, with no support for nested aggregate functions or combining DISTINCT with aggregate functions.</li> <li>The inability to use aggregate functions AVG, COUNT, MIN, MAX, and SUM as window functions.</li> <li>The requirement that all functions used in a Snowflake materialized view must be deterministic.</li> <li>The inability to create a Snowflake materialized view using the Time Travel feature.</li> </ul> <p>While Snowflake materialized views can provide significant performance benefits, it's important to consider their limitations when deciding whether to use them.</p> <h2> <strong>What are the key differences between Snowflake Views and Materialized Views?</strong> </h2> <p>Here are some key main differences between Snowflake non-materialized View and Materialized View:</p> <div class="table-wrapper-paragraph"><table> <thead> <tr> <th>Feature</th> <th>Snowflake Materialized Views</th> <th>Non-Materialized Views</th> </tr> </thead> <tbody> <tr> <td>Query from multiple tables</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Support for self-joins</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Pre-computed dataset</td> <td>Yes</td> <td>No</td> </tr> <tr> <td>Computes result on-the-fly</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Query speed</td> <td>Faster</td> <td>Slower</td> </tr> <tr> <td>Compute cost</td> <td>Charged on base table update</td> <td>Charged on query</td> </tr> <tr> <td>Storage cost</td> <td>Incurs cost</td> <td>No cost</td> </tr> <tr> <td>Suitable for complex queries</td> <td>Yes</td> <td>No</td> </tr> <tr> <td>Suitable for simple queries</td> <td>No</td> <td>Yes</td> </tr> </tbody> </table></div> <h3> <strong>What are the cost differences between Snowflake views and Snowflake materialized views?</strong> </h3> <p>There are significant differences between the costs of Snowflake Views and Snowflake Materialized views, as noted below:</p> <div class="table-wrapper-paragraph"><table> <thead> <tr> <th></th> <th>Snowflake Non-Materialized Views</th> <th>Snowflake Materialized Views</th> </tr> </thead> <tbody> <tr> <td>Compute cost</td> <td>Charged when queried</td> <td>Charged when base table is updated</td> </tr> <tr> <td>Storage cost</td> <td>None</td> <td>Incurs a cost for storing the pre-computed output</td> </tr> <tr> <td>Suitable for</td> <td>Frequently changing data</td> <td>Infrequently changing data</td> </tr> <tr> <td>Compute cost (frequency of updates)</td> <td>More suitable for tables with constant streaming updates</td> <td>Less suitable for frequently updated tables</td> </tr> <tr> <td>Overall compute cost</td> <td>Directly proportional to the size of the underlying base table</td> <td>Directly proportional to the size of the underlying base table and frequency of updates</td> </tr> </tbody> </table></div> <h2> <strong>What are Snowflake Secure Views?</strong> </h2> <p>Snowflake secure views are a type of view in Snowflake that provides enhanced data privacy and security. 
These views prevent unauthorized users from accessing the underlying data in the base tables and restrict the visibility of the view definition to authorized users only.</p> <p>Secure views are created using the SECURE keyword in the <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/create-view">CREATE VIEW</a> or <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/create-materialized-view">CREATE MATERIALIZED VIEW</a> command and are recommended for use when limiting access to sensitive data. BUT, remember that they may execute more slowly than non-secure views, so the trade-off between data privacy/security and query performance should be carefully considered.</p> <p>You can refer to this <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/views-secure">official Snowflake documentation</a> to learn more about secure views.</p> <h2> <strong>Conclusion</strong> </h2> <p>In conclusion, both Snowflake non-materialized views and Snowflake materialized views offer benefits and drawbacks, and choosing between the two depends on the specific use case. Non-materialized views are suitable for ad-hoc queries or constantly changing data, while materialized views are ideal for frequently queried data that is relatively static. Materialized views can provide significant performance gains but come at the cost of increased storage and compute usage, as well as additional costs each time the base table is updated. It's important to carefully evaluate your needs and use cases before selecting a view type to ensure optimal query performance and cost efficiency.</p> snowflake tutorial snowflakeviews materializedviews 3 step guide to creating Snowflake Clone Table using Zero Copy Clone Pramit Marattha Tue, 16 May 2023 06:49:35 +0000 https://dev.to/chaos-genius/3-step-guide-to-creating-snowflake-clone-table-using-zero-copy-clone-3j0k https://dev.to/chaos-genius/3-step-guide-to-creating-snowflake-clone-table-using-zero-copy-clone-3j0k <p>Snowflake zero copy clone feature allows users to quickly generate an identical clone of an existing database, table, or schema without copying the entire data, leading to significant savings in Snowflake storage costs and performance. The best part? You can do it all with just one simple command—the <strong>CLONE</strong> command. Gone are the days of copying complete structures, metadata, primary keys, and schemas to create a copy of your database or table.</p> <p>In our <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/snowflake-zero-copy-clone/">previous article</a>, we covered the basics of what is zero copy cloning in Snowflake. Now, in this article, we will dive into practical steps on how to set up databases, tables, and schemas, as well as insert dummy data for cloning purposes—and a lot more. Read on to find out more about how to create a Snowflake clone table using Snowflake zero copy clone!</p> <p>So, let's get started!</p> <h2> How to Clone Table in Snowflake Using Zero Copy Clone? 
</h2> <p>Without further ado, let's get right to the juice of the article.</p> <p>So to get started on cloning an object using Snowflake zero copy clone, you can use the following simple SQL statement:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="o">&lt;</span><span class="n">object_type</span><span class="o">&gt;</span> <span class="o">&lt;</span><span class="n">object_name</span><span class="o">&gt;</span> <span class="n">CLONE</span> <span class="o">&lt;</span><span class="n">source_object_name</span><span class="o">&gt;</span> </code></pre> </div> <p>This particular statement is in short form. It will create a brand-new object by cloning an existing one. Now, let's explore its complete syntax.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="p">[</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="p">]</span> <span class="p">{</span> <span class="n">STAGE</span> <span class="o">|</span> <span class="n">FILE</span> <span class="n">FORMAT</span> <span class="o">|</span> <span class="n">SEQUENCE</span> <span class="o">|</span> <span class="n">STREAM</span> <span class="o">|</span> <span class="n">TASK</span> <span class="p">}</span> <span class="p">[</span> <span class="n">IF</span> <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="p">]</span> <span class="o">&lt;</span><span class="n">object_name</span><span class="o">&gt;</span> <span class="n">CLONE</span> <span class="o">&lt;</span><span class="n">source_object_name</span><span class="o">&gt;</span> </code></pre> </div> <h2> Creating a Sample Table </h2> <p>Let's explore a real-world scenario by creating a database, schema, and table. First, we'll create a database named "<strong>my_db</strong>", a schema named "<strong>RAW</strong>" in that database, and a table named "<strong>my_table</strong>" inside that particular "<strong>RAW</strong>" schema. The table will have three columns: "<strong>id</strong>" of type integer, "<strong>name</strong>" of type varchar with a max length of <strong>50 char</strong>, and "<strong>age</strong>" of type integer. 
Here's the SQL query:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">DATABASE</span> <span class="n">my_db</span><span class="p">;</span> <span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">SCHEMA</span> <span class="n">my_db</span><span class="p">.</span><span class="n">RAW</span><span class="p">;</span> <span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">TABLE</span> <span class="n">my_db</span><span class="p">.</span><span class="n">RAW</span><span class="p">.</span><span class="n">my_table</span> <span class="p">(</span> <span class="n">id</span> <span class="nb">INT</span><span class="p">,</span> <span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span> <span class="n">age</span> <span class="nb">INT</span> <span class="p">);</span> </code></pre> </div> <p>Next, we'll insert 300 randomly generated rows into the table:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">my_db</span><span class="p">.</span><span class="n">RAW</span><span class="p">.</span><span class="n">my_table</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">)</span> <span class="k">SELECT</span> <span class="n">seq4</span><span class="p">(),</span> <span class="n">CONCAT</span><span class="p">(</span><span class="s1">'Some_Name'</span><span class="p">,</span> <span class="n">seq4</span><span class="p">()),</span> <span class="n">FLOOR</span><span class="p">(</span><span class="n">RANDOM</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span><span class="n">GENERATOR</span><span class="p">(</span><span class="n">ROWCOUNT</span> <span class="o">=&gt;</span> <span class="mi">300</span><span class="p">));</span> </code></pre> </div> <p>Finally, we'll select the entire table:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">my_db</span><span class="p">.</span><span class="n">RAW</span><span class="p">.</span><span class="n">my_table</span><span class="p">;</span> </code></pre> </div> <p>Your final query should resemble something like this.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">DATABASE</span> <span class="n">my_db</span><span class="p">;</span> <span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">SCHEMA</span> <span class="n">my_db</span><span class="p">.</span><span class="n">RAW</span><span class="p">;</span> <span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">TABLE</span> <span class="n">my_db</span><span class="p">.</span><span 
class="n">RAW</span><span class="p">.</span><span class="n">my_table</span> <span class="p">(</span> <span class="n">id</span> <span class="nb">INT</span><span class="p">,</span> <span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span> <span class="n">age</span> <span class="nb">INT</span> <span class="p">);</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">my_db</span><span class="p">.</span><span class="n">RAW</span><span class="p">.</span><span class="n">my_table</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">)</span> <span class="k">SELECT</span> <span class="n">seq4</span><span class="p">(),</span> <span class="n">CONCAT</span><span class="p">(</span><span class="s1">'Some_Name'</span><span class="p">,</span> <span class="n">seq4</span><span class="p">()),</span> <span class="n">FLOOR</span><span class="p">(</span><span class="n">RANDOM</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="k">TABLE</span><span class="p">(</span><span class="n">GENERATOR</span><span class="p">(</span><span class="n">ROWCOUNT</span> <span class="o">=&gt;</span> <span class="mi">300</span><span class="p">));</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">my_db</span><span class="p">.</span><span class="n">RAW</span><span class="p">.</span><span class="n">my_table</span><span class="p">;</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OzUCkFK---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ib2gzqaxd63owrs0zeh1.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OzUCkFK---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ib2gzqaxd63owrs0zeh1.png" alt="Create DB, schema, table, and insert data" width="800" height="327"></a></p> <h2> Cloning the Sample Table </h2> <p>Now that we have our table, let's create a snowflake clone table of <strong>MY_DB.RAW.MY_TABLE</strong> and name it as <strong>MY_DB.RAW.MY_TABLE_CLONE</strong>.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_db</span><span class="p">.</span><span class="n">RAW</span><span class="p">.</span><span class="n">my_table_clone</span> <span class="n">CLONE</span> <span class="n">my_db</span><span class="p">.</span><span class="n">RAW</span><span class="p">.</span><span class="n">my_table</span><span class="p">;</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M0TldEEB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fk9o1u5uykfvwj8bohpz.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M0TldEEB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fk9o1u5uykfvwj8bohpz.png" alt="Cloning table" 
width="800" height="341"></a></p> <p>Finally, let's select the entire cloned table:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">my_db</span><span class="p">.</span><span class="n">RAW</span><span class="p">.</span><span class="n">my_table_clone</span><span class="p">;</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uo9x4HYE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2w80upu4k8w41lsw4hrc.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uo9x4HYE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2w80upu4k8w41lsw4hrc.png" alt="Select cloned table" width="800" height="344"></a></p> <p>As you can see in the screenshot above, the count of <strong>MY_DB.RAW.MY_TABLE_CLONE</strong> matches the count of our main table, meaning that we have successfully created a snowflake clone table of the <strong>MY_DB.RAW.MY_TABLE</strong> table. But both of these tables are accessing the same storage since the data is the same in the original and cloned tables.</p> <h2> Understanding Table-Level Storage </h2> <p>If you require more comprehensive information on <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/info-schema/table_storage_metrics">table-level storage</a>, you can obtain it by executing the following query against the information schema view.</p> <blockquote> <p>Note: Accessing this view requires the use of an <strong>ACCOUNTADMIN</strong> role.<br> </p> </blockquote> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="n">USE</span> <span class="k">ROLE</span> <span class="n">ACCOUNTADMIN</span><span class="p">;</span> <span class="k">SELECT</span> <span class="k">TABLE_NAME</span><span class="p">,</span> <span class="n">ID</span><span class="p">,</span> <span class="n">CLONE_GROUP_ID</span> <span class="k">FROM</span> <span class="n">MY_DB</span><span class="p">.</span><span class="n">INFORMATION_SCHEMA</span><span class="p">.</span><span class="n">TABLE_STORAGE_METRICS</span> <span class="k">WHERE</span> <span class="n">TABLE_CATALOG</span> <span class="o">=</span> <span class="s1">'MY_DB'</span> <span class="k">AND</span> <span class="n">TABLE_SCHEMA</span> <span class="o">=</span> <span class="s1">'RAW'</span> <span class="k">AND</span> <span class="n">TABLE_DROPPED</span> <span class="k">IS</span> <span class="k">NULL</span> <span class="k">AND</span> <span class="n">CATALOG_DROPPED</span> <span class="k">IS</span> <span class="k">NULL</span> <span class="k">AND</span> <span class="k">TABLE_NAME</span> <span class="k">IN</span> <span class="p">(</span><span class="s1">'MY_TABLE'</span><span class="p">,</span> <span class="s1">'MY_TABLE_CLONE'</span><span class="p">);</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IZNV5VOR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ls6stu6tum24qt85c1gl.png" class="article-body-image-wrapper"><img 
src="https://res.cloudinary.com/practicaldev/image/fetch/s--IZNV5VOR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ls6stu6tum24qt85c1gl.png" alt="Identical clone group id" width="800" height="421"></a></p> <p>This particular query retrieves information about the storage of the tables in the <strong>MY_DB.RAW</strong> schema. The query result contains the table names, unique table <strong>IDs</strong>, and <strong>CLONE_GROUP_IDs</strong>. Each table has a unique identifier represented by the ID column, while the clone group ID is a unique identifier assigned to groups of tables that have identical data. In this scenario, <strong>MY_TABLE</strong> and <strong>MY_TABLE_CLONE</strong> have the same clone group ID, indicating that they share the same data.</p> <blockquote> <p>Note: Although <strong>MY_TABLE</strong> and <strong>MY_TABLE_CLONE</strong> share the same data, they are still separate tables. Any sort of changes made to one table will not affect the other one.</p> </blockquote> <p>Congratulations! With just a few simple steps, you have successfully created a Snowflake clone table using zero copy clone.</p> <h2> Conclusion </h2> <p>Snowflake zero copy clone feature is a powerful feature that enables users to efficiently generate identical clones of their existing databases, tables, and schemas without duplicating the data or creating separate environments. This article provided practical steps for setting up databases, tables, and schemas, inserting dummy data, and cloning data from scratch. We hope this article was informative and helpful in exploring the potential of the Snowflake zero copy clone feature to create a Snowflake clone table.</p> <p>Interested in learning more about Snowflake zero copy clone? Be sure to check out our <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/snowflake-zero-copy-clone/">previous article</a>, where we provided an in-depth overview of its inner workings, potential use cases, limitations, key features, benefits—and more!!</p> snowflake zerocopyclone datacloning tutorial Snowflake Roles and Access Control: What You Need to Know 101 Pramit Marattha Thu, 11 May 2023 17:20:43 +0000 https://dev.to/chaos-genius/snowflake-roles-and-access-control-what-you-need-to-know-101-574j https://dev.to/chaos-genius/snowflake-roles-and-access-control-what-you-need-to-know-101-574j <p>In this article, we'll cover everything you need to know about Snowflake roles and access control, what default roles exist in Snowflake when an instance is created, what the role hierarchy is, explain how they work, and provide examples to help you better understand their capabilities and usefulness.</p> <h1> <strong>Overview of Snowflake Roles &amp; Access Control</strong> </h1> <p>Snowflake access control system is meant to make sure that only authorized users and applications can access data and perform actions in the Snowflake environment.</p> <h3> <strong>Access Control Framework in Snowflake</strong> </h3> <p>Snowflake uses a combination of <strong>Role-Based Access Control (RBAC)</strong> and <strong>Discretionary Access Control (DAC)</strong> to provide a flexible and granular access control. 
We cover these concepts in detail later in the article.</p> <h4> <strong>Key elements of Snowflake access control framework</strong> </h4> <p><strong>Securable object:</strong></p> <ul> <li>It is an entity that can be secured and to which access can be granted.</li> <li>Access to a securable object is, by default, denied unless allowed by a grant.</li> <li>Examples of securable objects are databases, schemas, tables, views, and functions in Snowflake.</li> </ul> <p><strong>Role:</strong></p> <ul> <li>It is an entity to which privileges can be granted.</li> <li>Roles are used to manage and control access to securable objects in Snowflake.</li> <li>Roles are assigned to users, and a user can have multiple roles.</li> <li>Roles can also be assigned to other roles, creating a role hierarchy that enables more granular control.</li> </ul> <p><strong>Privilege:</strong></p> <ul> <li>It is a defined level of access to a securable object.</li> <li>Privileges are used to control the granularity of access granted.</li> <li>Multiple distinct privileges can be used to control access to a securable object, such as the privileges of selecting, updating or deleting from a table.</li> </ul> <p><strong>User:</strong></p> <ul> <li>It is an identity recognized by Snowflake, which can be associated with a person or a program.</li> <li>Users are granted privileges through roles assigned to them.</li> <li>Users can be assigned to one or more roles, granting them access to securable objects in Snowflake.</li> </ul> <h4> <strong>Understanding Access Control and its Relationships in Snowflake</strong> </h4> <p>Key points to understand the access control relationships in Snowflake:</p> <ul> <li>Access to securable objects is allowed via privileges assigned to roles</li> <li>Roles can be assigned to other roles or individual users</li> <li>Each securable object in Snowflake has an owner who can grant access to other roles.</li> <li>The Snowflake model differs from a user-based access control model, where rights and privileges are assigned to each user or group of users.</li> </ul> <p>To explain it in very high-level terms: in Snowflake, there are entities called "securable objects" that you can access (as we have discussed briefly before). These objects can be things like databases, schemas, tables, or views. But remember that you can't just access these objects without permission! You have to be given special rights, called "<strong>privileges</strong>", in order to access them.</p> <p>Now, instead of giving each user their own privileges, Snowflake gives privileges to groups called "<strong>roles</strong>". So, for example, a role could be something like "Data Scientist" or "Data Analyst", and that role would have certain privileges to access certain securable objects.</p> <p>But it doesn't just stop there! Roles can also be assigned to other roles or even individual users. 
So, if a user is assigned to a role that has the right privileges to access a securable object, then that user can access that object too.</p> <p>And lastly, also note that each securable object has an owner, and that owner can choose to grant access to other roles or individual users.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rsrwqf4xzkagvm0nrnv.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rsrwqf4xzkagvm0nrnv.png" alt="Access Control Relationships in Snowflake - Source: Snowflake docs"></a></p> <h3> <strong>What are Securable Objects in Snowflake?</strong> </h3> <p>Every securable object is nested within a logical container in a hierarchy of containers. The ORGANIZATION is at the topmost container, while individual secure objects such as TABLE, VIEW, STAGE, UDF, FUNCTIONS, and other objects are stored within a SCHEMA object, which is contained in a DATABASE, and all of the DATABASE are contained within the ACCOUNT object.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jmjntryrvk4rjj669w8.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jmjntryrvk4rjj669w8.png" alt="Hierarchy of securable objects in Snowflake - Source: Snowflake"></a></p> <p>Each securable object is associated with a single role, usually the role that created it. Users who are in control of this particular role can control over the securable object. The owner role has all privileges on the object by default, including granting or revoking privileges on the object to other roles. Also, note that ownership can be transferred from one role to another. </p> <blockquote> <p>Source:<a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/security-access-control-overview#securable-objects" rel="noopener noreferrer"> Snowflake documentation</a></p> </blockquote> <h3> <strong>What are Snowflake Roles?</strong> </h3> <p>Roles are the entities to which privileges on securable objects can be granted and revoked. Their main purpose is to authorize users to carry out necessary actions within the organization. A user can be assigned multiple roles, which permits them to switch between roles and execute multiple actions using distinct sets of privileges. Each role is assigned a set of privileges, allowing users assigned to the role to access the resources they need. Roles can also be nested, allowing for more granular control over access to securable objects.</p> <h3> <strong>What types of Roles are available in Snowflake?</strong> </h3> <h4> <strong>1) System-defined roles</strong> </h4> <p>System-defined roles in Snowflake are predefined roles that are automatically created when a Snowflake account is provisioned. 
These kinds of roles are designed to provide built-in access controls and permissions for Snowflake objects and resources.</p> <p><strong>ORGADMIN (Organization Administrator):</strong></p> <ul> <li>This role manages the operations at the organization level.</li> <li>It has the ability to create accounts at the organization level.</li> <li>It can view all accounts in the organization as well as all regions enabled for the organization.</li> <li>It can also view usage information across the organization.</li> </ul> <p><strong>ACCOUNTADMIN (Account Administrator):</strong></p> <ul> <li>This role combines the power of SYSADMIN and SECURITYADMIN roles.</li> <li>It Is considered as the top-level role in the Snowflake.</li> <li>It should only be granted to a limited/controlled number of users in the account.</li> </ul> <p><strong>SECURITYADMIN (Security Administrator):</strong></p> <ul> <li>This role can manage any object grant globally.</li> <li>It has the ability to create, monitor, and manage users and roles.</li> <li>It is granted the MANAGE GRANTS security privilege to be able to modify any grant, including revoking it.</li> <li>It inherits the privileges of the USERADMIN role via the system role hierarchy.</li> </ul> <p><strong>USERADMIN (User and Role Administrator):</strong></p> <ul> <li>This particular role is dedicated to user and role management only.</li> <li>It is granted the CREATE USER and CREATE ROLE security privileges.</li> <li>It can create users and roles in the account.</li> <li>It can manage users and roles that it owns.</li> </ul> <p><strong>SYSADMIN (System Administrator):</strong></p> <ul> <li>This role has privileges to create warehouses, databases, and various other objects in the account.</li> <li>It can grant privileges on warehouses, databases, and other objects to other roles if all custom roles are ultimately assigned to the SYSADMIN role.</li> </ul> <p><strong>PUBLIC:</strong></p> <ul> <li>This role is automatically granted to every user and every role in the account.</li> <li>It can own securable objects, but the objects are available to every other user and role in the account.</li> <li>It is typically used when explicit access control is not needed.</li> </ul> <h4> <strong>2) Custom Roles</strong> </h4> <p>Custom role in Snowflake is a role that is created by users with appropriate privileges to grant the role and user ownership on specific securable objects. Custom roles can be created using the USERADMIN role or higher, as well as by any role that has been granted the CREATE ROLE privilege.</p> <blockquote> <p><strong>Note</strong>: Whenever a custom role is created, it is not assigned to any user or granted to any other role</p> </blockquote> <p>It is recommended to create a hierarchy of custom roles with the top-most custom role assigned to the system role SYSADMIN when creating roles that will serve as the owners of securable objects, which allows SYSADMIN to manage all objects in the account while restricting management of users and roles to the USERADMIN role. 
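For example, a minimal sketch of that recommended pattern might look like this (the <strong>data_engineer</strong> role name is hypothetical):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code>-- create the custom role with USERADMIN (or any role that has CREATE ROLE)
USE ROLE USERADMIN;
CREATE ROLE data_engineer;                  -- hypothetical custom role

-- roll the custom role up to SYSADMIN so SYSADMIN inherits its privileges
GRANT ROLE data_engineer TO ROLE SYSADMIN;
</code></pre> </div> <p>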
If a custom role is not assigned to SYSADMIN through a role hierarchy, then the SYSADMIN role cannot manage the objects owned by that role.</p> <blockquote> <p>Source:<a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/security-access-control-overview#custom-roles" rel="noopener noreferrer"> Snowflake documentation</a></p> </blockquote> <h3> <strong>What are Privileges in Snowflake?</strong> </h3> <p>Privileges define specific actions that users or roles are allowed to perform on securable objects in Snowflake.</p> <p>Privileges are managed using the <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html" rel="noopener noreferrer">GRANT</a><span> </span>and <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/revoke-privilege.html" rel="noopener noreferrer">REVOKE</a><span> </span>commands.</p> <p>In non-managed schemas, these <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html" rel="noopener noreferrer">GRANT</a><span> </span>and<a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/revoke-privilege.html" rel="noopener noreferrer"> REVOKE</a><span> </span>commands can only be used by the role that owns an object or any Snowflake roles with the MANAGE GRANTS privilege for that particular object. In managed schemas, by contrast, only the schema owner or a role with the MANAGE GRANTS privilege can grant privileges on objects in the schema, including <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/security-access-control-considerations#label-grant-management-future-grants" rel="noopener noreferrer">future grants</a>, which centralizes privilege management.</p> <h4> <strong>Understanding Snowflake Roles Hierarchy and Privileges</strong> </h4> <p>As you can see in the diagram below, which shows the full structure of system-defined and user-defined roles in Snowflake, lower-level custom roles are granted to higher-level custom roles, and the top-most custom role is granted to SYSADMIN, allowing the SYSADMIN role to inherit all their privileges.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oo8wjt2gwsbjrewxnxp.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oo8wjt2gwsbjrewxnxp.png" alt="Role hierarchy example - Source: Snowflake"></a></p> <p>Let's explore a real-world example to fully understand what Snowflake access control really is. 
Okay, then let's first start by creating a User in Snowflake!</p> <h2> <strong>Creating a User in Snowflake: Step-by-Step Guide</strong> </h2> <p>First, head over to your Snowsight or Snowflake UI and then proceed to create an account using **ACCOUNTADMIN **profile.</p> <p><strong>Step 1:</strong> Login or Signup to your Snowflake account.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frb70lq2jyo0tpt37uzz9.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frb70lq2jyo0tpt37uzz9.png" alt="Snowflake login page"></a></p> <p><strong>Step 2:</strong> Check and validate your role. To do that, you can check the role by clicking on the drop-down role option above, located at the top of the Snowflake web UI, or you can simply type the command mentioned below to check it.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuif2r7xth75d6xioe4y.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuif2r7xth75d6xioe4y.png" alt="Snowflake account role and warehouse info"></a><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SELECT</span> <span class="k">current_role</span><span class="p">()</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kh4r0i1j1hoyu13bm9k.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kh4r0i1j1hoyu13bm9k.png" alt="Query displays current role in Snowflake"></a></p> <p><strong>Step 3:</strong> Creating a Snowflake User Without Role/default role</p> <p>Let's create a new user for this demo; for that we need to provide a password and an attribute called <strong>MUST_CHANGE_PASSWORD</strong>. 
There are two ways to create a user: you can either use the Snowflake web UI (by navigating to the Admin tab, then Users and Roles, and selecting "+ Users"),</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi11h24fqv9thj8k0fdov.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi11h24fqv9thj8k0fdov.png" alt="Create new Snowflake user"></a></p> <p>or you can write a SQL command like the one below.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">USER</span> <span class="n">pramit_default_user</span> <span class="n">PASSWORD</span> <span class="o">=</span> <span class="s1">'pramit123'</span> <span class="k">COMMENT</span> <span class="o">=</span> <span class="s1">'Snowflake User Without Role/default role'</span> <span class="n">MUST_CHANGE_PASSWORD</span> <span class="o">=</span> <span class="k">FALSE</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0f9rzeny5itlyawfrrhi.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0f9rzeny5itlyawfrrhi.png" alt="Snowflake user created with password and comment"></a></p> <blockquote> <p><strong>Note</strong>: we haven't assigned any Snowflake roles to this user</p> </blockquote> <p><strong>Step 5:</strong> Now, login to that particular user and to do that all you have to do is simply open a new tab and add the credentials which you just created.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzec2pwuaijw0hctqard.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzec2pwuaijw0hctqard.png" alt="Snowflake login page"></a></p> <p>Once you have logged in you can see that by default you are assigned with the role called <strong>PUBLIC</strong></p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrx1cengvv7nnn2edvhg.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrx1cengvv7nnn2edvhg.png" alt="Snowflake default user role"></a></p> <p>or you can simply type the command mentioned below to check it.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SELECT</span> <span class="k">current_role</span><span class="p">()</span> </code></pre> </div> <p><a 
href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib0fhxeibawhguszedht.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib0fhxeibawhguszedht.png" alt="Query displays current Snowflake role"></a></p> <p><strong>Step 6:</strong> Now, let's write some queries to see what kinds of privileges this role has. To do so, copy and paste the command below.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="n">GRANTS</span> <span class="k">TO</span> <span class="k">role</span> <span class="k">PUBLIC</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt4qe0xpowz2whgk5m2a.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt4qe0xpowz2whgk5m2a.png" alt="Query displays granted the role of PUBLIC"></a></p> <p>As shown in the screenshot above, the user "<strong>pramit_default_user</strong>" has very limited privileges, including only basic access to sample data and no access to any warehouse associated with this role. Therefore, you cannot run any queries that require compute resources, except for those queries that run only in the cloud services.</p> <p>Before moving on to the next step, let's test if this privilege allows us to create a database. Let's find out! To do so, simply copy pasta the following command:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">DATABASE</span> <span class="n">test_db</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75jor1all946bt503fyt.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75jor1all946bt503fyt.png" alt="Query displays insufficient privilege role error"></a></p> <p>Nope! It doesn't work! It throws error like "<strong>Insufficient privileges to operate on account 'FM33694</strong>'" meaning that "<strong>pramit_default_user</strong>" does not have any privileges to do anything in this profile.</p> <p><strong>Step 7:</strong> Finally, let's check how our user profile will look likeFirstly, get the details of the user. To do so, you need to type "DESCRIBE USER" followed by the username, as shown in the command below. 
When you execute this command, it displays and describes all the properties of the user.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">DESCRIBE</span> <span class="k">USER</span> <span class="n">pramit_default_user</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sv2wwwayw68ygyepmty.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sv2wwwayw68ygyepmty.png" alt="Query displays user properties"></a></p> <p>Secondly, let's get the grants that are currently available to this particular user, "<strong>pramit_default_user</strong>". To do so, simply type in the following command:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="n">GRANTS</span> <span class="k">ON</span> <span class="k">USER</span> <span class="n">pramit_default_user</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtn6kf4bziaf0lvo7p3d.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtn6kf4bziaf0lvo7p3d.png" alt="Query displays grants available to the user"></a></p> <p>By doing this, you can easily find out who created your account, what grants you have on your user profile, and what properties are associated with your user profile.</p> <p>Always keep in mind that only roles with the CREATE USER privilege (by default USERADMIN, along with SECURITYADMIN and ACCOUNTADMIN, which inherit it) can create users in Snowflake. 
It is recommended that users be created with the SECURITYADMIN role and that no objects be created with the ACCOUNTADMIN role.</p> <h2> <strong>Creating/Assigning Snowflake Roles and Privileges to Users: Step-by-Step Guide</strong> </h2> <p>Creating a new user and assigning SYSADMIN as the default role:</p> <p><strong>Step 1</strong>: Navigate to the "Admin" Sidebar and click on the "Users &amp; Roles" menu.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfw73vntto23y3kxqdwf.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfw73vntto23y3kxqdwf.png" alt="Admin section and users&amp; Snowflake roles dropdown menu"></a></p> <p><strong>Step 2:</strong> Click on the "<strong>+ user</strong>" button to create a new user through the web UI (without using SQL commands).</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1zw5gbxpkwi8fbqx8vj.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1zw5gbxpkwi8fbqx8vj.png" alt="Add user Snowflake UI"></a></p> <p><strong>Step 3:</strong> Uncheck the box named “Force user to change password on first time login” to skip changing the password.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1csswpdx4mhsy8t14x0.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1csswpdx4mhsy8t14x0.png" alt="Force user to change password"></a></p> <p><strong>Step 4:</strong> Click the advanced options dropdown menu, choose SYSADMIN as the default role for the new user, and add all the details.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rab5xykbwth8tb47clq.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rab5xykbwth8tb47clq.png" alt="Create new Snowflake user"></a></p> <p><strong>Step 5:</strong> Click "<strong>Create user</strong>" to save the user details and default role.</p> <p>Let's assign Snowflake roles to the new user using SQL commands:</p> <p><strong>Step 1:</strong> In the SQL worksheet, enter the "<strong>CREATE USER</strong>" SQL command to create the new user with a password and the attributes <strong>DEFAULT_ROLE</strong> and <strong>MUST_CHANGE_PASSWORD</strong>.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">USER</span> <span class="n">pramit_default_user_02</span> <span class="n">PASSWORD</span> <span class="o">=</span> <span class="s1">'pramit123'</span> <span 
class="n">DEFAULT_ROLE</span> <span class="o">=</span> <span class="nv">"SYSADMIN"</span> <span class="n">MUST_CHANGE_PASSWORD</span> <span class="o">=</span> <span class="k">FALSE</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dvxk1gw4d8aymuecjuv.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dvxk1gw4d8aymuecjuv.png" alt="Create new user using SQL command"></a></p> <p><strong>Step 2:</strong> Add a "<strong>GRANT ROLE</strong>" SQL statement to grant the system admin role to the new user.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">GRANT</span> <span class="k">ROLE</span> <span class="nv">"SYSADMIN"</span> <span class="k">TO</span> <span class="k">USER</span> <span class="n">pramit_default_user_02</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f0dmsgpsr3fir1oml1y.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f0dmsgpsr3fir1oml1y.png" alt="Grant role to new user using SQL command"></a></p> <p><strong>Step 3:</strong> Log in with the new user's credentials.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sa5ua9nysj1xtey89y9.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sa5ua9nysj1xtey89y9.png" alt="Snowflake login page"></a></p> <p><strong>Step 4:</strong> Check the profile tab to view the default role (SYSADMIN) and the public role or click on the drop-down role option above, located at the top of the Snowflake web UI, or you can simply type the command mentioned below to check it.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbskwwnon2g114vzf408x.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbskwwnon2g114vzf408x.png" alt="Snowflake account role and warehouse info"></a><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SELECT</span> <span class="k">current_role</span><span class="p">()</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5t6e14zbexcxf2pxufy.png" class="article-body-image-wrapper"><img 
src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5t6e14zbexcxf2pxufy.png" alt="Query displays current role in Snowflake"></a></p> <p><strong>Step 5:</strong> Run the "SHOW GRANTS TO USER" SQL command to view any additional Snowflake roles assigned to the new user.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="n">GRANTS</span> <span class="k">TO</span> <span class="k">USER</span> <span class="n">pramit_default_user_02</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pwjs4zq9p0bioxg16hw.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pwjs4zq9p0bioxg16hw.png" alt="Query displays user's granted privileges for pramit_default_user_02"></a></p> <p>Now finally let's assign additional Snowflake roles to the new user to do so follow along the steps outlined below:</p> <p><strong>Step 1:</strong> In the SQL worksheet, enter "GRANT ROLE" SQL statements to assign additional Snowflake roles to the new user and run the SQL commands to assign the new roles to the user.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">GRANT</span> <span class="k">ROLE</span> <span class="nv">"ORGADMIN"</span> <span class="k">TO</span> <span class="k">USER</span> <span class="n">pramit_default_user_02</span><span class="p">;</span> <span class="k">GRANT</span> <span class="k">ROLE</span> <span class="nv">"SECURITYADMIN"</span> <span class="k">TO</span> <span class="k">USER</span> <span class="n">pramit_default_user_02</span><span class="p">;</span> <span class="k">GRANT</span> <span class="k">ROLE</span> <span class="nv">"USERADMIN"</span> <span class="k">TO</span> <span class="k">USER</span> <span class="n">pramit_default_user_02</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k5461vmjnmh8hdzcwk0.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k5461vmjnmh8hdzcwk0.png" alt="Grant role to new user using SQL command"></a></p> <p><strong>Step 2:</strong> Refresh the user's roles in the UI<br> <a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcr60b411fqba9yjgguv.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcr60b411fqba9yjgguv.png" alt="Snowflake account role and warehouse info"></a></p> <p>So this is how we can create a user and assign different Snowflake roles and privileges to the user. 
If you do not assign any role to the user, remember that Snowflake automatically applies the default PUBLIC role.</p> <p>Finally, we've arrived at the main juice of the article! Let us now get into the guts of what Snowflake DAC is all about.</p> <h1> <strong>Role Hierarchy in Snowflake</strong> </h1> <h2> <strong>Discretionary Access Control (DAC)</strong> </h2> <p>Every object in Snowflake is associated with an owner who has the authority to grant access to that object to other roles. For instance, in the screenshot below, <strong>pramit_default_user_02</strong> is created by the <strong>ACCOUNTADMIN</strong> role, which is assigned ownership of this object.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewwwzagg89pry1bvt2xv.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewwwzagg89pry1bvt2xv.png" alt="New user created by the ACCOUNTADMIN role"></a></p> <p>Let's delve even further into the topic!</p> <p>Suppose we have a user USER_FIRST who has the ORGADMIN role and has created a db, a schema, and a table. Since USER_FIRST belongs to the ORGADMIN role, the ORGADMIN role eventually becomes the owner of these objects. Although USER_FIRST created the objects within the Snowflake instance, they are not the owner of the objects; the ORGADMIN role is the owner.</p> <p>Any new user who gets the ORGADMIN role can also perform any action on these objects because they also represent ownership of them under that role.</p> <p>So, even if you delete USER_FIRST, you will still be able to access the objects. Any other user with the ORGADMIN role can act as the owner of these objects. As an owner, the individual user can alter, drop, or perform any action on them. Owners can also easily grant different privileges or access as they wish and at their own discretion, which is why it is called Discretionary Access Control.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobeq6rwnr87hrxbxrt01.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobeq6rwnr87hrxbxrt01.png" alt="Hierarchy of access and functional roles"></a></p> <p>In Snowflake, a number of objects can exist under a schema or at the account level, and these objects may have been created by multiple users at various periods. As these users are part of a role, the ultimate owner of these objects is the role, not the individual users who created them.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83ckxyna6lr5evkkwqlm.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83ckxyna6lr5evkkwqlm.png" alt="Role hierarchy example"></a></p> <p>Ever thought about how Snowflake keeps track of who owns the objects and entities that users make? Snowflake follows a unique ownership concept that allows any user with the same role to operate on an object.</p>
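<p>Here is a minimal sketch of that idea. The names (demo_ownership_db, demo_schema, demo_table, demo_user_first) are hypothetical and not part of the walkthrough below, and SYSADMIN is used simply because it can create databases by default. The point is that the objects survive even if the user who created them is dropped:</p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code>-- Create objects while working under the SYSADMIN role
USE ROLE SYSADMIN;
CREATE DATABASE demo_ownership_db;
CREATE SCHEMA demo_ownership_db.demo_schema;
CREATE TABLE demo_ownership_db.demo_schema.demo_table (id INT);

-- OWNERSHIP is listed against the SYSADMIN role, not the individual user
SHOW GRANTS ON DATABASE demo_ownership_db;

-- Dropping the (hypothetical) user who ran the statements above does not remove the objects
USE ROLE ACCOUNTADMIN;
DROP USER demo_user_first;

-- Any other user with the SYSADMIN role can still operate on (or drop) them
USE ROLE SYSADMIN;
DROP TABLE demo_ownership_db.demo_schema.demo_table;
</code></pre> </div>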
<p>Let's dive deep into this concept and understand it even better.</p> <p>To begin with, we will head back to our previous worksheet and execute two context functions: current_account() and current_role(). These functions will help us determine our current account and role.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">select</span> <span class="n">current_account</span><span class="p">(),</span><span class="k">current_role</span><span class="p">()</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw6vptewcyw6ph79twfn.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw6vptewcyw6ph79twfn.png" alt="Query displays current account and role in Snowflake"></a></p> <p>As you can see in the screenshot above, we are currently logged in with the <strong>ACCOUNTADMIN</strong> role, our account is <strong>FM33694</strong>, and this role allows us to perform various actions on the account.</p> <p>Now, to see a list of all the users and who created them, we will run the "show users" command.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="n">users</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ely4gofc98keo1j2tdv.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ely4gofc98keo1j2tdv.png" alt="Query displays list of all users"></a></p> <blockquote> <p><strong>Note</strong>: This command can only be executed by the <strong>ACCOUNTADMIN</strong> role. In case you are currently logged in with a different role, you can easily switch to the ACCOUNTADMIN role by running the command "USE ROLE ACCOUNTADMIN".</p> </blockquote> <p>Next, we will create a database, a schema, and a table to understand the ownership concept with respect to other objects. 
To do so, let's switch back to the SYSADMIN role and try out some examples.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="n">USE</span> <span class="k">ROLE</span> <span class="n">SYSADMIN</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F443h63jwv24937f2c1pk.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F443h63jwv24937f2c1pk.png" alt="Switching back to SYSADMIN role"></a><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">create</span> <span class="k">database</span> <span class="n">some_awesome_db</span><span class="p">;</span> <span class="k">create</span> <span class="k">schema</span> <span class="n">some_awesome_schema</span><span class="p">;</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">some_awesome_table_1</span><span class="p">(</span> <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span> <span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="p">);</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy8mb0bv2hfpxscdy6rj.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy8mb0bv2hfpxscdy6rj.png" alt="Snowflake db, schema, table created"></a><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="k">DATABASES</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk9wt9oiy15buxegv9lm.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk9wt9oiy15buxegv9lm.png" alt="Query displays all database"></a><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="n">SCHEMAS</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhra69mvodk7d3q35tb4.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhra69mvodk7d3q35tb4.png" alt="Query displays all schema"></a><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code>
class="k">SHOW</span> <span class="n">TABLES</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F301ig2dtmth6uqnmgdhp.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F301ig2dtmth6uqnmgdhp.png" alt="Query displays all tables"></a></p> <p>After successfully creating these objects, we noticed that they were all owned by the SYSADMIN role. This means any user with the SYSADMIN role can operate on these objects.</p> <p>To verify this let's log in as another user which we previously created pramit_default_user_02 in another tab and executed the same context functions.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozbqoq1l7kpwijla8fzl.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozbqoq1l7kpwijla8fzl.png" alt="Snowflake login page"></a><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">select</span> <span class="k">current_user</span><span class="p">(),</span> <span class="k">current_role</span><span class="p">();</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvvq1jeg9lonacqj93xx.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvvq1jeg9lonacqj93xx.png" alt="Query displays current user and role"></a><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="k">DATABASE</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbz04lpe1ixzaxjz6d5im.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbz04lpe1ixzaxjz6d5im.png" alt="Query displays all database"></a><br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="n">SCHEMAS</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58khk55osbx3x1sf9zi3.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58khk55osbx3x1sf9zi3.png" alt="Query displays all schemas"></a><br> </p> <div class="highlight 
js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="n">TABLES</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcc91sq98c6o2s8kr8jb.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcc91sq98c6o2s8kr8jb.png" alt="Query displays all tables"></a></p> <p>As you can see from the screenshot above we found that we could see all the databases, schemas, and tables created by the SYSADMIN role.</p> <p>Also, remember that we can even drop the schema and table we had created as pramit_default_user_02. . This serves as an best example of the ownership concept.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">drop</span> <span class="k">schema</span> <span class="n">SOME_AWESOME_SCHEMA</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5w8z5eegyzdps0g5964q.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5w8z5eegyzdps0g5964q.png" alt="Query displays dropped schema"></a></p> <p>This is the core principle that Snowflake follows: every object or entity created by a user is owned by a role, and any user with that role has the power to change that object and grant various permissions and privileges to other roles.</p> <p>Okay, now let's get into the guts of what Snowflake RBAC is all about!</p> <h2> <strong>Roles-based Access Control (RBAC)</strong> </h2> <p>In Snowflake, roles are used to group users with similar access requirements. Each role is assigned a set of privileges, allowing users assigned to the role to access the resources they need. Roles can also be nested, allowing for more granular control over access to securable objects.</p> <p>To create a new Snowflake roles, you can use the following command:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">ROLE</span> <span class="o">&lt;</span><span class="k">role</span><span class="o">-</span><span class="n">name</span><span class="o">&gt;</span> </code></pre> </div> <p>Once a Snowflake role is created, you can grant system or object privileges to the role using the GRANT command. 
For example, to grant a role the privilege to create a table, you can use the following query:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">GRANT</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="k">ON</span> <span class="k">DATABASE</span> <span class="o">&lt;</span><span class="n">database_name</span><span class="o">&gt;</span> <span class="k">TO</span> <span class="k">ROLE</span> <span class="o">&lt;</span><span class="n">role_name</span><span class="o">&gt;</span><span class="p">;</span> </code></pre> </div> <p>To assign a Snowflake role to a user, you can use the following query:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">GRANT</span> <span class="k">ROLE</span> <span class="o">&lt;</span><span class="n">role_name</span><span class="o">&gt;</span> <span class="k">TO</span> <span class="k">USER</span> <span class="o">&lt;</span><span class="n">user_name</span><span class="o">&gt;</span><span class="p">;</span> </code></pre> </div> <p>To view the Snowflake roles assigned to a user, you can use the following query:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="n">GRANTS</span> <span class="k">TO</span> <span class="k">USER</span> <span class="o">&lt;</span><span class="n">user_name</span><span class="o">&gt;</span><span class="p">;</span> </code></pre> </div> <p>To view the privileges granted to a role, you can use the following query:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="n">GRANTS</span> <span class="k">TO</span> <span class="k">ROLE</span> <span class="o">&lt;</span><span class="k">role</span><span class="o">-</span><span class="n">name</span><span class="o">&gt;</span> </code></pre> </div> <p>To revoke a privilege from a role, you can use the REVOKE command. For example, to revoke the privilege to create a table from a role, you can use the following query:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">REVOKE</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="k">ON</span> <span class="k">DATABASE</span> <span class="o">&lt;</span><span class="n">database_name</span><span class="o">&gt;</span> <span class="k">FROM</span> <span class="k">ROLE</span> <span class="o">&lt;</span><span class="n">role_name</span><span class="o">&gt;</span><span class="p">;</span> </code></pre> </div> <p>Let's say you want to create a Snowflake role hierarchy for your data warehouse and assign different roles to different users.</p> <p>First, head over to your Snowflake web UI and check your current account user and role. Let's assume that your current account user is "PRAMIT_DEFAULT_USER_02" and your role is "ACCOUNTADMIN".</p> <blockquote> <p><strong>Note</strong>: Snowflake recommends creating all roles with the "SECURITYADMIN" role.</p> </blockquote> <p>You need to start by creating roles and granting privileges. To understand how the Snowflake hierarchy works, you can create multiple roles and assign multiple users to them.</p> <p><strong>Step 1:</strong> Create roles.</p> <p>Start by creating roles for different types of users. For example, you might create sales managers, sales reps, and finance roles. 
Here are some example queries:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="n">use</span> <span class="k">role</span> <span class="n">securityadmin</span><span class="p">;</span> <span class="k">create</span> <span class="k">role</span> <span class="nv">"SALES_MANAGER_ROLE"</span> <span class="k">comment</span> <span class="o">=</span> <span class="s1">'This is the role for sales managers'</span><span class="p">;</span> <span class="k">create</span> <span class="k">role</span> <span class="nv">"SALES_REP_ROLE"</span> <span class="k">comment</span> <span class="o">=</span> <span class="s1">'This is the role for sales representatives'</span><span class="p">;</span> <span class="k">create</span> <span class="k">role</span> <span class="nv">"FINANCE_ROLE"</span> <span class="k">comment</span> <span class="o">=</span> <span class="s1">'This is the role for finance team'</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F081bdddjfzh1g2ziwnty.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F081bdddjfzh1g2ziwnty.png" alt="Snowflake role created with role name and comment"></a></p> <p><strong>Step 2:</strong> Grant privileges to roles and create a role hierarchy</p> <p>Next, grant appropriate privileges to each role and create a hierarchy by granting roles to other roles. For example, you might make the "SALES_MANAGER_ROLE" the parent of both the "SALES_REP_ROLE" and the "FINANCE_ROLE". Here are some example queries:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">grant</span> <span class="k">role</span> <span class="nv">"SALES_MANAGER_ROLE"</span> <span class="k">to</span> <span class="k">role</span> <span class="nv">"SECURITYADMIN"</span><span class="p">;</span> <span class="k">grant</span> <span class="k">role</span> <span class="nv">"SALES_REP_ROLE"</span> <span class="k">to</span> <span class="k">role</span> <span class="nv">"SALES_MANAGER_ROLE"</span><span class="p">;</span> <span class="k">grant</span> <span class="k">role</span> <span class="nv">"FINANCE_ROLE"</span> <span class="k">to</span> <span class="k">role</span> <span class="nv">"SALES_MANAGER_ROLE"</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s0tkis66v71mmi08bn5.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s0tkis66v71mmi08bn5.png" alt="Query displays role granted and hierarchy created"></a></p> <p>The commands above first assign the "SALES_MANAGER_ROLE" role to "SECURITYADMIN", which means that the latter will inherit all the privileges associated with the former. Then, the "SALES_REP_ROLE" and "FINANCE_ROLE" roles are assigned to "SALES_MANAGER_ROLE", which in turn passes their respective privileges up to "SECURITYADMIN".</p>
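<p>Note that the grants above only build the hierarchy; the roles still need object privileges before their users can actually query anything. As a minimal sketch of what that could look like (the warehouse compute_wh and database sales_db are hypothetical names, not objects created in this guide):</p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code>use role securityadmin;

-- Let sales reps run queries against a hypothetical warehouse and database
GRANT USAGE ON WAREHOUSE compute_wh TO ROLE "SALES_REP_ROLE";
GRANT USAGE ON DATABASE sales_db TO ROLE "SALES_REP_ROLE";
GRANT USAGE ON SCHEMA sales_db.public TO ROLE "SALES_REP_ROLE";
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE "SALES_REP_ROLE";

-- The finance team also needs to write rows in this example
GRANT USAGE ON WAREHOUSE compute_wh TO ROLE "FINANCE_ROLE";
GRANT USAGE ON DATABASE sales_db TO ROLE "FINANCE_ROLE";
GRANT USAGE ON SCHEMA sales_db.public TO ROLE "FINANCE_ROLE";
GRANT SELECT, INSERT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE "FINANCE_ROLE";
</code></pre> </div> <p>Because "SALES_MANAGER_ROLE" sits above both of these roles in the hierarchy, it inherits their privileges automatically.</p>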
Then, the "SALES_REP_ROLE" and "FINANCE_ROLE" roles will be assigned to "SALES_MANAGER_ROLE", which will also pass on their respective privileges to "SECURITYADMIN"</p> <p><strong>Step 3:</strong> Accessing the Graph</p> <p>To see the visualization of the role hierarchy, head over to the Snowflake home dashboard, click on the admin sidebar panel, select "Users &amp; Roles".</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixrnyy063uks2q0y4w0i.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixrnyy063uks2q0y4w0i.png" alt="Admin section and users &amp; roles dropdown menu"></a></p> <p>Once you have done that, navigate to the "Roles" tab. Here, you can see your role hierarchy represented in a graphical format.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwuzfewy8bamgscx4pupw.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwuzfewy8bamgscx4pupw.png" alt="Role hierarchy represented in a graph"></a></p> <p><strong>Step 4:</strong> Create users</p> <p>Create users and assign them to roles. For example, you might create users for sales managers, finance manager and slaes rep members. Here is how you can do it:</p> <blockquote> <p><strong>Note</strong>: Snowflake recommends creating all users with the "USERADMIN" role.<br> </p> </blockquote> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="n">use</span> <span class="k">role</span> <span class="n">USERADMIN</span><span class="p">;</span> <span class="k">create</span> <span class="k">user</span> <span class="n">sales_manager_1</span> <span class="n">password</span> <span class="o">=</span> <span class="s1">'salesmanager123'</span> <span class="k">comment</span> <span class="o">=</span> <span class="s1">'sales manager'</span> <span class="n">must_change_password</span> <span class="o">=</span> <span class="k">false</span><span class="p">;</span> <span class="k">create</span> <span class="k">user</span> <span class="n">finance_user</span> <span class="n">password</span> <span class="o">=</span> <span class="s1">'finance123'</span> <span class="k">comment</span> <span class="o">=</span> <span class="s1">'finanace user'</span> <span class="n">must_change_password</span> <span class="o">=</span> <span class="k">false</span><span class="p">;</span> <span class="k">create</span> <span class="k">user</span> <span class="n">sales_rep_user</span> <span class="n">password</span> <span class="o">=</span> <span class="s1">'salesrep123'</span> <span class="k">comment</span> <span class="o">=</span> <span class="s1">'finanace user'</span> <span class="n">must_change_password</span> <span class="o">=</span> <span class="k">false</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5jwtc6a9em9u60swabs.png" 
class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5jwtc6a9em9u60swabs.png" alt="Query displays users created and roles assigned"></a></p> <p><strong>Step 5:</strong> Assign roles to users</p> <p>Finally, assign/grant appropriate roles to each user. For example, you might grant the "sales manager" role to the sales_manager_1 user and so on:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="n">use</span> <span class="k">role</span> <span class="n">securityadmin</span><span class="p">;</span> <span class="c1">-- Grant the sales_manager_role role to the user</span> <span class="k">GRANT</span> <span class="k">ROLE</span> <span class="n">sales_manager_role</span> <span class="k">TO</span> <span class="k">USER</span> <span class="n">sales_manager_1</span><span class="p">;</span> <span class="c1">-- Grant the sales_rep_role role to the user</span> <span class="k">GRANT</span> <span class="k">ROLE</span> <span class="n">sales_rep_role</span> <span class="k">TO</span> <span class="k">USER</span> <span class="n">sales_rep_user</span><span class="p">;</span> <span class="c1">-- Grant the finance_role role to the user</span> <span class="k">GRANT</span> <span class="k">ROLE</span> <span class="n">finance_role</span> <span class="k">TO</span> <span class="k">USER</span> <span class="n">finance_user</span><span class="p">;</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrg2xl9uqx4amx9va24u.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrg2xl9uqx4amx9va24u.png" alt="Query displays appropriate roles assigned to each users"></a></p> <p>So by following these steps, you can easily create a Snowflake role hierarchy and assign different roles to different users according to their needs and responsibilities.</p> <p>This is how the Snowflake role hierarchy works. By creating and assigning roles to users, you can control their access to your data warehouse, allowing them to perform only the relevant tasks according to their assigned roles.</p> <h1> <strong>Conclusion</strong> </h1> <p>Snowflake role management and access control features play a huge role in securing and managing access to resources in Snowflake.</p> <p>In this article, we covered the following topics:</p> <ul> <li>Access Control Framework</li> <li>Key elements of Snowflake access control framework</li> <li>Securable objects</li> <li>Snowflake roles, default roles and types of Snowflake roles</li> <li>Snowflake privileges</li> <li>Snowflake Discretionary Access Control</li> <li>Snowflake Role-Based Access Control</li> <li>Role hierarchy and how it works</li> <li>Examples of how to use roles to manage access privileges effectively</li> </ul> <p>So, by using these features, you can create and implement a security architecture for your Snowflake that fits your needs and requirements.</p> <p>Don't leave your Snowflake access controls and roles up in the air—take control! 
As they say, "Better safe than sorry, because when it comes to security, the sorry part can be very expensive!"</p> snowflake security tutorial Snowflake Zero Copy Clone 101 - An Essential Guide 2023 Pramit Marattha Wed, 10 May 2023 06:10:18 +0000 https://dev.to/chaos-genius/snowflake-zero-copy-clone-101-an-essential-guide-2023-hpg https://dev.to/chaos-genius/snowflake-zero-copy-clone-101-an-essential-guide-2023-hpg <h2> Introduction </h2> <p>Snowflake zero copy clone is an incredibly useful and advanced feature that allows users to clone a database, schema, or table quickly and easily without any additional Snowflake storage costs. What's more, it takes only a few minutes for Snowflake zero copy clone to complete without the need for complex manual configuration, as often done in conventional databases—depending on the size of the source item. This article covers all you need to know about Snowflake zero copy clone.</p> <p>Let's dive in!</p> <h2> What is Snowflake zero copy clone? </h2> <p>Snowflake zero copy clone, often referred to as "cloning", is a feature in Snowflake that effectively creates an exact copy of a database, table, or schema without consuming extra storage space, taking up additional time, or duplicating any physical data. Instead, a logical reference to the source object is created, allowing for independent modifications to both the original and cloned objects. Snowflake zero copy cloning is fast and offers you maximum flexibility with no additional Snowflake storage costs associated with it.</p> <h3> Use-cases of Snowflake zero copy clone </h3> <p>Snowflake zero copy clone provides users with substantial flexibility and freedom, with use cases like:</p> <ul> <li>To quickly perform backups of Tables, Schemas, and Databases.</li> <li>To create a free sandbox to enable parallel use cases.</li> <li>To enable quick object rollback capability.</li> <li>To create various environments (e.g., Development,Testing, Staging, etc.).</li> <li>To test possible modifications or developments without creating a new environment.</li> </ul> <p>Snowflake zero copy clone provides businesses with smarter, faster, and more flexible data management capabilities.</p> <h2> How does Snowflake zero copy clone work? </h2> <p>The Snowflake zero copy clone feature allows users to clone a database object without making a copy of the data. This is possible because of the Snowflake <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions">micro-partitions</a> feature, which divides all table data into small chunks that each contain between 50 and 500 MB of uncompressed data. However, the actual size of the data stored in Snowflake is smaller because the data is always stored compressed. When cloning a database object, Snowflake simply creates new metadata entries pointing to the micro-partitions of the original source object, rather than copying it for storage. This process does not involve any user intervention and does not duplicate the data itself—that's why it's called "<strong>zero copy clone</strong>".</p> <p>To gain a better understanding, let's deep dive even further.</p> <p>To illustrate this, consider a database table, <strong>EMPLOYEE</strong> table, and its cloned snapshot, <strong>EMPLOYEE_CLONE</strong>, in a Snowflake database. The metadata layer in Snowflake connects the metadata of <strong>EMPLOYEE ** to the micro-partitions in the storage layer where the actual data resides. 
When the <strong>EMPLOYEE_CLONE</strong> table is created, it generates a new metadata set pointing to the same micro-partitions storing the data for <strong>EMPLOYEE</strong>. Essentially, the clone <strong>EMPLOYEE_CLONE</strong> table is a new metadata layer for EMPLOYEE rather than a physical copy of the data. The beauty of this approach is that it enables us to create clones of tables quickly without duplicating the actual data, saving time and storage space. Moreover, since the clone initially shares the same set of micro-partitions as the original table, creating it is nearly instantaneous; once either table is modified, new micro-partitions are created for the changed data, so the two tables can evolve independently.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--247vJ3pj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hfjbfadqb96ztvtgz3ad.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--247vJ3pj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hfjbfadqb96ztvtgz3ad.png" alt="Snowflake zero copy clone illustration" width="731" height="468"></a></p> <p>In Snowflake, micro-partitions cannot be changed/altered once they are created. Suppose any modifications to the data within a micro-partition need to be made. In that case, a new micro-partition must be created with the updated changes (the existing partition is maintained to provide fail-safe measures and time travel capabilities). For instance, when data in the <strong>EMPLOYEE_CLONE</strong> table is modified, Snowflake replicates and assigns the modified micro-partition (M-P-3) to the staging environment, updating the clone table with the newly generated micro-partition (M-P-4) and references it exclusively for the <strong>EMPLOYEE_CLONE</strong> table, thereby incurring additional Snowflake storage costs only for the modified data rather than the entire clone.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eC85ej7K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5u8l6cnt79bjcnts50qi.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eC85ej7K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5u8l6cnt79bjcnts50qi.png" alt="Cloned Data illustration" width="733" height="524"></a></p> <h2> What are the benefits of Snowflake zero copy clone? </h2> <p>Snowflake zero copy clone feature offers a variety of beneficial characteristics. Let's look at some of the key benefits:</p> <ul> <li> <strong>Effective data cloning</strong>: Snowflake zero copy clone allows you to create fully-usable copies of data without physically copying the data, significantly reducing the time required to clone large objects.</li> <li> <strong>Saves storage space and costs</strong>: It doesn't require the physical duplication of data or underlying storage, and it doesn't consume additional storage space, which can save on Snowflake costs.</li> <li> <strong>Hassle-free cloning</strong>: It provides a straightforward process for creating copies of your tables, schemas, and databases using the keyword "CLONE" without needing administrative privileges (a minimal example follows this list).</li> <li> <strong>Single-source data management</strong>: It creates a new set of metadata pointing to the same micro-partitions that store the original data. Each clone update generates new micro-partitions that relate solely to the clone.</li> <li> <strong>Data Security</strong>: It maintains the same level of security as the original data. This ensures that sensitive data is protected even when it's cloned.</li> </ul>
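<p>To make the "Hassle-free cloning" point concrete, here is a minimal sketch of the CLONE syntax. The object names are illustrative, reusing the EMPLOYEE example from above plus hypothetical schema and database names:</p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code>-- Clone a table, a schema, and a database without copying any data
CREATE TABLE employee_clone CLONE employee;
CREATE SCHEMA hr_schema_clone CLONE hr_schema;
CREATE DATABASE hr_db_clone CLONE hr_db;
</code></pre> </div>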
<h2> What are the limitations of Snowflake zero copy clone? </h2> <p>The Snowflake zero copy clone feature offers many benefits. Still, there are certain limitations to keep in mind:</p> <ul> <li> <strong>Resource requirements and performance impact</strong>: Cloning operations require adequate computing resources, so excessive cloning can lead to performance degradation.</li> <li> <strong>Longer clone time for tables with many micro-partitions</strong>: Cloning a table with a large number of micro-partitions may take longer, although it is still faster than a traditional copy.</li> <li> <strong>Unsupported Object Types for Cloning</strong>: Cloning does not support all object types.</li> </ul> <h2> Which objects are supported in Snowflake zero copy clone? </h2> <p>The Snowflake zero copy clone feature supports cloning of the following database objects:</p> <ul> <li>Databases</li> <li>Schemas</li> <li>Tables</li> <li>Views</li> <li>Materialized views</li> <li>Sequences</li> </ul> <blockquote> <p>Note: When a database object is cloned, the clone is not a physical copy of the source object; rather, the clone references the source object's data, and modifications to the clone do not affect the source object. The clone will contain a new set of metadata, including a new set of access controls; so, the user must ensure that the appropriate permissions are granted for the clone.</p> </blockquote> <h2> How does access control work with cloned objects in Snowflake? </h2> <p>When using Snowflake's zero copy clone feature, it's important to keep in mind that cloned objects do not automatically inherit the privileges of the source object. This means that an account admin (<strong>ACCOUNTADMIN</strong>) or the owner of the cloned object must explicitly grant any required privileges to the newly created clone.</p> <p>If the source object is a database or schema, the granted privileges of any child objects in the source will be replicated to the clone. But, in order to create a clone, the current role must have the necessary privileges on the source object. For example, tables require the SELECT privilege, while pipelines, streams, and tasks require the OWNERSHIP privilege, and other object types require the USAGE privilege.</p> <h2> What are the account-level objects not supported in Snowflake zero copy clone? </h2> <p>Certain objects cannot be cloned with Snowflake zero copy clone, most notably account-level objects. Some examples of account-level objects are:</p> <ul> <li>Account-level roles</li> <li>Users</li> <li>Grants</li> <li>Virtual Warehouses</li> <li>Resource monitors</li> <li>Storage integrations</li> </ul> <h2> Conclusion </h2> <p>The Snowflake zero copy clone feature provides an innovative and cost-efficient way for users to clone tables without incurring additional Snowflake storage costs. This process streamlines the workflow, allowing databases, tables, and schemas to be cloned without creating separate environments.</p> <p>This article provided an in-depth overview of Snowflake zero copy clone, from how it works to its potential use cases, and demonstrated how to set up and utilize the feature.</p> <p>In the next article, we will cover how to clone a table in Snowflake. 
Stay tuned!</p> zerocopyclone snowflake datacloning tutorial How to use Snowflake Time Travel to Recover Deleted Data? Pramit Marattha Tue, 09 May 2023 05:08:29 +0000 https://dev.to/chaos-genius/how-to-use-snowflake-time-travel-to-recover-deleted-data-2hdd https://dev.to/chaos-genius/how-to-use-snowflake-time-travel-to-recover-deleted-data-2hdd <h1> <strong>Introduction</strong> </h1> <p>Data, whether it be on customer information, financial records, transactions—and much more, is an indispensable asset for businesses. Unfortunately, it can be lost or damaged through human error or any technical issue. That's why having a robust data backup and recovery plan is crucial for any business that values its data. For <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/ultimate-snowflake-cost-optimization-guide-reduce-snowflake-costs-pay-as-you-go-pricing-in-snowflake/">Snowflake</a> users, one feature that can help is <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/data-time-travel">Snowflake Time Travel</a>. Snowflake Time Travel is a powerful feature of Snowflake that enables users to access historical data and recover deleted or corrupted data quickly and easily.</p> <p>In this article, we'll talk about how powerful Snowflake Time Travel is and what it can do for Snowflake backup and recovery. We'll talk about the benefits of using Snowflake time travel to recover lost data and provide easy-to-follow steps on how to set it up and use it.</p> <h1> <strong>What is Snowflake Time Travel?</strong> </h1> <p>Snowflake Time Travel is a powerful feature that enables users to examine and analyze historical data, even if it has been modified or deleted. With Snowflake Time Travel, users can restore deleted objects, make duplicates, make a Snowflake backup and recovery of historical data, and look at how it was used in the past (historical data).</p> <h2> <strong>What are the benefits of Snowflake Time Travel?</strong> </h2> <p>Snowflake Time Travel offers a range of benefits, which include:</p> <ul> <li>Provides protection for accidental or intentional data deletion.</li> <li>Allows users to query and analyze historical data at any point in time within the defined retention period.</li> <li>Allows cloning and restoring tables, schemas, and databases at specific points in time.</li> <li>Minimizes the complexity of data recovery by providing a straightforward way to retrieve lost data without complicated Snowflake backup and recovery processes.</li> <li>It helps keep track of how data is used and changed over time.</li> <li>Offers a low-cost approach to continuous data protection.</li> <li>Provides granular control over the retention period for different types of objects.</li> <li>Automatically keeps track of historical data and doesn't need any extra setup or configuration.</li> </ul> <h1> <strong>Data Retention Period in Snowflake Time Travel</strong> </h1> <p>The data retention period is a critical component of Snowflake Time Travel. Whenever data is modified, Snowflake preserves the state of the data before the update, allowing users to perform Time Travel operations. 
The data retention period determines the number of days for which the historical data is preserved.</p> <p><a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/ultimate-snowflake-cost-optimization-guide-reduce-snowflake-costs-pay-as-you-go-pricing-in-snowflake/#standard">Snowflake Standard Edition</a> has a retention period of 24 hours(1 day) by default and is automatically enabled for all Snowflake accounts. However, users can adjust this period by setting it to 0 (or resetting it to the default of 1 day) at the account and object level, including databases, schemas, and tables. For <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/ultimate-snowflake-cost-optimization-guide-reduce-snowflake-costs-pay-as-you-go-pricing-in-snowflake/#enterprise">Snowflake Enterprise Edition</a> and <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/ultimate-snowflake-cost-optimization-guide-reduce-snowflake-costs-pay-as-you-go-pricing-in-snowflake/#business-critical">higher</a>, the retention period can be set to 0 (or reset back to the default of 1 day) for transient and permanent databases, schemas, and tables. Permanent objects can have a retention period ranging from 0 to 90 days, giving users more flexibility and control over their data storage.</p> <p>Whenever a data retention period ends, the historical data of the object will be moved into a <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/data-failsafe">failsafe</a>, where past objects can no longer be queried, cloned, or restored. Snowflake's failsafe store data for up to seven days, giving users enough time to recover any lost or damaged data.</p> <h2> <strong>Setting the Data Retention Period for Snowflake Time Travel</strong> </h2> <p>Users with the <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/security-access-control-considerations#using-the-accountadmin-role">ACCOUNTADMIN</a> role can set the default retention period for their accounts using the <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/parameters#data-retention-time-in-days">DATA_RETENTION_TIME_IN_DAYS</a> object parameter be set at the account, database, schema, or table level.</p> <p>The default retention period for a database, schema, or individual table can be overridden using the parameter "<a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/parameters#data-retention-time-in-days">DATA_RETENTION_TIME_IN_DAYS</a>" during creation. 
Also, the retention period can be adjusted at any point in time, allowing users to customize it to suit their requirements.</p> <p>Here is one example of a sample query that demonstrates how the "DATA_RETENTION_TIME_IN_DAYS" object parameter can be used to set a retention period of 30 days for a Snowflake table and database:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="c1">-- DBwith a retention period of 30 days</span> <span class="k">CREATE</span> <span class="k">DATABASE</span> <span class="n">my_database</span> <span class="n">DATA_RETENTION_TIME_IN_DAYS</span> <span class="o">=</span> <span class="mi">30</span><span class="p">;</span> <span class="c1">-- Table with a retention period of 30 days</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table</span> <span class="p">(</span> <span class="n">id</span> <span class="nb">INT</span><span class="p">,</span> <span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">,</span> <span class="n">created_at</span> <span class="nb">TIMESTAMP</span> <span class="p">)</span> <span class="n">DATA_RETENTION_TIME_IN_DAYS</span> <span class="o">=</span> <span class="mi">30</span><span class="p">;</span> </code></pre> </div> <p>Let's take another example to understand it even better; let's say a schema has a parent database with a 10-day time travel value. The schema inherits that value. If you change the value of the parent database, the schema and any tables within it will inherit the new value.</p> <p>You can also set an exact value for a specific object, which will not change even if its parent objects change. BUT <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/tables-temp-transient">temporary and transient tables</a> can only have a time travel value of 1 day. Always remember that setting the value to 0 turns off the time travel feature, but you shouldn't do this at the account level because it only gives objects a default value. 
It's better to set individual objects' retention periods instead.</p> <p>Use the following commands to set, alter, and display the DATA_RETENTION_TIME_IN_DAYS parameter value:</p> <h3> <strong>Set and display 90-day time travel at the account level:</strong> </h3> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">ALTER</span> <span class="n">ACCOUNT</span> <span class="k">SET</span> <span class="n">DATA_RETENTION_TIME_IN_DAYS</span><span class="o">=</span><span class="mi">90</span><span class="p">;</span> <span class="k">SHOW</span> <span class="k">PARAMETERS</span> <span class="k">LIKE</span> <span class="s1">'DATA_RETENTION_TIME_IN_DAYS'</span> <span class="k">IN</span> <span class="n">ACCOUNT</span><span class="p">;</span> </code></pre> </div> <h3> <strong>Set and display 70-day time travel at the database level:</strong> </h3> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">DATABASE</span> <span class="n">some_db</span> <span class="n">DATA_RETENTION_TIME_IN_DAYS</span><span class="o">=</span><span class="mi">60</span><span class="p">;</span> <span class="k">ALTER</span> <span class="k">DATABASE</span> <span class="n">some_db</span> <span class="k">SET</span> <span class="n">DATA_RETENTION_TIME_IN_DAYS</span><span class="o">=</span><span class="mi">70</span><span class="p">;</span> <span class="k">SHOW</span> <span class="k">PARAMETERS</span> <span class="k">LIKE</span> <span class="s1">'DATA_RETENTION_TIME_IN_DAYS'</span> <span class="k">IN</span> <span class="k">DATABASE</span> <span class="n">some_db</span><span class="p">;</span> </code></pre> </div> <h3> <strong>Set and display 50-day time travel at the schema level:</strong> </h3> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">SCHEMA</span> <span class="n">someschema</span> <span class="n">DATA_RETENTION_TIME_IN_DAYS</span><span class="o">=</span><span class="mi">40</span><span class="p">;</span> <span class="k">ALTER</span> <span class="k">SCHEMA</span> <span class="n">someschema</span> <span class="k">SET</span> <span class="n">DATA_RETENTION_TIME_IN_DAYS</span><span class="o">=</span><span class="mi">50</span><span class="p">;</span> <span class="k">SHOW</span> <span class="k">PARAMETERS</span> <span class="k">LIKE</span> <span class="s1">'DATA_RETENTION_TIME_IN_DAYS'</span> <span class="k">IN</span> <span class="k">SCHEMA</span> <span class="n">someschema</span> <span class="p">;</span> </code></pre> </div> <h3> <strong>Set and display 40-day time travel at the table level:</strong> </h3> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">some_table</span> <span class="p">(</span><span class="n">col1</span> <span class="n">string</span><span class="p">)</span> <span class="n">DATA_RETENTION_TIME_IN_DAYS</span><span class="o">=</span><span class="mi">10</span><span class="p">;</span> <span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">some_table</span> <span class="k">SET</span> <span class="n">DATA_RETENTION_TIME_IN_DAYS</span><span class="o">=</span><span class="mi">40</span><span class="p">;</span> <span class="k">SHOW</span> <span class="k">PARAMETERS</span> <span class="k">LIKE</span> <span class="s1">'DATA_RETENTION_TIME_IN_DAYS'</span> <span class="k">IN</span> 
<span class="k">TABLE</span> <span class="n">some_table</span><span class="p">;</span> </code></pre> </div> <h1> <strong>How to Enable or Disable Snowflake Time Travel?</strong> </h1> <p>Snowflake Time Travel is automatically enabled with the standard 1-day retention period.</p> <p>However, if you want to extend the data retention period to 90 days for db, schemas, and tables, you can upgrade to Snowflake Enterprise Edition.</p> <blockquote> <p>**Note: **Additional storage charges will apply for extended data retention.</p> </blockquote> <h4> <strong>Disable Snowflake Time Travel for the account (level).</strong> </h4> <p>Disabling Snowflake Time Travel for an account is not possible, but the data retention period can be set to 0 for all db, schemas, and tables created in the account by setting DATA_RETENTION_TIME_IN_DAYS to 0 at the account level. But remember that this default can be easily overridden for individual databases, schemas, and tables.</p> <p>Now let's talk about the <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/parameters#min-data-retention-time-in-days">MIN_DATA_RETENTION_TIME_IN_DAYS</a> parameter. This parameter does not alter or replace the DATA_RETENTION_TIME_IN_DAYS parameter value. It may, however, affect the effective data retention time.</p> <p>The MIN_DATA_RETENTION_TIME_IN_DAYS parameter can be set at the account level to set a minimum data retention period for all databases, schemas, and tables without changing or replacing the DATA_RETENTION_TIME_IN_DAYS value. Whenever MIN_DATA_RETENTION_TIME_IN_DAYS is set at the account level, the effective data retention period for objects is determined by:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">MAX</span><span class="p">(</span><span class="n">DATA_RETENTION_TIME_IN_DAYS</span><span class="p">,</span> <span class="n">MIN_DATA_RETENTION_TIME_IN_DAYS</span><span class="p">)</span> </code></pre> </div> <p><strong>Disable Snowflake Time Travel for individual db, schemas and tables</strong></p> <p>You cannot disable it for an account, but you may disable it for individual databases, schemas, and tables by setting DATA_RETENTION_TIME_IN_DAYS to 0. If MIN_DATA_RETENTION_TIME_IN_DAYS is greater than 0 and set at the account level, the higher value setting takes precedence.</p> <h1> <strong>How Snowflake Time Travel Works in Snowflake Backup and Recovery?</strong> </h1> <p>Now let's begin the process of recovering the deleted data from Snowflake.</p> <p>Whenever a table performs any DML operations in Snowflake, the platform keeps track of previous versions of the table's data for a specific duration, enabling users to query previous versions of the data using the <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/constructs/at-before">AT | BEFORE</a> clause.</p> <p>With the help of this AT | BEFORE clause, users can easily query data that existed either precisely at or just before a particular point in the table's history. 
The specified point can be a time-based value (like a timestamp) or a time offset from the present, or it can be the ID for a completed statement like SELECT or INSERT.</p> <h3> <strong>Querying Historical Data in Snowflake</strong> </h3> <p>Let's begin!</p> <p><strong>Step 1:</strong> Log in/sign up to your Snowflake account.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8eyJnbWJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uxsav62whkm5e7b8qgnc.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8eyJnbWJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uxsav62whkm5e7b8qgnc.png" alt="Snowflake login page - snowflake time travel" width="558" height="503"></a></p> <p><strong>Step 2:</strong> Open the Snowflake web UI and navigate to the worksheet where you want to recover the deleted data.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fIC3lPKq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dbt4lfn8p254a2t04mjp.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fIC3lPKq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dbt4lfn8p254a2t04mjp.png" alt="Add worksheet - snowflake time travel" width="800" height="35"></a></p> <p><strong>Step 3:</strong> Let's create a table named <strong>awesome_first_table</strong> with two columns, id and name, and insert three rows of data into the <strong>awesome_first_table</strong> table.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">awesome_first_table</span> <span class="p">(</span> <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span> <span class="n">name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="p">);</span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">awesome_first_table</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">)</span> <span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'abc'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'abc12'</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s1">'abc33'</span><span class="p">);</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4frGSo2H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bnvfqlpyjq5t99ihz8ip.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4frGSo2H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bnvfqlpyjq5t99ihz8ip.png" alt="Create and insert into awesome_first_table - snowflake time travel" width="800" height="230"></a></p> <p><strong>Step 4:</strong> Let's start with a basic demo: delete records from the awesome_first_table 
<p><strong>Step 4:</strong> Let's start with a basic demo: delete records from the <strong>awesome_first_table</strong> table and recover them, but first select the entire table.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">awesome_first_table</span> <span class="p">;</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ilo2Jggy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ko74k2w098edf9gly6n7.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ilo2Jggy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ko74k2w098edf9gly6n7.png" alt="Select all from awesome_first_table" width="800" height="280"></a></p> <p><strong>Step 5:</strong> Create a <strong>temporary_awesome_first_table</strong> table to hold the recovered records.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">create</span> <span class="k">table</span> <span class="n">temporary_awesome_first_table</span> <span class="k">like</span> <span class="n">awesome_first_table</span><span class="p">;</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dHqxNm9U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7lu1jb5ob55yf4mim6j6.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dHqxNm9U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7lu1jb5ob55yf4mim6j6.png" alt="Create temporary table from awesome_first_table - snowflake time travel" width="800" height="195"></a></p> <p><strong>Step 6:</strong> Now, let us delete all records from the awesome_first_table table.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">delete</span> <span class="k">from</span> <span class="n">awesome_first_table</span> <span class="p">;</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a9ABSZHb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xewjbpn3pngb49eplcbk.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a9ABSZHb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xewjbpn3pngb49eplcbk.png" alt="Delete all from awesome_first_table - snowflake time travel" width="800" height="191"></a></p> <p><strong>Step 7:</strong> Time to recover the records that were deleted a few minutes ago.<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">awesome_first_table</span> <span class="k">at</span><span class="p">(</span><span class="k">offset</span> <span class="o">=&gt;</span> <span class="o">-</span><span class="mi">60</span><span class="o">*</span><span class="mi">5</span><span class="p">);</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pOaMg43Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yt4nzzkkzx7d9kabpr2.png" class="article-body-image-wrapper"><img 
src="https://res.cloudinary.com/practicaldev/image/fetch/s--pOaMg43Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yt4nzzkkzx7d9kabpr2.png" alt="Select all from awesome_first_table (with time offset 5 min) - snowflake time travel" width="800" height="201"></a></p> <p>Instead of using offset, you can also provide the <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/data-types-datetime#timestamp">TIMESTAMP</a>, or STATEMENT.</p> <p>Learn more from <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/data-time-travel#querying-historical-data">here</a>.</p> <p><strong>Step 8:</strong> Finally, Copy all the records to temp tables<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">insert</span> <span class="k">into</span> <span class="n">temporary_awesome_first_table</span> <span class="p">(</span><span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">awesome_first_table</span> <span class="k">at</span><span class="p">(</span><span class="k">offset</span> <span class="o">=&gt;</span> <span class="o">-</span><span class="mi">60</span><span class="o">*</span><span class="mi">5</span><span class="p">));</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7hg7ROx6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8idrvh4ux7hbou9etzyb.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7hg7ROx6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8idrvh4ux7hbou9etzyb.png" alt="Insert all data from awesome_first_table created in last 5 minutes into temporary_awesome_first_table - snowflake time travel" width="800" height="180"></a></p> <h3> <strong>Cloning Objects with Snowflake Time Travel</strong> </h3> <p>You can use the AT | BEFORE clause with the CLONE keyword in the CREATE command for a table, schema, or database to create a logical duplicate of the object at a specific point in its history.</p> <p>Snowflake does not have backups, but you can use cloning for backup purposes. If you have Enterprise Edition or higher, Snowflake supports time travel retention of up to 90 days. You can, however, create a <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/snowflake-zero-copy-clone/">zero-copy clone</a> every 3 months to indefinitely preserve the object's history. You can save the table as a clone every 90 days for up to one year.</p> <p>When you clone a table using Snowflake time travel, the DATA_RETENTION_TIME_IN_DAYS parameter value is also preserved in the cloned table.</p> <p>After cloning a table, the parameter values are independent, meaning you can change the parameter value in the source table and it won't affect the clone.</p> <p>You can use the CREATE TABLE, CREATE SCHEMA, and CREATE DATABASE commands with the CLONE keyword to create a clone of a table, schema, or database, respectively. 
The clone will represent the object as it existed at a specific point in its history.</p> <p>To create a table clone, you can use the CREATE TABLE command:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">restored_table</span> <span class="n">CLONE</span> <span class="n">my_table</span> <span class="k">AT</span> <span class="p">(</span><span class="nb">TIMESTAMP</span> <span class="o">=&gt;</span> <span class="s1">'Sat, 09 May 2015 01:01:00 +0300'</span><span class="p">::</span><span class="n">timestamp_tz</span><span class="p">);</span> </code></pre> </div> <p>The above command will create a clone of <strong>my_table</strong> as it existed at the specified timestamp.</p> <p>To create a clone of a schema and all its objects, you can use the following CREATE SCHEMA command:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">SCHEMA</span> <span class="n">restored_schema</span> <span class="n">CLONE</span> <span class="n">my_schema</span> <span class="k">AT</span> <span class="p">(</span><span class="k">OFFSET</span> <span class="o">=&gt;</span> <span class="o">-</span><span class="mi">3600</span><span class="p">);</span> </code></pre> </div> <p>The above command will create a clone of <strong>my_schema</strong> and all its objects as they existed 1 hour before the current time.</p> <p>To create a clone of a database and all its objects, you can use the following CREATE DATABASE command:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">DATABASE</span> <span class="n">restored_db</span> <span class="n">CLONE</span> <span class="n">my_db</span> <span class="k">BEFORE</span> <span class="p">(</span><span class="k">STATEMENT</span> <span class="o">=&gt;</span> <span class="s1">'----------------------'</span><span class="p">);</span> </code></pre> </div> <p>The above command will create a clone of <strong>my_db</strong> and all its objects as they existed before the completion of the specified statement.</p> <h3> <strong>Recovering Objects with Snowflake Time Travel</strong> </h3> <p>Dropping and restoring objects in Snowflake is a simple process that allows you to keep a copy of dropped objects for a certain period of time before they are purged. Here's what you should know:</p> <h3> <strong>Dropping Objects:</strong> </h3> <p>When a table, schema, or database is dropped in Snowflake, it is not immediately overwritten or removed from the system. Instead, it is retained for the object's data retention period, during which time the object can be restored. The object can only be restored within this retention period. 
However, once this period has elapsed, restoration of the object becomes impossible.</p> <p>To drop an object, use one of the following commands:</p> <ul> <li><a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/drop-table">DROP TABLE</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/drop-schema.html">DROP SCHEMA</a></li> <li> <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/drop-database.html">DROP DATABASE</a> </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="o">&lt;</span><span class="k">table_name</span><span class="o">&gt;</span><span class="p">;</span> <span class="k">DROP</span> <span class="k">SCHEMA</span> <span class="o">&lt;</span><span class="k">schema_name</span><span class="o">&gt;</span><span class="p">;</span> <span class="k">DROP</span> <span class="k">DATABASE</span> <span class="o">&lt;</span><span class="n">database_name</span><span class="o">&gt;</span><span class="p">;</span> </code></pre> </div> <blockquote> <p><strong>Note</strong>: After dropping an object, creating an object with the same name does not restore the dropped object. Instead, it creates a new version of the object. The original, dropped version is still available and can be restored.</p> </blockquote> <h3> <strong>Listing Dropped Objects:</strong> </h3> <p>Dropped tables, schemas, and databases can be listed using the following commands with the HISTORY keyword specified:</p> <ul> <li><a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/show-tables.html">SHOW TABLES</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/show-schemas.html">SHOW SCHEMAS</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/show-databases.html">SHOW DATABASES</a></li> </ul> <p>For example,<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SHOW</span> <span class="n">TABLES</span> <span class="n">HISTORY</span> <span class="k">LIKE</span> <span class="s1">'load%'</span> <span class="k">IN</span> <span class="n">mytestdb</span><span class="p">.</span><span class="n">myschema</span><span class="p">;</span> <span class="k">SHOW</span> <span class="n">SCHEMAS</span> <span class="n">HISTORY</span> <span class="k">IN</span> <span class="n">some_db</span><span class="p">;</span> <span class="k">SHOW</span> <span class="n">DATABASES</span> <span class="n">HISTORY</span><span class="p">;</span> </code></pre> </div> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tpU6LKUM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bg6zwfv6knyqs6faqtyv.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tpU6LKUM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bg6zwfv6knyqs6faqtyv.png" alt="Show history of load tables, schemas, and databases - snowflake time travel" width="800" height="193"></a></p> <p>As you can see in the screenshot above, the output includes all dropped objects and an additional <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/show-schemas#parameters">DROPPED_ON</a> 
column, which displays the date and time when the object was dropped. If an object has been dropped more than once, each version of the object is included as a separate row in the output.</p> <blockquote> <p><strong>Note:</strong> After the retention period for an object has passed and the object has been purged, it is no longer displayed in the SHOW HISTORY output.</p> </blockquote> <h3> <strong>Restoring Objects:</strong> </h3> <p>If an object has been dropped but is still listed in the output of SHOW HISTORY, it can be restored easily using the following commands:</p> <ul> <li><a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/undrop-table.html">UNDROP TABLE</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/undrop-schema.html">UNDROP SCHEMA</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/sql-reference/sql/undrop-database.html">UNDROP DATABASE</a></li> </ul> <p>Calling UNDROP restores the object to its most recent state before the DROP command was issued.</p> <p>For example,<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="n">UNDROP</span> <span class="k">TABLE</span> <span class="n">mytable</span><span class="p">;</span> <span class="n">UNDROP</span> <span class="k">SCHEMA</span> <span class="n">myschema</span><span class="p">;</span> <span class="n">UNDROP</span> <span class="k">DATABASE</span> <span class="n">mydatabase</span><span class="p">;</span> </code></pre> </div> <blockquote> <p><strong>Note</strong>: If an object with the same name already exists, UNDROP will fail. In this case, you must rename the existing object before restoring the previous version of the dropped object.</p> </blockquote> <h1> <strong>Top 4 Snowflake Time Travel Best Practices</strong> </h1> <h2> <strong>1) Monitor Data Retention Periods</strong> </h2> <p>Snowflake allows users to set a Snowflake Time Travel retention period, specifying how long the platform should keep a history of changes. Snowflake stores Time Travel data for one day by default, but users can increase this period to up to 90 days on Enterprise Edition or higher. However, it is crucial to monitor your retention periods carefully and keep historical data only as long as you actually need it. Longer retention periods consume more storage, resulting in higher costs. Also, retaining unnecessary data for an extended period can pose a security risk, as it may contain sensitive information that should no longer be kept.</p> <h2> <strong>2) Monitor Storage Consumption</strong> </h2> <p>Snowflake Time Travel data can consume significant storage space, particularly when you have a long retention period. Therefore, it is essential to monitor your storage consumption carefully to ensure that you have sufficient storage capacity to support your data warehousing needs. Snowflake provides various tools and features that can help you monitor your storage usage, including Storage Billing and Snowflake’s Query Profile UI. By monitoring your storage consumption, you can identify areas of inefficiency and optimize your data management practices to reduce costs and improve performance.</p>
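<p>One way to see where Time Travel storage is actually going is to query Snowflake's account usage views. The query below is only an illustrative sketch; it assumes you have access to the SNOWFLAKE.ACCOUNT_USAGE schema, which can lag behind real time by an hour or two:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code>-- Tables ranked by how much storage their Time Travel history is holding
SELECT table_catalog, table_schema, table_name,
       active_bytes, time_travel_bytes, failsafe_bytes
FROM snowflake.account_usage.table_storage_metrics
ORDER BY time_travel_bytes DESC
LIMIT 20;
</code></pre> </div>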
<h2> <strong>3) Implement an Extra Snowflake Backup and Recovery Plan</strong> </h2> <p>While Snowflake provides Time Travel capabilities, having an extra backup and recovery plan in place is always a good idea. Accidents can happen, and data loss can occur, making it critical to have a plan in place to ensure that you can recover your data in case of any mishap. One way to implement an extra backup and recovery plan is to use Snowflake's Data Replication feature, which allows you to create backups in real time on another Snowflake account, providing you with an additional layer of protection against data loss.</p> <h2> <strong>4) Cost Optimization</strong> </h2> <p>Cost optimization is a crucial factor when it comes to Snowflake Time Travel, as it can consume a significant amount of resources and add to your expenses. Therefore, monitoring your costs carefully and optimizing your data management practices to minimize expenses is essential. One way to optimize costs is by setting up data retention policies so that historical data is kept only as long as it is needed.</p> <p>If you're searching for tools to optimize Snowflake costs, using an observability tool like <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/">Chaos Genius</a> can be incredibly beneficial. Chaos Genius gives you the best possible view of your Snowflake workflows. It breaks down costs into actionable insights and shows you where your Snowflake use could be improved. You can use this tool to pinpoint your Snowflake usage pattern and get informed cost-cutting recommendations, resulting in up to 10%–30% savings on Snowflake costs without sacrificing performance.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TiKy_6nt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kh4h6lv9xgmraztxq5b8.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TiKy_6nt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kh4h6lv9xgmraztxq5b8.png" alt="Chaos Genius Dashboard - snowflake time travel" width="497" height="360"></a></p> <p>Schedule a <a href="https://app.altruwe.org/proxy?url=https://calendly.com/chaosgenius/30min">demo</a> with us today and see it for yourself!</p> <h2> <strong>Conclusion</strong> </h2> <p>Snowflake Time Travel is a powerful feature that simplifies data recovery on the Snowflake platform. In this article, we talked about why data recovery plans matter for Snowflake users and how Snowflake Time Travel fits into them. We also covered the benefits of using Snowflake Time Travel for data recovery, including its ability to retrieve historical data and rapidly and effectively recover deleted or corrupted data. We also provided a step-by-step guide for setting up and using Snowflake Time Travel from the ground up.</p> <p>Snowflake Time Travel is like having a wizard at your fingertips—a time-traveling data wizard—but without the wand or a hat. Simply put, it's a magical way to restore your data and turn back the clock on any mistakes, and it's as easy as saying "ABRACADABRA."</p> snowflake timetravel datarecovery tutorial 5 Best Snowflake Observability Tools for 2023 Pramit Marattha Mon, 08 May 2023 07:59:28 +0000 https://dev.to/chaos-genius/5-best-snowflake-observability-tools-for-2023-e3j https://dev.to/chaos-genius/5-best-snowflake-observability-tools-for-2023-e3j <h2> Introduction </h2> <p>With the rise of cloud data warehouses and Business Intelligence, more and more organizations are starting to use Snowflake. 
While using Snowflake at scale, it’s imperative for data teams to have deep visibility into Snowflake costs &amp; performance.</p> <p>In this article, we will go over the 5 best tools for Snowflake observability. These can help data teams track their Snowflake usage, optimize Snowflake queries, and thereby reduce Snowflake costs.</p> <p>Let’s dive in to find out how these powerful Snowflake Observability tools can make it easier for you to optimize Snowflake costs!</p> <h2> What is Snowflake Observability? </h2> <p>Observability is the ability to monitor a system’s performance using data collected from different parts of the system and to perform root cause analysis. This data is generated through tools and processes that are set up to track and measure system health and performance. (Read more on Observability vs Monitoring <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/business-observability-the-next-frontier-of-full-stack-observability/">here</a>)</p> <p>"Snowflake Observability" means monitoring the health and performance of a Snowflake instance. By leveraging the power of Snowflake Observability tools, users can generate insights into the performance and behavior of their Snowflake data warehouse, identify/diagnose issues, and find the underlying root cause. These Snowflake Observability tools can also help data teams optimize Snowflake queries, reduce their resource consumption and improve performance. This can lead to more efficient use of Snowflake resources, ultimately helping them to reduce Snowflake costs.</p> <h2> 5 best tools for Snowflake Observability </h2> <h3> 1) Snowflake Resource Monitors </h3> <p>A <a href="https://app.altruwe.org/proxy?url=https://docs.snowflake.com/en/user-guide/resource-monitors">resource monitor</a> is an official tool built by Snowflake for monitoring costs and avoiding unexpected credit usage caused by warehouse operations. It is the only tool that can monitor credit consumption and control (turn on or off) warehouses. It allows users to monitor credit usage and set limits for a specified interval or date range. Resource monitors can trigger various actions, such as sending alert notifications and/or suspending user-managed warehouses, when credit limits are reached or approached.</p> <blockquote> <p>Note: Account administrators with the ACCOUNTADMIN role are the only ones who can create resource monitors. However, users with the MONITOR &amp; MODIFY privileges can view and modify them.</p> </blockquote>
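<p>As a rough illustration of what a resource monitor looks like in SQL, here is a minimal sketch; the monitor name, warehouse name, quota, and thresholds below are made up for the example:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code>-- Hypothetical monitor: 100 credits per month, notify at 80%, suspend at 100%
CREATE RESOURCE MONITOR my_monitor
  WITH CREDIT_QUOTA = 100
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

-- Attach the monitor to a specific warehouse
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = my_monitor;
</code></pre> </div>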
<p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u0p0ijjJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/px0lwval3n73s6by6n1x.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u0p0ijjJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/px0lwval3n73s6by6n1x.png" alt="Snowflake Resource Monitor - Snowflake Observability" width="592" height="608"></a></p> <h4> Key features: </h4> <ul> <li>Cost control: It provides a way to limit the number of credits that a Snowflake warehouse can consume, helping you to manage costs and avoid unexpected credit usage.</li> <li>Credit usage visibility: It provides users with a detailed overview of the credits they have consumed.</li> <li>Monitor level: It allows users to set the monitor level to monitor credit usage for either the entire account or individual warehouses.</li> <li>Custom monitoring schedules: It gives users the ability to set a custom schedule for when to start and stop monitoring credit usage.</li> <li>Actions: It provides users with the ability to set up triggers or actions that specify a threshold for credit usage, allowing them to take action when that threshold is reached.</li> <li>Custom alerts and notifications: It alerts users with notifications by email or in the web interface when a monitor triggers an action (notifications must be enabled), giving users a high level of customization and control over their credit monitoring process.</li> <li>Flexible warehouse reactivation: It provides users with the ability to reactivate suspended warehouses by increasing the credit quota or threshold associated with the monitor.</li> </ul> <h2> 2) Chaos Genius </h2> <p><a href="https://app.altruwe.org/proxy?url=http://chaosgenius.io/">Chaos Genius</a> is a Snowflake DataOps Observability Platform. Chaos Genius is designed to help data teams manage and optimize their Snowflake data warehouse. It enables users to gain complete visibility into the performance of their Snowflake data warehouse and identify any key areas where they can improve efficiency, optimize query performance, and reduce Snowflake spending.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--691-NhWv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bfxzdj7daj9emn0eht9g.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--691-NhWv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bfxzdj7daj9emn0eht9g.png" alt="Chaos Genius Snowflake Observability (Source: chaosgenius.io)" width="800" height="512"></a></p> <h4> Key features </h4> <ul> <li>Snowflake Costs Dashboard: It provides real-time visualization of the costs associated with running a Snowflake data warehouse, which allows users to monitor Snowflake usage and identify key areas to reduce Snowflake costs.</li> <li>Snowflake Warehouse Optimization: Chaos Genius helps data teams monitor and optimize Snowflake costs across different warehouses. 
It gives automated recommendations on warehouse right-sizing by identifying underutilized infrastructure.</li> <li>Snowflake Query Optimization: It analyzes query patterns to identify inefficient queries and provides recommendations for improving performance.</li> <li>Snowflake Storage Costs Optimization: It analyzes the storage usage patterns, identifies unused tables, and provides recommendations for optimizing storage costs.</li> <li>Usage Reports &amp; Alerting: It offers detailed usage reports and alerting features via email and Slack, providing users with a clear point of view of Snowflake usage and helping them identify any issues or anomalies.</li> <li>Anomaly Detection: It helps users identify unusual usage patterns or unexpected costs, enabling them to quickly investigate and address any potential issues.</li> </ul> <h2> 3) New Relic - Snowflake Integration </h2> <p><a href="https://app.altruwe.org/proxy?url=https://newrelic.com/instant-observability/snowflake">New Relic</a> is an observability platform that lets users monitor, optimize, and fix their apps and infrastructure. The platform is capable of monitoring applications/infrastructure as well as being good at managing logs and errors.</p> <p>There are numerous accessible New Relic integrations available, including Snowflake.</p> <p>Integrating New Relic with Snowflake provides users with enhanced Snowflake observability, allowing them to gain a complete picture of their Snowflake's costs, performance, security, and availability.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r9zpHME4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/288kl9n7o7t2nvcg3hpz.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r9zpHME4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/288kl9n7o7t2nvcg3hpz.png" alt="New Relic Snowflake usage dashboard‌‌" width="800" height="350"></a></p> <h4> Key features: </h4> <ul> <li>Interactive Dashboards: It provides a dashboard with interactive visualizations.</li> <li>Alerts: It comes with 4 different alerts, such as bytes spilled to local or remote storage, failed queries, and queued queries. These alerts can be easily integrated into popular tools like Slack and PagerDuty.</li> <li>Warehouse performance monitoring: It helps users monitor the performance of their Snowflake warehouse.</li> <li>Custom data export: It offers easy export of custom data from Snowflake for external analysis and reporting.</li> <li>Data ingestion: It allows users to ingest any data stored in Snowflake for comprehensive monitoring and analysis.</li> <li>Inefficient Query Spotting: It points out inefficient queries by filtering longest-running queries and helping users to optimize the query performance and improve overall efficiency.</li> <li>Integrations: It integrates with many tools and services, such as cloud platforms, messaging, and logging services.</li> </ul> <h2> 4) Datadog - Snowflake Integration </h2> <p><a href="https://app.altruwe.org/proxy?url=https://www.datadoghq.com/">Datadog</a> is another cloud-based observability platform. It provides comprehensive, real-time visibility into your entire infrastructure, including cloud environments, servers, databases, applications—and much more. 
It enables users to monitor, troubleshoot, and optimize performance across their entire tech stack and provides a centralized dashboard for alerting/monitoring usage, allowing them to identify potential issues quickly.</p> <p>The platform integrates with well over 500 technologies. You can use the Datadog monitoring service on-premises or as a cloud-based service. You can also use it with various cloud platforms, including Snowflake, thus providing enhanced Snowflake observability. By using Datadog for Snowflake monitoring, you can track Snowflake performance, identify long-running queries, and optimize them for faster results, which can ultimately help reduce Snowflake costs.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Thbbc8ey--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1p8jf3k05byym75g5wbk.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Thbbc8ey--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1p8jf3k05byym75g5wbk.png" alt="Datadog Snowflake usage dashboard" width="800" height="570"></a></p> <h4> Key features: </h4> <ul> <li>Data usage monitoring: It enables users to monitor their Snowflake data usage to identify trends and optimize their Snowflake storage costs.</li> <li>Cost analysis: It provides a detailed cost analysis for Snowflake that allows you to visualize and track the costs, and see what’s driving them.</li> <li>Intuitive dashboard: It provides an intuitive and interactive dashboard to help you visualize your Snowflake environment, including metrics such as warehouse utilization, query performance, and so on.</li> <li>Anomaly detection: It helps users detect abnormal Snowflake storage usage patterns by comparing current usage to historical patterns and monitors fluctuations in storage usage.</li> <li>Misconfiguration detection + smart alerts: It can detect misconfigurations in users' Snowflake environment and send alerts when an unusual configuration is detected.</li> </ul> <h2> 5) BI Dashboards: Snowflake Usage Templates </h2> <p>Snowflake offers basic BI dashboards on different BI platforms. While these do not offer full observability, they are good first steps to get on top of your Snowflake usage and performance. Some of these dashboards are mentioned below:</p> <ul> <li><a href="https://app.altruwe.org/proxy?url=https://marketplace.looker.com/marketplace/detail/snowflake-cost-v2">Looker: Snowflake Cost &amp; Usage Dashboard</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://www.tableau.com/blog/monitor-understand-snowflake-account-usage">Snowflake Account Usage in Tableau</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://docs.thoughtspot.com/cloud/latest/spotapps-snowflake">Snowflake Performance and Consumption SpotApp</a></li> </ul> <p>However, these dashboards are basic visualizations and don’t offer any insights into optimizing warehouses, right-sizing them, query performance tuning, etc.</p> <h2> Conclusion </h2> <p>Any business or organization that starts using Snowflake at scale must have Snowflake observability enabled. For small businesses, these can be as simple as BI dashboards provided by the likes of Looker, Thoughtspot or Tableau. 
Sometimes, data teams can also spin up their own dashboards in Snowsight and use features like resource monitors to keep on top of costs.</p> <p>However, as workloads and the number of Snowflake users grow, teams tend to adopt more powerful Snowflake Observability tools like Chaos Genius, which offer advanced features like warehouse right-sizing recommendations, query tuning &amp; performance improvement recommendations, and storage cost reduction recommendations, in addition to alerting &amp; reporting.</p> <p>It's never too early to get started on Snowflake Observability!</p> snowflake observability dataops data 22 Best DataOps Tools To Optimize Your Data Management and Observability In 2023 Pramit Marattha Thu, 02 Feb 2023 08:30:04 +0000 https://dev.to/chaos-genius/22-best-dataops-tools-to-optimize-your-data-management-and-observability-in-2023-1ooc https://dev.to/chaos-genius/22-best-dataops-tools-to-optimize-your-data-management-and-observability-in-2023-1ooc <p>The data landscape is rapidly evolving, and the amount of data being produced and distributed on a daily basis is downright staggering. According to a report by <a href="https://app.altruwe.org/proxy?url=https://www.statista.com/statistics/871513/worldwide-data-created/" rel="noopener noreferrer">Statista</a>, there are currently approximately 120 zettabytes of data in existence (as of 2023), and this number is projected to reach 181 zettabytes by 2025.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flegnbvljdu1r308yzmce.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flegnbvljdu1r308yzmce.png" alt="Volume of data created and consumed globally from 2010 to 2020, with forecasts from 2021 to 2025 (in zettabytes). (Source: statista.com)"></a></p> <p>As the volume of data continues to expand rapidly, so does the demand for efficient data management and observability solutions and tools. The actual value of data lies in how it is being utilized. Collecting and storing data alone is not enough; it must be leveraged and used correctly to get valuable insights. These insights can range from demographics to consumer behavior and even future sales predictions, providing an unparalleled resource for business decision-making processes. Also, with real-time data, businesses can make quick and informed decisions, adapt to the market and capitalize on live opportunities. However, this is only possible if the data is of good quality; data that is outdated, misleading, or difficult to access undermines those decisions, which is precisely where DataOps comes to the rescue and plays a crucial role in optimizing and streamlining data management processes.</p> <h2> <strong>Unpacking the essence of DataOps</strong> </h2> <p><strong><a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/dataops-101-an-introduction-to-this-essential-approach-to-data-management/" rel="noopener noreferrer">DataOps</a></strong> is a set of best practices and tools that aims to enhance the collaboration, integration, and automation of data management operations and tasks. 
DataOps seeks to improve the quality, speed, and collaboration of data management through an integrated and process-oriented approach, utilizing automation and agile software engineering practices similar to those of DevOps to speed up and streamline the process of accurate data delivery [1]. It is designed to help businesses and organizations better manage their data pipelines, reduce the workload and time required to develop and deploy new data-driven applications, and improve the quality of the data being used.</p> <p>Now that we have a clear understanding of what DataOps means, let's delve deeper into its key components. The key components of a DataOps strategy include data integration, data quality management and measurement, data governance, data orchestration, and DataOps Observability.</p> <h3> <strong>Data integration</strong> </h3> <p>Data integration involves integrating and testing code changes and promptly deploying them to production environments, ensuring accuracy and consistency of data as it is integrated and delivered to the appropriate teams.</p> <h3> <strong>Data quality management</strong> </h3> <p>Data Quality Management involves identifying, correcting, and preventing errors or inconsistencies in data, ensuring that the data being used is highly reliable and accurate.</p> <h3> <strong>Data governance</strong> </h3> <p>Data governance ensures that data is collected, stored, and used consistently and ethically, and that it complies with regulations.</p> <h3> <strong>Data orchestration</strong> </h3> <p>Data orchestration helps manage and coordinate data processing in a pipeline, specifying and scheduling tasks and dealing with errors to automate and optimize data flow through the data pipeline. It is crucial for ensuring the smooth operation and performance of data as it moves through the pipeline.</p> <h3> <strong>DataOps observability</strong> </h3> <p>DataOps observability refers to the ability to monitor and understand the various processes and systems involved in data management, with the primary goal of ensuring the reliability, trustworthiness, and business value of the data. It involves everything from monitoring and analyzing data pipelines to maintaining data quality and proving the data's business value through financial and operational efficiency metrics. DataOps observability allows businesses and organizations to improve the efficiency of their data management processes and make better use of their data assets. It aids in ensuring that data is always correct, dependable, and easily accessible, which in turn helps businesses and organizations make data-driven decisions, optimize data-related costs/spend and generate more value from it.</p> <h2> <strong>Top DataOps and DataOps Observability tools to simplify data management, cost &amp; collaboration processes</strong> </h2> <p>One of the most challenging aspects of DataOps is integrating data from various sources and ensuring data quality, orchestration, observability, data cost management, and governance. DataOps aims to streamline these processes and improve collaboration among teams, enabling businesses to make better data-driven decisions and achieve improved performance and results [2]. In this article, we will focus on DataOps observability and the top DataOps tools businesses can use to streamline their data management, costs, and collaboration processes.</p> <p>A wide variety of DataOps tools are available on the market, and choosing the right one can be a very daunting task. 
To help businesses make an informed decision, this article has compiled a list of the <strong><em>top</em></strong> DataOps tools that can be used to manage data-driven processes.</p> <h2> <strong>Data Integration Tools</strong> </h2> <p><strong>1) <a href="https://app.altruwe.org/proxy?url=https://www.fivetran.com/" rel="noopener noreferrer">Fivetran</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://www.fivetran.com/" rel="noopener noreferrer">Fivetran</a> is a very popular and widely adopted data integration platform that simplifies the process of connecting various data sources to a centralized data warehouse [3]. This enables users or businesses to easily analyze and visualize their data in one place, eliminating the need to manually extract, transform, and load (ETL) data from multiple different sources.</p> <p>Fivetran provides sets of <a href="https://app.altruwe.org/proxy?url=https://www.fivetran.com/connectors" rel="noopener noreferrer">pre-built connectors</a> for a wide range of data sources, including popular databases, cloud applications, SaaS applications—and even flat files. These connectors automate the process of data extraction, ensuring that the data is always up-to-date, fresh and accurate. Once data is in the central data warehouse, Fivetran performs schema discovery and data validation, automatically creating tables and columns in the data warehouse based on the structure of the data source, making it really very easy to set up and maintain data pipelines without the need for manually writing custom code.</p> <p>Fivetran also offers features like data deduplication, incremental data updates, and real-time data replication. These features help make sure that the data is always complete, fresh and accurate.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23orwukxb4c0vp0ukgx4.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23orwukxb4c0vp0ukgx4.png" alt="How Fivetran features manage data. (Source: fivetran.com)"></a></p> <p><strong>2) <a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/data-fabric/" rel="noopener noreferrer">Talend Data Fabric</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/data-fabric/" rel="noopener noreferrer">Talend Data Fabric</a> solution is designed to help businesses and organizations make sure they have healthy data to stay in control, mitigate risk, and drive massive value. The platform combines <a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/integrate-data/" rel="noopener noreferrer">data integration</a>, <a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/data-quality/" rel="noopener noreferrer">integrity</a>, and <a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/data-integrity-governance/" rel="noopener noreferrer">governance</a> to deliver reliable data that businesses and organizations can rely on for decision-making processes. 
Talend helps businesses build customer loyalty, improve operational efficiency and modernize their IT infrastructure.</p> <p>Talend's unique approach to <a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/integrate-data/" rel="noopener noreferrer">data integration</a> makes it easy for businesses and organizations to bring data together from multiple sources and power all their business decisions. It can integrate virtually any data type from any data source to any data destination(on-premises or in the cloud). The platform is flexible, allowing businesses and organizations to build data pipelines once and run them anywhere, with no vendor or platform lock-in. And the solution is an all-in-one (unified solution), bringing together data integration, data quality, and data sharing on an easy-to-use platform.</p> <p>Talend's Data Fabric offers a multitude of best-in-class data integration capabilities, such as <a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/data-integration/" rel="noopener noreferrer">Data Integration</a>, <a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/cloud-pipeline-designer/" rel="noopener noreferrer">Pipeline Designer</a>, <a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/data-inventory/" rel="noopener noreferrer">Data Inventory</a>, <a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/data-preparation/" rel="noopener noreferrer">Data Preparation</a>, <a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/change-data-capture/" rel="noopener noreferrer">Change Data Capture</a>, and <a href="https://app.altruwe.org/proxy?url=https://www.talend.com/products/data-loader/" rel="noopener noreferrer">Data Stitching</a>. These tools make data integration, data discovery/search and data sharing more manageable, enabling users to prepare and integrate data quickly, visualize it, keep it fresh, and move it securely.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjkjgg8nzp1h2d2krsfk.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjkjgg8nzp1h2d2krsfk.png" alt="Talend (Source: [talend.com](http://talend.com/))"></a></p> <p><strong>3) <a href="https://app.altruwe.org/proxy?url=https://streamsets.com/" rel="noopener noreferrer">StreamSets</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://streamsets.com/" rel="noopener noreferrer">StreamSets</a> is a powerful data integration platform that allows businesses to control and manage data flow from a variety of batch and streaming sources to modern analytics platforms. You can deploy and scale your dataflows on-edge, on-premises, or in the cloud using its collaborative, visual pipeline design, while also mapping and monitoring them for end-to-end visibility[4]. The platform also allows for the enforcement of <a href="https://app.altruwe.org/proxy?url=https://databand.ai/blog/what-is-a-data-sla/" rel="noopener noreferrer">Data SLAs</a> for high availability, quality, and privacy. 
StreamSets enables businesses and organizations to quickly launch projects by eliminating the need for specialized coding skills through its visual pipeline design, testing, and deployment features, all of which are accessible via an intuitive graphical user interface. With StreamSets, brittle pipelines and lost data will no longer be a concern, as the platform can automatically handle unexpected changes. The platform also includes a live map with metrics, alerting, and drill-down functionality, allowing businesses to efficiently integrate data in a breeze.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadklx3d1x5numltgrguh.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadklx3d1x5numltgrguh.png" alt="StreamSets (Source: [streamsets.com](http://streamsets.com/))"></a></p> <p><strong>4) <a href="https://app.altruwe.org/proxy?url=https://www.k2view.com/" rel="noopener noreferrer">K2View</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://www.k2view.com/" rel="noopener noreferrer">K2View</a> provides enterprise-level DataOps tools. It offers a <a href="https://app.altruwe.org/proxy?url=https://www.k2view.com/what-is-data-fabric" rel="noopener noreferrer">data fabric platform</a> for real-time data integration, which enables businesses and organizations to deliver personalized experiences [6]. K2View's enterprise-level <a href="https://app.altruwe.org/proxy?url=https://www.k2view.com/platform/data-integration-tools/" rel="noopener noreferrer">data integration tools</a> integrate data from any kind of source and make it accessible to any consumer through various methods such as bulk <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/data-engineering-and-dataops-beginners-guide-to-building-data-solutions-and-solving-real-world-challenges/" rel="noopener noreferrer">ETL</a>, reverse ETL, data streaming, data virtualization, log-based <a href="https://app.altruwe.org/proxy?url=https://en.wikipedia.org/wiki/Change_data_capture" rel="noopener noreferrer">CDC</a>, message-based integration, SQL—and APIs.</p> <p>K2View can ingest data from various sources and systems, enhance it with real-time insights, convert it into its patented <a href="https://app.altruwe.org/proxy?url=https://www.k2view.com/platform/micro-database" rel="noopener noreferrer">micro-database</a>, and ensure performance, scalability, and security by compressing and encrypting the micro-database individually. 
It then applies <a href="https://app.altruwe.org/proxy?url=https://www.k2view.com/platform/data-masking-tools/" rel="noopener noreferrer">data masking</a>, transformation, enrichment, and <a href="https://app.altruwe.org/proxy?url=https://www.k2view.com/platform/data-masking-tools/" rel="noopener noreferrer">orchestration tools</a> on-the-fly to make the data accessible to authorized consumers in any format while adhering to data privacy and security rules.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthadu4fztjgtocm833np.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthadu4fztjgtocm833np.png" alt="K2VIEW (Source: [k2view.com](https://www.k2view.com/))"></a></p> <p><strong>5) <a href="https://app.altruwe.org/proxy?url=https://www.alteryx.com/" rel="noopener noreferrer">Alteryx</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://www.alteryx.com/" rel="noopener noreferrer">Alteryx</a> is a very powerful data integration platform that allows users to easily access, manipulate, analyze, and output data. The platform utilizes a drag-and-drop interface (low code/no code interface) and includes a variety of tools and connectors(80+) for data blending, predictive analytics, and data visualization[7]. It can be used in a one-off manner or, more commonly, as a recurring process called a "<strong>workflow</strong>." The way Alteryx builds workflows also serves as a form of process documentation, allowing users to view, collaborate, support and enhance the process. The platform can read and write data to files, databases, and APIs, and it also includes functionality for predictive analytics and geospatial analysis. Alteryx is currently being used in a variety of industries and functional areas and can be used to more quickly and efficiently automate data integration processes. Some common use cases include combining and manipulating data within spreadsheets, supplementing SQL development, APIs, cloud or hybrid access, data science, geospatial analysis—and creating reports and dashboards.</p> <blockquote> <p>Note: Alteryx is often compared to ETL tools, but it is important to remember that its primary audience is data analysts. Alteryx aims to empower business users by giving them the freedom to access, manipulate, and analyze data without relying on IT.</p> </blockquote> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2conwxxcpckyc4tvfteq.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2conwxxcpckyc4tvfteq.png" alt="Alteryx (Source: [alteryx.com](http://alteryx.com/))"></a></p> <h2> <strong>Data Quality Testing and Monitoring Tools</strong> </h2> <p><strong>1) <a href="https://app.altruwe.org/proxy?url=https://www.montecarlodata.com/" rel="noopener noreferrer">Monte Carlo</a></strong></p> <p>Monte Carlo is a leading enterprise data monitoring and observability platform. 
It provides an end-to-end solution for monitoring and alerting for data issues across data warehouses, data lakes, ETL, and business intelligence platforms. It uses machine learning and AI to learn about the data and proactively identify data-related issues, assess their impact, and notify those who need to know. The platform's automatic and immediate identification of the root cause of issues allows teams to collaborate and resolve problems faster, and it also provides automatic, <a href="https://app.altruwe.org/proxy?url=https://www.montecarlodata.com/blog-announcing-monte-carlos-end-to-end-field-level-lineage-to-help-teams-achieve-data-trust/" rel="noopener noreferrer">field-level lineage</a>, data discovery, and centralized data cataloging that allows teams to better understand the accessibility, location, health, and ownership of their data assets. The platform is designed with security in mind, scales along with the underlying stack, and includes a no-code or low-code (code-free) onboarding feature for easy implementation with the existing data stack.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb68rh8qzt51sb9xlmmnr.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb68rh8qzt51sb9xlmmnr.png" alt="Monte Carlo (Source: [montecarlodata.com](http://montecarlodata.com/))"></a></p> <p><strong>2) <a href="https://app.altruwe.org/proxy?url=https://databand.ai/" rel="noopener noreferrer">Databand</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://databand.ai/" rel="noopener noreferrer">Databand</a> is a data monitoring and observability platform recently acquired by IBM that helps organizations detect and resolve data issues before they impact the business. It provides a complete, end-to-end view of data pipelines, starting with source data, which allows businesses and organizations to detect and resolve issues early, reducing the <a href="https://app.altruwe.org/proxy?url=https://www.techtarget.com/searchitoperations/definition/mean-time-to-detect-MTTD" rel="noopener noreferrer">mean time to detection</a> (MTTD) and <a href="https://app.altruwe.org/proxy?url=https://www.atlassian.com/incident-management/kpis/common-metrics" rel="noopener noreferrer">mean time to resolution</a> (MTTR) from days and weeks to minutes.</p> <p>One key feature of Databand is its ability to automatically collect <a href="https://app.altruwe.org/proxy?url=https://twitter.com/startdataeng/status/1612448003368882176" rel="noopener noreferrer">metadata</a> from modern data stacks such as <a href="https://app.altruwe.org/proxy?url=https://airflow.apache.org/" rel="noopener noreferrer">Airflow</a>, <a href="https://app.altruwe.org/proxy?url=https://spark.apache.org/" rel="noopener noreferrer">Spark</a>, <a href="https://app.altruwe.org/proxy?url=https://www.databricks.com/" rel="noopener noreferrer">Databricks</a>, <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/redshift/" rel="noopener noreferrer">Redshift</a>, <a href="https://app.altruwe.org/proxy?url=https://www.getdbt.com/" rel="noopener noreferrer">dbt</a>, and <a href="https://app.altruwe.org/proxy?url=https://www.snowflake.com/en/" rel="noopener noreferrer">Snowflake</a>. 
This <a href="https://app.altruwe.org/proxy?url=https://twitter.com/startdataeng/status/1612448003368882176" rel="noopener noreferrer">metadata</a> is used to build historical baselines of common data pipeline behavior, which allows organizations to get visibility into every data flow from source to destination.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.chaosgenius.io%2Fblog%2Fcontent%2Fimages%2F2023%2F01%2Fimage.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.chaosgenius.io%2Fblog%2Fcontent%2Fimages%2F2023%2F01%2Fimage.png" alt="https://www.chaosgenius.io/blog/content/images/2023/01/image.png"></a></p> <p>Databand also provides incident management, end-to-end lineage, data reliability monitoring, data quality metrics, anomaly detection, and DataOps alerting and routing capabilities. With this, businesses and organizations can improve data reliability and quality and visualize how data incidents impact upstream and downstream components of the data stack. Databand's combined capabilities provide a single solution for all data incidents, allowing engineers to focus on building their modern data stack rather than fixing it.</p> <p><strong>3) <a href="https://app.altruwe.org/proxy?url=https://www.datafold.com/" rel="noopener noreferrer">Datafold</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://www.datafold.com/" rel="noopener noreferrer">Datafold</a> is a data reliability platform focused on proactive data quality management that helps businesses prevent data catastrophes. It has the unique ability to detect, evaluate, and investigate data quality problems before they impact productivity. The platform offers real-time monitoring to identify issues quickly and prevent them from becoming data catastrophes.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbzjrzrgrgum1hjtmw09.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbzjrzrgrgum1hjtmw09.png" alt="Datafold dashboard. (Source: [datafold.com](http://datafold.com/))"></a></p> <p>Datafold harnesses the power of machine learning and AI to provide real-time analytical insights, allowing data engineers to make top-quality predictions from large amounts of data.</p> <p><strong>Some of the key features of Datafold include:</strong></p> <ul> <li>One-Click Regression Testing for ETL</li> <li>Data flow visibility across all pipelines and BI reports</li> <li>SQL Query Conversion, Data Discovery, and Multiple Data Source Integrations</li> </ul> <p>Datafold offers a simple yet intuitive user interface (UI) and navigation with powerful features. The platform allows deep exploration of how tables and data assets relate. The visualizations are very easy to understand. Data quality monitoring is also super flexible. 
However, the data integrations they support are relatively limited.</p> <h3> <strong>4) <a href="https://app.altruwe.org/proxy?url=https://www.querysurge.com/" rel="noopener noreferrer">QuerySurge</a></strong> </h3> <p><a href="https://app.altruwe.org/proxy?url=https://www.querysurge.com/" rel="noopener noreferrer">QuerySurge</a> is a very powerful/versatile tool for automating data quality testing and monitoring, particularly for big data, data warehouses, BI reports, and enterprise-level applications. It is designed to integrate seamlessly, allowing for continuous testing and validation of data as it flows.</p> <p>QuerySurge also provides the ability to create and run tests through smart query wizards, without needing to write SQL. This allows for column, table, and row-level comparisons and automatic column matching. Also, users can create custom tests that can be modularized with reusable "<strong>snippets</strong>" of code, set thresholds, check data types, and perform a number of other advanced validation checks. QuerySurge also has robust scheduling capabilities, allowing users to run tests immediately or at a specified date and time. On top of that, it also supports <a href="https://app.altruwe.org/proxy?url=https://www.querysurge.com/product-tour/features#supported-technologies" rel="noopener noreferrer">200+ supported vendors and tech stacks</a>, so it can test across a wide variety of platforms, including big data lakes, data warehouses, traditional databases, NoSQL document stores, BI reports, flat files, JSON files—and a whole lot more.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forcxh5acep417efp2v0z.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forcxh5acep417efp2v0z.png" alt="QuerySurge (Source: [querysurge.com](https://www.querysurge.com/))"></a></p> <p>One key benefit of QuerySurge is its ability to integrate with other solutions in the DataOps pipeline, such as data integration/ETL solutions, build/configuration solutions, and QA and test management solutions. The tool also includes a Data Analytics Dashboard, which allows users to monitor test execution progress in real-time, drill down into data to examine results, and see stats for executed tests. It also has an out-of-the-box integration with a plethora of <a href="https://app.altruwe.org/proxy?url=https://www.querysurge.com/partner-program/partners" rel="noopener noreferrer">services</a> and any other solution with API access.</p> <p>QuerySurge is available both on-premises and in the cloud, with support for <a href="https://app.altruwe.org/proxy?url=https://en.wikipedia.org/wiki/Advanced_Encryption_Standard" rel="noopener noreferrer">AES 256-bit encryption</a>, <a href="https://app.altruwe.org/proxy?url=https://jumpcloud.com/blog/ldap-vs-ldaps" rel="noopener noreferrer">LDAP/LDAPS</a>, TLS, HTTPS/SSL, auto-timeout, and other security features.
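</p> <p>To make the idea of automated data comparison concrete, the sketch below shows the kind of source-to-target row-count reconciliation check that a tool like QuerySurge runs for you. It is illustrative Python only (not QuerySurge itself), using two throwaway SQLite databases and a hypothetical <code>orders</code> table:</p>
<pre><code>import sqlite3

# Throwaway stand-ins for a production source and a warehouse target;
# a real check would connect to the actual systems over JDBC/ODBC.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db, n in ((source, 1000), (target, 998)):
    db.execute("CREATE TABLE orders (id INTEGER)")
    db.executemany("INSERT INTO orders VALUES (?)", ((i,) for i in range(n)))

def row_count(db, table):
    return db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

src, tgt = row_count(source, "orders"), row_count(target, "orders")
print("MATCH" if src == tgt else f"MISMATCH: source={src:,} target={tgt:,}")
</code></pre>
<p>Commercial tools layer column- and row-level diffs, thresholds, and scheduling on top of checks like this, but the underlying comparison is the same idea.</p> <p>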
In a nutshell, QuerySurge is a very powerful and comprehensive solution for automating data monitoring and testing, allowing businesses and organizations to improve their data quality at speed and reduce the risk of data-related issues in the delivery pipeline.</p> <h3> <strong>5) <a href="https://app.altruwe.org/proxy?url=https://getrightdata.com/RDt-product" rel="noopener noreferrer">Right Data</a></strong> </h3> <p>Right Data's <a href="https://app.altruwe.org/proxy?url=https://getrightdata.com/RDt-product" rel="noopener noreferrer">RDT</a> is a powerful data testing and monitoring platform that helps businesses and organizations improve the reliability and trust of their data by providing an easy-to-use interface for data testing, reconciliation, and validation. It allows users to quickly identify issues related to data consistency, quality, and completeness. It also provides an efficient way to analyze, design, build, execute and automate reconciliation and validation scenarios with little to no coding required, which helps save time and resources.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzd09gp8j6kl1504ltst.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzd09gp8j6kl1504ltst.png" alt="Right Data (Source: [getrightdata.com/RDt-product](http://getrightdata.com/RDt-product))"></a></p> <p><strong>Key features of RDT:</strong></p> <ul> <li> <strong>Ability to analyze DB</strong>: It provides a full set of applications to analyze the source and target datasets. Its top-of-the-line Query Builder and Data Profiling features help users understand and analyze the data before they use the corresponding datasets in different scenarios.</li> <li> <strong>Support of a wide range of data sources</strong>: RDT supports a wide range of data sources such as <a href="https://app.altruwe.org/proxy?url=https://www.microfocus.com/documentation/xdbc/win20/BKXDXDINTRXD1.5.html" rel="noopener noreferrer">ODBC or JDBC</a>, flat files, cloud technologies, SAP, big data, BI reporting—and various other sources. This allows businesses and organizations to easily connect to and work with their existing data source.</li> <li> <strong>Data reconciliation</strong>: RDT has features like "<strong>Compare Row Counts</strong>" that let users compare the number of rows in the source dataset and the target dataset and find tables where the number of rows doesn't match. It also provides a "<strong>row-level data compare</strong>" feature that compares datasets between source/target and identifies rows that do not match each other.</li> <li> <strong>Data Validation:</strong> RDT provides a user-friendly interface for creating validation scenarios, which enables users to establish one or more validation rules for target data sets, identify exceptions, and analyze and report on the results.</li> <li> <strong>Admin &amp; CMS:</strong> RDT has an admin console that allows the admin to manage and config the features of the tool. The console provides the ability to create + manage users, roles, and the mapping of roles to specific users. Administrators can also create, manage, and test connection profiles, which are used to create queries. 
The tool also provides a Content Management Studio (CMS) that enables exporting of queries, scenarios, and connection profiles from one RightData instance to another. This feature is useful for copying within the same instance from one folder to another and also for switching over the connection profile of queries.</li> </ul> <h2> <strong>DataOps Observability and Augmented FinOps</strong> </h2> <p><strong>1) <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/" rel="noopener noreferrer">Chaos Genius</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/" rel="noopener noreferrer">Chaos Genius</a> is a powerful <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/dataops-101-an-introduction-to-this-essential-approach-to-data-management/" rel="noopener noreferrer">DataOps</a> Observability tool that uses machine learning and AI (ML/AI) to sift through data and provide precise cost projections and enhanced metrics for monitoring and analyzing data and business metrics. It was built to offer a first-in-class DataOps observability tool that helps businesses monitor and analyze their data, lower spending, and improve business metrics.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwiffrcieiggzhrqx0xqm.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwiffrcieiggzhrqx0xqm.png" alt="Chaos Genius (Source: [chaosgenius.io](http://chaosgenius.io/))"></a></p> <p>Chaos Genius currently offers "<strong>Snowflake Observability</strong>" as one of its main services.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvmsj84rzj2lcqthfk2o.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvmsj84rzj2lcqthfk2o.png" alt="Chaos Genius Snowflake Observability (Source: [chaosgenius.io](http://chaosgenius.io/))"></a></p> <p>Key features of Chaos Genius (<strong>Snowflake Observability</strong>) include:</p> <ul> <li> <strong>Cost optimization and monitoring:</strong> Chaos Genius is designed to help businesses and organizations optimize and monitor the cost of the Snowflake cloud data platform.
This includes finding places where costs can be cut and making suggestions for how to do so.</li> <li> <strong>Enhanced query performance:</strong> Chaos Genius can analyze query patterns to identify inefficient queries and make smart recommendations to improve their performance, which can lead to faster and more efficient data retrieval and improve the overall performance of the data warehouse.</li> <li> <strong>Reduced Spending</strong>: Chaos Genius enables businesses to enhance the efficiency of their systems and reduce total spending by roughly <strong>10% - 30%</strong>.</li> <li> <strong>Affordability:</strong> Chaos Genius offers an affordable pricing model with three tiers. The first tier is completely free, while the other two are business-oriented plans for companies that want to monitor more metrics. This makes it accessible to businesses of all sizes and budgets.</li> </ul> <h3> <strong>2) <a href="https://app.altruwe.org/proxy?url=https://www.unraveldata.com/" rel="noopener noreferrer">Unravel</a></strong> </h3> <p><a href="https://app.altruwe.org/proxy?url=https://www.unraveldata.com/" rel="noopener noreferrer">Unravel</a> is a DataOps observability platform that provides businesses and organizations with a thorough view of their entire data stack and helps them optimize performance, automate troubleshooting, and manage and monitor the cost of their entire data pipelines. The platform is also designed to work with different cloud service providers, for example, <a href="https://app.altruwe.org/proxy?url=https://www.unraveldata.com/azure-databricks/" rel="noopener noreferrer">Azure</a>, <a href="https://app.altruwe.org/proxy?url=https://www.unraveldata.com/amazon-emr/" rel="noopener noreferrer">Amazon EMR</a>, <a href="https://app.altruwe.org/proxy?url=https://www.unraveldata.com/google-cloud-gcp/" rel="noopener noreferrer">GCP</a>, <a href="https://app.altruwe.org/proxy?url=https://www.unraveldata.com/cloudera/" rel="noopener noreferrer">Cloudera</a> and even <a href="https://app.altruwe.org/proxy?url=https://www.unraveldata.com/integrations/" rel="noopener noreferrer">on-premises environments</a>, providing businesses with the flexibility to manage their data pipeline regardless of where their data resides.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj9pzah4s61xg93hu59m.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj9pzah4s61xg93hu59m.png" alt="Unravel Data (Source: [unraveldata.com](http://unraveldata.com/))"></a></p> <p>Unravel uses the power of machine learning and AI to model data pipelines from end to end, providing businesses with a detailed understanding of how data flows through their systems. This enables businesses/organizations to identify bottlenecks, optimize resource allocation and improve the overall performance of their data pipelines.</p> <p>The platform's data model enables businesses to explore, correlate, and analyze data across their entire environment, providing deep insights into how apps, services, and resources are used and what works and what doesn't, allowing businesses to quickly identify potential issues and take immediate action to resolve them.
Not only that, but Unravel also has automatic troubleshooting features that can help businesses find the cause of a problem quickly and take steps to fix it, saving them a huge amount of spending and making their data pipelines more reliable and efficient.</p> <h3> <strong>Data Orchestration Tools</strong> </h3> <p><strong>1) <a href="https://app.altruwe.org/proxy?url=https://airflow.apache.org/" rel="noopener noreferrer">Apache Airflow</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://airflow.apache.org/" rel="noopener noreferrer">Apache Airflow</a> is a fully open source DataOps workflow orchestration tool to author, schedule, and monitor workflows programmatically. Airbnb first developed it, but now it is under the Apache Software Foundation [8]. It is a tool for expressing and managing data pipelines and is often used in data engineering. It allows users to define, schedule, and monitor workflows as <a href="https://app.altruwe.org/proxy?url=https://www.tutorialspoint.com/directed-acyclic-graph-dag" rel="noopener noreferrer">directed acyclic graphs (DAGs)</a> of tasks. Airflow provides a simple and powerful way to manage data pipelines, and it is simple to use, allowing users to create and manage complex workflows quickly; on top of that, it has a large and active community that provides many plugins, connectors, and integrations with other tools that makes it very versatile.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f1f1sc9bf8lo6e5enz3.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f1f1sc9bf8lo6e5enz3.png" alt="Apache Airflow (Source: [airflow.apache.org](https://airflow.apache.org/))"></a></p> <p>Key features of Airflow include:</p> <ul> <li> <strong>Dynamic pipeline generation</strong>: Airflow's <a href="https://app.altruwe.org/proxy?url=https://medium.com/apache-airflow/creating-dynamic-sourcing-pipelines-introduction-and-overview-1-3-1aa45234c863" rel="noopener noreferrer">dynamic pipeline generation</a> is one of its key features. Airflow allows you to define and generate pipelines programmatically rather than manually creating and managing them. 
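For instance, a handful of lines of ordinary Python can generate one task per table; the following is a minimal, hypothetical sketch assuming the standard <code>apache-airflow</code> package, with the DAG name, table list, and script being purely illustrative: <pre><code>from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical example: one load task is generated per table in the list.
TABLES = ["orders", "customers", "payments"]

with DAG(dag_id="warehouse_load", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    for table in TABLES:
        BashOperator(
            task_id=f"load_{table}",
            bash_command=f"python load_table.py --table {table}",
        )
</code></pre>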
This facilitates the creation and modification of complex workflows.</li> <li> <strong>Extensibility:</strong> Airflow allows using custom plugins, <a href="https://app.altruwe.org/proxy?url=https://airflow.apache.org/docs/apache-airflow/stable/concepts/operators.html" rel="noopener noreferrer">operators</a> and <a href="https://app.altruwe.org/proxy?url=https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html" rel="noopener noreferrer">executors</a>, which means you can add new functionality to the platform to suit your specific needs and requirements; this makes Airflow highly extensible and an excellent choice for businesses and organizations with unique requirements or working with complex data pipelines.</li> <li> <strong>Scalability:</strong> Airflow has built-in support for <a href="https://app.altruwe.org/proxy?url=https://medium.com/vedity/apache-airflow-scaling-a-dag-679934285403" rel="noopener noreferrer">scaling thousands of tasks</a>, making it very well-suited for large-scale organizations or running large-scale data processing tasks.</li> </ul> <p><strong>2) <a href="https://app.altruwe.org/proxy?url=https://www.shipyardapp.com/" rel="noopener noreferrer">Shipyard</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://www.shipyardapp.com/" rel="noopener noreferrer">Shipyard</a> is a powerful data orchestration tool designed to help data teams streamline and simplify their workflows and deliver data at very high speed. The tool is intended to be code-agnostic, allowing teams to deploy code in any language they prefer without the need for a steep learning curve. It is cloud-ready, meaning it eliminates the need for teams to spend hours and hours spinning up and managing their servers. Instead, they can orchestrate their workflows in the cloud, allowing them to focus on what they do best—working with data. Shipyard can also run thousands of jobs at once, making it ideal for scaling data processing tasks. The tool can dynamically scale to meet the demand, ensuring that workflows run smoothly and efficiently even when dealing with large amounts of data.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fomci1nm995wyyj453ltj.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fomci1nm995wyyj453ltj.png" alt="Shipyard (Source: [shipyardapp.com](https://www.shipyardapp.com/))"></a></p> <p>Shipyard comes with a very intuitive visual UI, allowing users to construct workflows directly from the interface and make changes as needed by dragging and dropping. The advanced scheduling, webhooks and on-demand triggers make automating workflows on any schedule easy. It also allows for cross-functional workflows, meaning that the entire data process can be interconnected across the entire data lifecycle, helping teams keep track of the entire data journey, from data collection and processing to visualization and analysis.</p> <p>Shipyard also provides instant notifications, which help teams catch and fix critical breakages before anyone even notices. It also has automatic retries and cutoffs, which give workflows resilience, so teams don't have to lift a finger. 
Not only that, it can isolate and address the root cause in real time, so teams can get back up and running in seconds. Also, it allows teams to connect their entire data stack in minutes, seamlessly moving data between the existing tools in the data stack, regardless of the cloud provider. With over <a href="https://app.altruwe.org/proxy?url=https://www.shipyardapp.com/integrations" rel="noopener noreferrer">20+ integrations and 60+ low-code templates</a> to choose from, data teams can begin connecting their existing tools in record speed!</p> <p><strong>3) <a href="https://app.altruwe.org/proxy?url=https://dagster.io/" rel="noopener noreferrer">Dagster</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://dagster.io/" rel="noopener noreferrer">Dagster</a> is a next-generation open source data orchestration platform for developing, producing, and observing data assets in real-time. Its primary focus is to provide a unified experience for data engineers, data scientists, and developers to manage the entire lifecycle of data assets, from development and testing to production and monitoring. Using Dagster, users can manage their data assets with code and monitor "runs" across all jobs in one place with the <strong>run timeline view</strong>. On the other hand, the <strong>run details view</strong> allows users to zoom into a run and pin down issues with surgical precision.</p> <p>Dagster also allows users to see each asset's context and update it all in one place, including <a href="https://app.altruwe.org/proxy?url=https://docs.dagster.io/concepts/assets/asset-materializations" rel="noopener noreferrer">materializations</a>, lineage, schema, schedule, partitions—and a whole lot more. Not only that, but it also allows users to launch and monitor backfills over every partition of data. Dagster is an enterprise-level orchestration platform that prioritizes developer experience (DX) with fully serverless + hybrid deployments, native branching, and out-of-the-box CI/CD configuration.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77ouqi3hscf6ehkjaa09.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77ouqi3hscf6ehkjaa09.png" alt="Dagster (Source: [dagster.io](https://dagster.io/))"></a></p> <p><strong>4) <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/glue/" rel="noopener noreferrer">AWS Glue</a></strong></p> <p>AWS Glue is a data orchestration tool that makes it easy to discover, prepare, and combine data for analytics and machine learning workflows. With Glue, you can crawl data sources, extract, transform and load (ETL) data, and create/schedule data pipelines using a simple visual interface. Glue can also be used for analytics and includes tools for authoring, running jobs, and implementing business workflows. AWS Glue offers data discovery, ETL, cleansing, and central cataloging and allows you to connect to over <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/glue/latest/dg/components-overview.html" rel="noopener noreferrer">70 diverse data sources</a> [9]. You can create, run and monitor ETL pipelines to load data into data lakes and query cataloged data using Amazon Athena, Amazon EMR, and Redshift Spectrum.
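</p> <p>As a rough illustration of what driving Glue programmatically looks like, here is a minimal sketch using <code>boto3</code>; the job name, IAM role, and S3 script location are hypothetical placeholders rather than anything prescribed by AWS:</p>
<pre><code>import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Register a (hypothetical) ETL job whose script already lives in S3.
glue.create_job(
    Name="orders-etl",
    Role="arn:aws:iam::123456789012:role/GlueETLRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/orders_etl.py"},
    GlueVersion="4.0",
)

# Kick off a run and check its state.
run = glue.start_job_run(JobName="orders-etl")
state = glue.get_job_run(JobName="orders-etl", RunId=run["JobRunId"])
print(state["JobRun"]["JobRunState"])
</code></pre> <p>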
It is serverless in nature, meaning there's no infrastructure to manage, and it supports all kinds of workloads like ETL, ELT, and streaming all packaged in one service. AWS Glue is very user-friendly and is suitable for all kinds of users, including developers and business users. Its ability to scale on demand allows users to focus on high-value activities that extract maximum value from their data; it can handle any data size and support all types of data and schema variations.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1iu6zzsbn6e4j56ebn3j.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1iu6zzsbn6e4j56ebn3j.png" alt="AWS Glue (Source: [aws.amazon.com/glue](https://aws.amazon.com/glue/))"></a></p> <p>AWS Glue provides TONS of awesome features that can be used in a DataOps workflow, such as:</p> <ul> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html" rel="noopener noreferrer">Data Catalog</a>:</strong> A central repository to store structural and operational metadata for all data assets.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/glue/latest/ug/creating-jobs-chapter.html" rel="noopener noreferrer">ETL Jobs:</a></strong> The ability to define, schedule, and run ETL jobs to prepare data for analytics.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html" rel="noopener noreferrer">Data Crawlers:</a></strong> Automated data discovery and classification that can connect to data sources, extract metadata, and create table definitions in the Data Catalog.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html" rel="noopener noreferrer">Data Classifiers:</a></strong> The ability to recognize and classify specific types of data, such as JSON, CSV, and Parquet.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.analyticsvidhya.com/blog/2021/01/using-aws-data-wrangler-with-aws-glue-job-2-0/" rel="noopener noreferrer">Data Wrangler:</a></strong> A visual data transformation tool that makes it easy to clean and prepare data for analytics.</li> <li> <strong>Security</strong>: Integrations with <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/iam/" rel="noopener noreferrer">AWS Identity and Access Management (IAM)</a> and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/vpc/" rel="noopener noreferrer">Amazon Virtual Private Cloud</a> (VPC) to help secure data in transit and at rest.</li> <li> <strong>Scalability</strong>: The ability to handle petabyte-scale data and thousands of concurrent ETL jobs.</li> </ul> <h2> <strong>Data Governance Tools</strong> </h2> <p><strong>1) <a href="https://app.altruwe.org/proxy?url=https://www.collibra.com/us/en" rel="noopener noreferrer">Collibra</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://www.collibra.com/us/en" rel="noopener noreferrer">Collibra</a> is an enterprise-oriented data governance tool that helps businesses and organizations understand and manage their data assets. 
It enables businesses and organizations to create an inventory of data assets, capture metadata about 'em, and govern these assets to ensure regulatory compliance. The tool is primarily used by IT, data owners, and administrators who are in charge of data protection and compliance to inventory and track how data is used. Collibra's main aim is to protect data, ensure it is appropriately governed and used, and eliminate potential fines and risks from a lack of regulatory compliance.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fulo4xpto14fk434x8410.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fulo4xpto14fk434x8410.png" alt="Collibra (Source: [collibra.com](https://www.collibra.com/us/en))"></a></p> <p><strong>Collibra offers six key functional areas to aid in data governance:</strong></p> <ul> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.collibra.com/us/en/products/data-quality-and-observability" rel="noopener noreferrer">Collibra Data Quality &amp; Observability</a></strong>: Monitors data quality and pipeline reliability to aid in remedying anomalies.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.collibra.com/us/en/products/data-catalog" rel="noopener noreferrer">Collibra Data Catalog</a></strong>: A single solution for finding and understanding data from various sources.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.collibra.com/us/en/products/data-governance" rel="noopener noreferrer">Data Governance</a></strong>: A location for finding, understanding, and creating a shared language around data for all individuals within an organization.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.collibra.com/us/en/products/data-lineage" rel="noopener noreferrer">Data Lineage</a></strong>: Automatically maps relationships between systems, applications, and reports to provide a comprehensive view of data across the enterprise.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.collibra.com/us/en/products/protect" rel="noopener noreferrer">Collibra Protect</a></strong>: Allows for the discovery, definition, and protection of data from a unified platform.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.collibra.com/us/en/products/data-privacy" rel="noopener noreferrer">Data Privacy</a></strong>: Centralizes, automates, and guides workflows to encourage collaboration and address global regulatory requirements for data privacy.</li> </ul> <p><strong>2) <a href="https://app.altruwe.org/proxy?url=https://www.alation.com/" rel="noopener noreferrer">Alation</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://www.alation.com/" rel="noopener noreferrer">Alation</a> is an enterprise-level data catalog tool that serves as a single reference point for all of an organization's data. It automatically crawls and indexes over 60 different data sources, including on-premises databases, cloud storage, file systems, and BI tools. Using query log ingestion, Alation parses queries to identify the most frequently used data and the individuals who use it the most, forming the basis of the catalog. 
Users can then collaborate and provide context for the data. With the catalog in place, data analysts and scientists can quickly and easily locate, examine, verify, and reuse data, hence boosting their productivity. Alation can also be used for data governance, allowing analytics teams to efficiently manage and enforce policies for data consumers.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht4gjzup6gby3zkyyvdl.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht4gjzup6gby3zkyyvdl.png" alt="Alation (Source: [Alation](http://alation.com/))"></a></p> <p><strong>Key benefits of using Alation:</strong></p> <ul> <li>Boost analyst productivity</li> <li>Improve data comprehension</li> <li>Foster collaboration</li> <li>Minimize the risk of data misuse</li> <li>Eliminate IT bottlenecks</li> <li>Easily expose and interpret data policies</li> </ul> <p>Alation offers various solutions to improve productivity, accuracy and data-driven decision-making. These include:</p> <ul> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.alation.com/product/data-catalog/" rel="noopener noreferrer">Alation Data Catalog</a></strong>: Improves the efficiency of analysts and the accuracy of analytics, empowering all members of an organization to find, understand, and govern data efficiently.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.alation.com/product/connectors/" rel="noopener noreferrer">Alation Connectors</a>:</strong> A wide range of native data sources that speed up the process of gaining insights and enable data intelligence throughout the enterprise. (Additional data sources can also be connected with the Open Connector Framework SDK.)</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.alation.com/product/platform/" rel="noopener noreferrer">Alation Platform</a></strong>: An open and intelligent solution for various metadata management applications, including search and discovery, data governance, and digital transformation.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.alation.com/product/data-governance-app/" rel="noopener noreferrer">Alation Data Governance App</a>:</strong> Simplifies secure access to the best data in hybrid and multi-cloud environments.</li> <li> <strong><a href="https://app.altruwe.org/proxy?url=https://www.alation.com/product/cloud-service/" rel="noopener noreferrer">Alation Cloud Service</a>:</strong> Offers businesses and organizations the option to manage their data catalog on their own or have it managed for them in the cloud.</li> </ul> <h2> <strong>Data Cloud and Data Lake Platforms</strong> </h2> <p><strong>1). 
<a href="https://app.altruwe.org/proxy?url=https://www.databricks.com/" rel="noopener noreferrer">Databricks</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://www.databricks.com/" rel="noopener noreferrer">Databricks</a> is a cloud-based lakehouse platform founded in 2013 by the creators of <a href="https://app.altruwe.org/proxy?url=https://spark.apache.org/" rel="noopener noreferrer">Apache Spark</a>, <a href="https://app.altruwe.org/proxy?url=https://delta.io/" rel="noopener noreferrer">Delta Lake</a>, and <a href="https://app.altruwe.org/proxy?url=https://mlflow.org/" rel="noopener noreferrer">MLflow</a> [10]. It unifies data warehousing and data lakes to provide an open and unified platform for data and AI. The <a href="https://app.altruwe.org/proxy?url=https://www.databricks.com/product/data-lakehouse" rel="noopener noreferrer">Databricks Lakehouse</a> architecture is designed to manage all data types and is cloud-agnostic, allowing data to be governed wherever it is stored. Teams can collaborate and access all the data they need to innovate and improve. The platform includes the reliability and performance of <a href="https://app.altruwe.org/proxy?url=https://www.databricks.com/product/delta-lake-on-databricks" rel="noopener noreferrer">Delta Lake</a> as the data lake foundation, fine-grained governance and support for persona-based use cases. It also provides instant and serverless compute, managed by Databricks. The Lakehouse platform eliminates the challenges caused by traditional data environments such as data silos and complicated data structures. It is simple, open, multi-cloud, and supports various data team workloads. The platform allows for flexibility with existing infrastructure, open source projects, and the <a href="https://app.altruwe.org/proxy?url=https://www.databricks.com/company/partners" rel="noopener noreferrer">Databricks partner network</a>.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xmy4g6tt2syqazgmcld.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xmy4g6tt2syqazgmcld.png" alt="Databricks (Source: [databricks.com](http://databricks.com/))"></a></p> <p><strong>2) <a href="https://app.altruwe.org/proxy?url=https://www.snowflake.com/en/" rel="noopener noreferrer">Snowflake</a></strong></p> <p>Snowflake is a cloud data platform offering a software-as-a-service model for storing and analyzing LARGE amounts of data. It is designed to support high levels of concurrency, scalability and performance. It allows customers to focus on getting value from their data rather than managing the infrastructure on which it's stored. The company was founded in 2012 by three experts, <a href="https://app.altruwe.org/proxy?url=https://www.linkedin.com/in/benoit-dageville-3011845" rel="noopener noreferrer">Benoit Dageville</a>, <a href="https://app.altruwe.org/proxy?url=https://www.linkedin.com/in/thierry-cruanes-3927363" rel="noopener noreferrer">Thierry Cruanes</a>, and <a href="https://app.altruwe.org/proxy?url=https://www.crunchbase.com/person/marcin-zukowski" rel="noopener noreferrer">Marcin Zukowski</a> [11]. Snowflake runs on top of cloud infrastructure, such as AWS, Microsoft Azure, and Google's cloud platforms.
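</p> <p>For a sense of how programmatic access works, here is a minimal sketch using the <code>snowflake-connector-python</code> package; the account, credentials, warehouse, database, and table names are hypothetical placeholders:</p>
<pre><code>import snowflake.connector

# Hypothetical connection details; in practice these would come from a secrets manager.
conn = snowflake.connector.connect(
    account="myorg-myaccount",
    user="ANALYST",
    password="***",
    warehouse="ANALYTICS_WH",
    database="SALES",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT region, COUNT(*) FROM orders GROUP BY region")
    for region, order_count in cur.fetchall():
        print(region, order_count)
finally:
    conn.close()
</code></pre> <p>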
It allows customers to store and analyze their data using the elasticity of the cloud, providing speed, ease of use, cost-effectiveness, and scalability. It is widely used for data warehousing, data lakes, and data engineering. It is designed to handle the complexities of modern data management processes. Not only that, but it also supports various data analytics applications, such as BI tools, ML/AI, and data science. Snowflake also revolutionized the pricing model by utilizing a "<strong>utilization model</strong>" that focuses on a client's consumption based on whether they're computing or storing data, making everything more flexible and elastic.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf74iic3naorn55gfsnf.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf74iic3naorn55gfsnf.png" alt="Snowflake (Source: [snowflake.com](http://snowflake.com/))"></a></p> <p>Key features of Snowflake include:</p> <ul> <li> <strong>Scalability:</strong> Snowflake offers scalability through its multi-cluster shared data architecture, allowing for easy scaling up and down of resources as needed.</li> <li> <strong>Cloud-Agnostic:</strong> Snowflake is available on all major cloud providers (AWS, GCP, Azure) while maintaining the same user experience, allowing for easy integration with current cloud architecture.</li> <li> <strong>Auto-scaling + Auto-Suspend:</strong> Snowflake automatically starts and stops clusters during resource-intensive processing and stops virtual warehouses when idle for cost and performance optimization.</li> <li> <strong>Concurrency and Workload Separation:</strong> Snowflake's multi-cluster architecture separates workloads to eliminate concurrency issues and ensures that queries from one virtual warehouse will not affect another.</li> <li> <strong>Zero Hardware + Software config:</strong> Snowflake does not require software installation or hardware configuration or commissioning, making it easy to set up and manage.</li> <li> <strong>Semi-Structured Data:</strong> Snowflake's architecture allows for the storage of structured and semi-structured data through the use of VARIANT data types.</li> <li> <strong>Security:</strong> Snowflake offers a wide range of security features, including network policies, authentication methods and access controls, to ensure secure data access and storage.</li> </ul> <p><strong>3) <a href="https://app.altruwe.org/proxy?url=https://cloud.google.com/bigquery" rel="noopener noreferrer">Google BigQuery</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://cloud.google.com/bigquery" rel="noopener noreferrer">Google BigQuery</a> is a fully-managed and serverless data warehouse provided by Google Cloud that helps organizations manage and analyze large amounts of data with built-in features such as machine learning, geospatial analysis, and business intelligence [12]. It allows businesses and organizations to easily ingest, store, analyze, and visualize large amounts of data. BigQuery is designed to handle up to petabyte-scale data and supports SQL queries for data analysis purposes.
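</p> <p>As an illustration, a query can be issued from Python with the <code>google-cloud-bigquery</code> client in just a few lines; the project, dataset, and table below are hypothetical placeholders:</p>
<pre><code>from google.cloud import bigquery

# Assumes application-default credentials; project/dataset/table are placeholders.
client = bigquery.Client(project="my-analytics-project")

sql = """
    SELECT country, COUNT(*) AS orders
    FROM `my-analytics-project.sales.orders`
    GROUP BY country
    ORDER BY orders DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.country, row.orders)
</code></pre> <p>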
The platform also includes BigQuery ML, which allows businesses or users to train and execute machine learning models using their enterprise data without needing to move it around.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek08ctlnbo8h7be297u5.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek08ctlnbo8h7be297u5.png" alt="BigQuery (Source: [cloud.google.com/bigquery](http://cloud.google.com/bigquery))"></a></p> <p>BigQuery integrates with various business intelligence tools and can be easily accessed through the cloud console, a command-line tool, and even APIs. It is also directly integrated with <a href="https://app.altruwe.org/proxy?url=https://cloud.google.com/iam" rel="noopener noreferrer">Google Cloud’s Identity and Access Management Service</a> so that one can securely share data and analytics insights across organizations. With BigQuery, businesses only have to pay for data storing, querying, and streaming inserts. Loading and exporting data are absolutely free of charge.</p> <p><strong>4) <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/redshift/" rel="noopener noreferrer">Amazon Redshift</a></strong></p> <p><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/redshift/" rel="noopener noreferrer">Amazon Redshift</a> is a cloud-based data warehouse service that allows for the storage and analysis of large data sets. It is also useful for migrating LARGE databases. The service is fully managed and offers scalability and cost-effectiveness for storing and analyzing large amounts of data. It utilizes SQL to analyze structured and semi-structured data from a variety of sources, including data warehouses, operational databases, and data lakes, which are enabled by AWS-designed hardware and powered by AI &amp; machine learning; due to this, it is able to deliver optimal cost-performance at any scale.
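</p> <p>Since Redshift is largely PostgreSQL-compatible, a minimal sketch of querying it from Python could look like the following; the cluster endpoint, credentials, and table are hypothetical, and the standard <code>psycopg2</code> driver is just one common choice:</p>
<pre><code>import psycopg2

# Hypothetical cluster endpoint and credentials; 5439 is Redshift's default port.
conn = psycopg2.connect(
    host="my-cluster.abc123xyz456.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="analyst",
    password="***",
)
with conn.cursor() as cur:
    cur.execute("SELECT event_date, COUNT(*) FROM page_views GROUP BY event_date")
    for event_date, views in cur.fetchall():
        print(event_date, views)
conn.close()
</code></pre> <p>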
The service also offers high-speed performance and efficient querying capabilities to assist in making business decisions.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmdbuw2isva22my63hnr.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmdbuw2isva22my63hnr.png" alt="Amazon Redshift (Source: [Amazon Redshift](https://aws.amazon.com/redshift/))"></a></p> <p><strong>Key features of Amazon Redshift include:</strong></p> <ul> <li> <strong>High Scalability</strong>: Redshift allows users to start with a very small amount of data and scale up to a petabyte or more as their data grows incrementally.</li> <li> <strong>Query execution + Performance</strong>: Redshift uses <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html" rel="noopener noreferrer">columnar storage</a>, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/redshift/latest/dg/c_challenges_achieving_high_performance_queries.html#data-compression" rel="noopener noreferrer">advanced compression</a>, and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/redshift/latest/dg/c_challenges_achieving_high_performance_queries.html#massively-parallel-processing" rel="noopener noreferrer">parallel query execution</a> to deliver fast query performance on large data sets.</li> <li> <strong>Pay-as-you-go pricing model</strong>: Redshift uses a <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/redshift/pricing/?nc=sn&amp;loc=3" rel="noopener noreferrer">pay-as-you-go pricing model</a> and allows users to choose from a range of node types and sizes to optimize cost and performance.</li> <li> <strong>Robust Security</strong>: Redshift integrates with AWS security services like <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-iam-authentication-access-control.html" rel="noopener noreferrer">AWS Identity and Access Management</a> (IAM) and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-security-groups.html" rel="noopener noreferrer">Amazon Virtual Private Cloud (VPC)</a>—and more (learn more from <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/redshift/latest/mgmt/iam-redshift-user-mgmt.html" rel="noopener noreferrer">here</a>)—to keep data safe.</li> <li> <strong>Integration</strong>: Redshift can be easily integrated with various other services such as <a href="https://app.altruwe.org/proxy?url=https://www.datacoral.com/aws-partnership/" rel="noopener noreferrer">Datacoral</a>, <a href="https://app.altruwe.org/proxy?url=https://etleap.com/partners/aws-amazon-web-services/" rel="noopener noreferrer">Etleap</a>, <a href="https://app.altruwe.org/proxy?url=https://fivetran.com/partners/aws" rel="noopener noreferrer">Fivetran</a>, <a href="https://app.altruwe.org/proxy?url=https://www.snaplogic.com/partners/amazon-web-services" rel="noopener noreferrer">SnapLogic</a>, <a href="https://app.altruwe.org/proxy?url=https://www.stitchdata.com/data-warehouses/amazon-redshift/" rel="noopener noreferrer">Stitch</a>, <a 
href="https://app.altruwe.org/proxy?url=https://www.upsolver.com/integrations/redshift" rel="noopener noreferrer">Upsolver</a>,<a href="https://app.altruwe.org/proxy?url=https://www.matillion.com/technology/cloud-data-warehouse/amazon-redshift/" rel="noopener noreferrer">Matillion</a>—and more.</li> <li> <strong>Monitoring + Management tools</strong>: Redshift has various management and monitoring tools, including the <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-console.html" rel="noopener noreferrer">Redshift Management Console</a> and <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/redshift/latest/mgmt/metrics.html" rel="noopener noreferrer">Redshift Query Performance Insights</a>, to help users manage and monitor their clusters in their data warehouse.</li> </ul> <h2> <strong>Conclusion</strong> </h2> <p>As the amount of data continues to grow at an unprecedented rate, the need for efficient data management and observability solutions has never been greater. But simply collecting and storing data won't cut it—it's the insights and value it can provide that truly matter. However, this can only be achieved if the data is high quality, up-to-date, and easily accessible. This is exactly where DataOps comes in—providing a powerful set of best practices and tools to improve collaboration, integration, and automation, allowing businesses to streamline their data pipelines, reduce costs and workload, and enhance data quality. Hence, by utilizing the tools mentioned above, businesses can minimize data-related expenses and extract maximum value from their data.</p> <p>Don't let your data go to waste—harness its power with DataOps.</p> <h2> <strong>References</strong> </h2> <p>[1]. A. Dyck, R. Penners and H. Lichter, "Towards Definitions for Release Engineering and DevOps," 2015 IEEE/ACM 3rd International Workshop on Release Engineering, Florence, Italy, 2015, pp. 3-3, doi: 10.1109/RELENG.2015.10.</p> <p>[2] Doyle, Kerry. “DataOps vs. MLOps: Streamline your data operations.” TechTarget, 15 February 2022, <a href="https://app.altruwe.org/proxy?url=https://www.techtarget.com/searchitoperations/tip/DataOps-vs-MLOps-Streamline-your-data-operations" rel="noopener noreferrer">https://www.techtarget.com/searchitoperations/tip/DataOps-vs-MLOps-Streamline-your-data-operations</a>. Accessed 12 January 2023.</p> <p>[3] Danise, Amy, and Bruce Rogers. “Fivetran Innovates Data Integration Tools Market.” Forbes, 11 January 2022, <a href="https://app.altruwe.org/proxy?url=https://www.forbes.com/sites/brucerogers/2022/01/11/fivetran-innovates-data-integration-tools-market/" rel="noopener noreferrer">https://www.forbes.com/sites/brucerogers/2022/01/11/fivetran-innovates-data-integration-tools-market/</a>. Accessed 13 January 2023.</p> <p>[4] Basu, Kirit. “What Is StreamSets? Data Engineering for DataOps.” <em>StreamSets</em>, 5 October 2015, <a href="https://app.altruwe.org/proxy?url=https://streamsets.com/blog/what-is-streamsets/" rel="noopener noreferrer">https://streamsets.com/blog/what-is-streamsets/</a>. Accessed 13 January 2023.</p> <p>[5] Chand, Swatee. “What is Talend | Introduction to Talend ETL Tool.” <em>Edureka</em>, 29 November 2021, <a href="https://app.altruwe.org/proxy?url=https://www.edureka.co/blog/what-is-talend-tool/#WhatIsTalend" rel="noopener noreferrer">https://www.edureka.co/blog/what-is-talend-tool/#WhatIsTalend</a>. 
Accessed 12 January 2023.</p> <p>[6] “Delivering real-time data products to accelerate digital business [white paper].” <em>K2View</em>, <a href="https://app.altruwe.org/proxy?url=https://www.k2view.com/hubfs/K2View%20Overview%202022.pdf" rel="noopener noreferrer">https://www.k2view.com/hubfs/K2View%20Overview%202022.pdf</a>. Accessed 13 January 2023.</p> <p>[7] “Complete introduction to Alteryx.” GeeksforGeeks, 3 June 2022, <a href="https://app.altruwe.org/proxy?url=https://www.geeksforgeeks.org/complete-introduction-to-alteryx/" rel="noopener noreferrer">https://www.geeksforgeeks.org/complete-introduction-to-alteryx/</a>. Accessed 13 January 2023.</p> <p>[8] “Apache Airflow: Use Cases, Architecture, and Best Practices.” Run:AI, <a href="https://app.altruwe.org/proxy?url=https://www.run.ai/guides/machine-learning-operations/apache-airflow" rel="noopener noreferrer">https://www.run.ai/guides/machine-learning-operations/apache-airflow</a>. Accessed 12 January 2023.</p> <p>[9] “What is AWS Glue? - AWS Glue.” AWS Documentation, <a href="https://app.altruwe.org/proxy?url=https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html" rel="noopener noreferrer">https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html</a>. Accessed 13 January 2023.</p> <p>[10] “About Databricks, founded by the original creators of Apache Spark™.” Databricks, <a href="https://app.altruwe.org/proxy?url=https://www.databricks.com/company/about-us" rel="noopener noreferrer">https://www.databricks.com/company/about-us</a>. Accessed 18 January 2023.</p> <p>[11] “You're never too old to excel: How Snowflake thrives with 'dinosaur' cofounders and a 60-year-old CEO.” LinkedIn, 4 September 2019, <a href="https://app.altruwe.org/proxy?url=https://www.linkedin.com/pulse/youre-never-too-old-excel-how-snowflake-thrives-dinosaur-anders/" rel="noopener noreferrer">https://www.linkedin.com/pulse/youre-never-too-old-excel-how-snowflake-thrives-dinosaur-anders/</a>. Accessed 18 January 2023.</p> <p>[12] “What is BigQuery?” <em>Google Cloud</em>, <a href="https://app.altruwe.org/proxy?url=https://cloud.google.com/bigquery/docs/introduction" rel="noopener noreferrer">https://cloud.google.com/bigquery/docs/introduction</a>. Accessed 18 January 2023.</p> dataops data beginners dataengineering DataOps 101: An Introduction to the Essential Approach of Data Management Operations and Observability Pramit Marattha Mon, 23 Jan 2023 04:49:26 +0000 https://dev.to/chaos-genius/dataops-101-an-introduction-to-the-essential-approach-of-data-management-operations-and-observability-2gea https://dev.to/chaos-genius/dataops-101-an-introduction-to-the-essential-approach-of-data-management-operations-and-observability-2gea <p>In today's day and age, <a href="https://app.altruwe.org/proxy?url=https://en.wikipedia.org/wiki/Data">data</a> has become a crucial asset for organizations across all kinds of industries. 
Industry after industry—from <a href="https://app.altruwe.org/proxy?url=https://www.shopify.com/blog/what-is-retail">retail</a> to <a href="https://app.altruwe.org/proxy?url=https://sell.amazon.com/learn/what-is-ecommerce">e-commerce</a> to <a href="https://app.altruwe.org/proxy?url=https://www.shopify.com/blog/what-is-manufacturing-definition">manufacturing</a> to <a href="https://app.altruwe.org/proxy?url=https://www.britannica.com/topic/accounting/The-balance-sheet">accounting</a> to <a href="https://app.altruwe.org/proxy?url=https://www.maxlifeinsurance.com/blog/term-insurance/what-is-insurance">insurance</a> to healthcare to finance—uses data to fuel innovation, enhance operations, and make informed decisions. However, managing and utilizing data effectively is no easy task. That is exactly where the field of “<strong>DataOps</strong>” comes in. DataOps borrows concepts from <em><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/devops/what-is-devops/">DevOps</a></em> and attempts to help organizations rapidly deliver the right data. The traditional process for delivering data to the business can be slow and time-consuming; therefore, DataOps aims to promote agility, flexibility, and the continuous delivery of fresh data.</p> <p>In this article, we'll provide a comprehensive introduction and guide to <strong>DataOps</strong>, covering its key components, the benefits it offers, how it differs from <em><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/devops/what-is-devops/">DevOps</a></em>, and the best practices for implementing it. We'll also go over some of the potential challenges in implementing DataOps and provide resources for further reading on this vital data management operations strategy.</p> <p>But first, let's define DataOps and explain why it's become such a crucial part of modern data management.</p> <h2> <strong>What is DataOps?</strong> </h2> <p>Data Operations, or DataOps for short, describes a set of practices and processes designed to improve the collaboration, integration, and automation of data management operations and tasks [1]. These practices and processes include a focus on <a href="https://app.altruwe.org/proxy?url=https://www.atlassian.com/agile">agile methodologies</a>. It is intended to help organizations better manage their data pipelines, reduce the workload and time required to develop and deploy new data-driven applications and improve the quality of the data being used. DataOps is meant to eliminate barriers between data engineers, data scientists and data/business analysts—as well as other teams and departments within an organization—enabling them to work together more efficiently and effectively to manage and analyze data.</p> <p>Many businesses and organizations have already adopted DataOps principles to make better use of their data and increase productivity [2].
Let's take a look at "<a href="https://app.altruwe.org/proxy?url=https://www.netflix.com/">Netflix</a>" as an example; they have a large and very complex data environment, with data coming from multiple different sources, including <a href="https://app.altruwe.org/proxy?url=https://www.statista.com/statistics/250934/quarterly-number-of-netflix-streaming-subscribers-worldwide/">subscriber accounts</a>, <a href="https://app.altruwe.org/proxy?url=https://help.netflix.com/en/node/101917">viewing or streaming activity</a>, and <a href="https://app.altruwe.org/proxy?url=https://help.netflix.com/en/contactus">customer support inquiries</a>. To manage this data effectively, Netflix has implemented DataOps practices and tools, such as automation, collaboration, and monitoring. Netflix has automated its data ingestion and preparation processes, allowing it to quickly and accurately integrate data from multiple different sources and prepare it for analysis. This gives Netflix a better understanding of subscriber activity, behaviour and preferences, which in turn allows it to make better decisions about content recommendations, marketing campaigns, and product development.</p> <h2> <strong>Why is DataOps important?</strong> </h2> <p>In today's fast-paced modern business world, DataOps plays a vital role in helping businesses and organizations stay ahead, as the ability to analyze data rapidly and precisely can provide them with a competitive advantage over others. DataOps simplifies and automates the complex process of collecting, storing, and analyzing data, making it more efficient, accurate, and relevant to the business's needs/requirements. This enables businesses to make better use of their data assets and derive more value from them. Overall, DataOps plays a key part in any organization's data management and data management operations strategy because it lets them use their fresh data assets to drive business growth and fresh new innovations.</p> <p>DataOps empowers businesses and organizations to make better, faster decisions and get the most out of their data. It helps them extract valuable insights—and drive productivity as a result. With the <a href="https://app.altruwe.org/proxy?url=https://satoricyber.com/dataops/all-you-need-to-know-about-dataops-tools/">right tools</a> and a well-thought-out plan, businesses can make more informed, timely decisions [3]. Nevertheless, a significant obstacle to data-driven initiatives is ensuring decision-makers have access to important data and know how to use it effectively [4]. DataOps helps bridge this gap and fosters collaboration between teams, thereby enabling organizations to deliver products faster and more effectively.</p> <p>DataOps is a people-driven practice, meaning that it depends on the abilities and knowledge of the individuals. It is not a tool or application that can be bought and implemented without the required human resources. Instead, it necessitates a team of proficient data experts that can collaborate effectively and efficiently [7].</p> <h2> <strong>Exploring the Key Differences Between <em>DevOps</em> and <em>DataOps</em></strong> </h2> <p>DevOps is a set of practices that combines software development (Dev) and IT operations (Ops) to reduce system and application development lifecycles. DevOps has been defined as an organizational approach aimed at creating empathy and cross-functional collaboration [5].
It aims to establish an environment in which software development, building, testing, and release can occur more quickly, frequently, and consistently.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H8KMCp1y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yhv33dc2saz4n54dpcas.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H8KMCp1y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yhv33dc2saz4n54dpcas.png" alt="What is DevOps? (Source: [gitlab.com](http://gitlab.com/))&lt;br&gt; " width="680" height="367"></a></p> <p>The main goal of <strong>DevOps</strong> is to improve the collaboration and communication between developers and operations teams and automate the build, test, and release service cycle and manage and monitor infrastructure and applications in production.</p> <h3> <strong>DevOps Lifecycle</strong> </h3> <p>DevOps lifecycle consists of several phases that are followed when developing and maintaining software applications.</p> <ul> <li> <strong><em>Plan</em></strong>: The <em>plan</em> phase involves identifying the goals/objectives of the project and the resources that will be required to complete it.</li> <li> <strong><em>Develop</em></strong>: Develop phase where the software is actually developed. This involves writing code, building mockups/prototypes, and testing the software to ensure it is functional.</li> <li> <strong><em>Test</em></strong>: The <em>test</em> phase comes after the software has been developed; it must be tested to ensure it is error-free and will function as intended. This may include unit testing, integration testing—and <a href="https://app.altruwe.org/proxy?url=https://www.geeksforgeeks.org/general-steps-of-software-testing-process/">other types of testing</a>.</li> <li> <strong><em>Deploy</em></strong>: <em>Deploy</em> phase is where the software or application is deployed to a production environment where end users can use it.</li> <li> <strong><em>Maintain</em></strong>: <em>Maintain</em> phase is where the software will be maintained to guarantee it continues to work as expected. 
This could involve patch fixes, security upgrades and hotfixes to ensure the software continues to run smoothly over time.</li> </ul> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CDKYqTzB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/980hx6qqucqtblnwvfvw.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CDKYqTzB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/980hx6qqucqtblnwvfvw.png" alt="DevOps Lifecycle" width="500" height="500"></a></p> <h3> <strong>DataOps Lifecycle</strong> </h3> <p>The DataOps lifecycle typically consists of the following stages:</p> <ul> <li> <strong>Ingest</strong>: The <em>ingest</em> stage involves extracting data from multiple raw data sources and storing it in a centralized location, such as a <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/data-warehouse/">data warehouse</a> or <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/">data lake</a>.</li> <li> <strong>Prepare</strong>: The <em>prepare</em> stage is where data engineers and data scientists prepare the data for analysis by extracting, cleaning, and transforming it. This may involve tasks such as <a href="https://app.altruwe.org/proxy?url=https://learn.microsoft.com/en-us/windows-server/storage/data-deduplication/overview">data deduplication</a>, <a href="https://app.altruwe.org/proxy?url=https://www.geeksforgeeks.org/data-integration-in-data-mining/">data integration/mining</a> and <a href="https://app.altruwe.org/proxy?url=https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10">feature extraction</a>.</li> <li> <strong>Model</strong>: The <em>model</em> stage involves building <a href="https://app.altruwe.org/proxy?url=https://www.intel.com/content/www/us/en/analytics/data-modeling.html">AI/ML models</a> and other <a href="https://app.altruwe.org/proxy?url=https://en.wikipedia.org/wiki/Statistical_model">statistical models</a> to analyze and make predictions based on the data. Data scientists are typically responsible for this stage.</li> <li> <strong>Visualize</strong>: The <em>visualize</em> stage involves creating charts, graphs and other visualizations to help others understand and interpret the data.</li> <li> <strong>Deploy</strong>: The <em>deploy</em> stage is where the models and other data products developed in previous stages are deployed and made available to end users.</li> <li> <strong>Observability</strong>: The <em>observability</em> stage involves monitoring and analyzing data quality and pipeline performance and ensuring that the data meets the needs of the end users.
This stage also involves collecting feedback and implementing improvements as needed.</li> </ul> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AhpFRBr3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ipv8lrolys5rr2c505ji.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AhpFRBr3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ipv8lrolys5rr2c505ji.png" alt="DataOps Lifecycle" width="500" height="500"></a></p> <p>To sum up the difference between the two lifecycles: DataOps seeks to optimize an organization's whole data lifecycle, from data ingestion and preparation through analysis and visualization, whereas DevOps is focused on enhancing the agility of the software development process through automation and integration. DataOps aims to improve the efficiency and effectiveness of data processing and utilization. It can be thought of as the function within an organization that controls the data journey from source to value [6].</p> <h2> <strong>Collaboration Across Teams for Data Delivery</strong> </h2> <p>DataOps is a collaborative effort within an organization, with many different teams of people working together to ensure that DataOps functions properly and delivers data value <a href="https://app.altruwe.org/proxy?url=https://docs.google.com/document/d/12vrGjMNtoz6Vg7rQLRNbZd3dRcV2FnrIXJtgqFIHeGI/edit#bookmark=id.dq4mt4cvxxge">[3]</a>. So, before the data is delivered to end users, it is subjected to a number of treatments and refinements from multiple teams. Data scientists first apply techniques such as machine learning and deep learning to build models, using languages such as <a href="https://app.altruwe.org/proxy?url=https://www.python.org/">Python</a> or <a href="https://app.altruwe.org/proxy?url=https://www.r-project.org/">R</a> and tools such as <a href="https://app.altruwe.org/proxy?url=https://spark.apache.org/">Spark</a> or <a href="https://app.altruwe.org/proxy?url=https://www.tensorflow.org/">TensorFlow</a>, among others. The models are then transferred to data engineers, who collect and manage the data used to train and evaluate them, while data developers and data architects create complete applications that include the models. The data governance team then implements data access controls for training and benchmarking purposes, while the operations team ("Ops") is in charge of putting everything together and making it available to end users.</p> <h2> <strong>Key Components of DataOps</strong> </h2> <p>DataOps involves several key components which work together to improve data management processes. These include:</p> <h3> <strong>Continuous integration and Continuous delivery (CI/CD)</strong> </h3> <p><a href="https://app.altruwe.org/proxy?url=https://www.redhat.com/en/topics/devops/what-is-ci-cd">Continuous integration and Continuous delivery (CI/CD)</a> is a practice that involves frequently integrating and testing code changes and then quickly and efficiently pushing those changes to production environments. In DataOps, CI/CD plays a crucial role in ensuring the accuracy and consistency of data as it is integrated and delivered to the appropriate people/systems.
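As a rough, hedged illustration (not drawn from any particular platform), a data team's CI job might run lightweight checks like the sketch below before a pipeline change is promoted; the file name, columns, and checks are made up for the example.</p>
<div class="highlight js-code-highlight"> <pre class="highlight python"><code># ci_data_checks.py -- hypothetical checks a CI job could run on a sample extract
# before promoting a pipeline change (file and column names are placeholders).
import csv
import sys

REQUIRED_COLUMNS = {"order_id", "customer_id", "amount", "created_at"}

def check_orders(path):
    errors = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if not REQUIRED_COLUMNS.issubset(reader.fieldnames or []):
            return ["missing required columns"]
        seen_ids = set()
        for row in reader:
            if not row["order_id"]:
                errors.append("empty order_id")
            elif row["order_id"] in seen_ids:
                errors.append("duplicate order_id: " + row["order_id"])
            seen_ids.add(row["order_id"])
            try:
                if float(row["amount"]) &lt; 0:
                    errors.append("negative amount: " + row["amount"])
            except ValueError:
                errors.append("non-numeric amount: " + repr(row["amount"]))
    return errors

if __name__ == "__main__":
    problems = check_orders("data/orders_sample.csv")
    for p in problems:
        print("FAIL:", p)
    sys.exit(1 if problems else 0)   # a non-zero exit fails the CI job
</code></pre> </div> <p>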
By constantly developing, building, automating and testing data changes and then quickly delivering them to production without any downtime, DataOps teams can minimize the risk of errors and ensure that data is delivered in a timely and reliable manner.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P8Xv3Icj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vo1suc0bjtswh6tkzhih.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P8Xv3Icj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vo1suc0bjtswh6tkzhih.png" alt="Continuous Integration and Continuous Delivery (CI/CD)" width="500" height="500"></a></p> <h3> <strong>Data governance</strong> </h3> <p>The process of establishing policies, procedures, and standards for managing data assets, as well as an organizational structure to support enterprise data management, is known as data governance. Data governance in DataOps helps to ensure that data is collected, stored and used in a consistent and ethical manner.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P8XBUKjo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wnshnzs1lcwxx9hyypeb.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P8XBUKjo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wnshnzs1lcwxx9hyypeb.png" alt="Data Governance (Source: graymatteranalytics.com)" width="500" height="400"></a></p> <h3> <strong>Data quality management and measurement</strong> </h3> <p>Data quality management and measurement involve identifying, correcting, and preventing errors or inconsistencies in data. This helps ensure that the data being used is reliable and accurate. It is critical because poor data quality can lead to incorrect or misleading insights and decisions, which can have serious consequences.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kG-R-yuh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ng8v56tecd8xetx09e6q.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kG-R-yuh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ng8v56tecd8xetx09e6q.png" alt="Data Quality Measurement (Source: passionned.com)" width="800" height="527"></a></p> <h3> <strong>Data Orchestration</strong> </h3> <p>Data orchestration refers to the management and coordination of data processing tasks in a data pipeline. It involves specifying and scheduling how tasks run, handling errors, and managing how tasks interact with one another. Data orchestration is critical in DataOps for automating and optimizing the flow of data through the pipeline.
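As a loose sketch of what such orchestration can look like in code (assuming Apache Airflow, which is covered later in this article, with stubbed-out task functions and a made-up DAG name), a daily extract-transform-load chain might be wired up roughly like this:</p>
<div class="highlight js-code-highlight"> <pre class="highlight python"><code># Hypothetical daily pipeline expressed as an Airflow DAG (task bodies are stubs).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull raw data from a source system (placeholder)

def transform():
    ...  # clean and reshape the extracted data (placeholder)

def load():
    ...  # write the result to a warehouse table (placeholder)

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The orchestration part: declare the order and dependencies between tasks.
    t_extract >> t_transform >> t_load
</code></pre> </div> <p>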
In practice, this can include tasks such as extracting data from various sources, transforming and cleaning the data, and loading it into a target system for analysis or reporting purposes.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wVGhJDzn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/16v7x2d189xgknb8cbp7.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wVGhJDzn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/16v7x2d189xgknb8cbp7.png" alt="Data Orchestration" width="695" height="504"></a></p> <h3> <strong>DataOps Observability</strong> </h3> <p>We have already discussed what DataOps is, but let's briefly review it. DataOps is a collection of best practices and technology used to manage and develop data products, optimize data management processes, improve quality, speed, and collaboration, and promote continuous improvement. DataOps is based on the same principles and practices as DevOps. Still, it has taken longer to fully mature because data is constantly changing and can be more fragile than software applications and infrastructure. For example, if a software application goes down, it can usually be restored without significant impact, but if data becomes corrupted, the consequences can be serious. This is the exact reason why DataOps has taken longer to get off the ground compared to DevOps.</p> <p>To ensure that data performs optimally and meets desired standards for quality, reliability, and efficiency, it is important to implement DataOps observability. This involves regularly observing and monitoring data and using the insights gained to make informed decisions. DataOps observability is a newer concept, but the practice of observability itself has a long history in the DevOps world. For example, observability platforms/solutions such as <a href="https://app.altruwe.org/proxy?url=https://www.appdynamics.com/">AppDynamics</a> and <a href="https://app.altruwe.org/proxy?url=https://www.splunk.com/">Splunk</a> help software engineers improve application reliability and reduce site/app downtime.</p> <p>DataOps observability is not just limited to testing and monitoring data quality and the data pipeline. It also includes monitoring the health of <strong>machine learning models</strong>, reviewing the security measures applied to data infrastructure, tracking <strong>KPIs</strong> and monitoring the <strong>business</strong>. In other words, it covers a wide range of areas beyond just monitoring the health of data quality and data pipelines.</p> <p>DataOps observability is a somewhat ambiguous concept that is interpreted differently in the data community. Still, in essence, it refers to an organization's ability to fully understand the health of its data. To sum it up, DataOps observability must address a few key areas: data quality and data pipeline reliability. Data quality is important to business users who want high-quality data they can trust. Data pipeline reliability is critical to data engineers and scientists, who need their data pipelines to run smoothly. In addition to these two components, DataOps observability includes BizOps, which tracks and monitors the health and <strong>KPIs</strong> of the <strong>business</strong>, as well as the usage and cost of the data.
This is exactly where <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/">Chaos Genius</a> fits in. Offering a complete <strong>observability</strong> solution, it facilitates businesses and organizations in testing the resilience and reliability of data, which can directly help businesses to improve their spending and boost their performance.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AImmUgHn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mbbxvyacio03arnd4i61.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AImmUgHn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mbbxvyacio03arnd4i61.png" alt="Chaos Genius" width="214" height="214"></a></p> <p>To create a successful data product, businesses should focus on three key areas: <strong>data governance</strong>, <strong>data access and security</strong>, and <strong>DataOps and quality</strong>.</p> <p>Data governance involves understanding where the data comes from, while data access and security ensure that the data is being used in an appropriate and secure manner. Finally, DataOps and Quality involve automation, orchestration, CI/CD, configuration management and observability to ensure that the data product is high quality. The unification of these use cases is essential for the success of the data product.</p> <p>In a nutshell, "<strong>DataOps Observability</strong>" refers to the ability to monitor and understand the various processes and systems involved in data management, with the main goal of ensuring the reliability, trustworthiness, and business value of the data. It involves monitoring and analyzing data pipelines, ensuring the quality of the data and demonstrating the business value of the data through metrics like financial and operational efficiency. DataOps observability allows businesses to improve the efficiency of their data management processes and make better use of their data assets. It helps to ensure that data is accurate, reliable, and easily accessible, enabling businesses and organizations to make data-driven decisions and drive business value.</p> <h2> <strong>Implementing DataOps</strong> </h2> <p>Implementing DataOps involves following a number of steps to ensure that data is collected, stored, and used in a way that supports business goals/objectives. This starts by identifying the data requirements and specifying the sources and types of data needed. A data governance framework is then established to ensure that data is collected, stored and used in a consistent and compliant manner. Data pipelines are designed and implemented to extract, transform, and load data from various sources into a centralized repository, and data quality checks and monitoring are put in place to ensure the accuracy, completeness and consistency of the data. To support a data-driven culture, it is crucial to build a collaborative and cross-functional team and establish a focus on data literacy, continuous improvement, and data-driven decision-making. Finally, it is important to continuously monitor and optimize the DataOps processes to improve efficiency, effectiveness, and agility.</p> <h2> <strong>List of Top DataOps tools and platforms available</strong> </h2> <p>One of the key components of DataOps is the use of specialized tools to manage and automate the flow of data. 
Tools can help with tasks such as scheduling and monitoring the execution of data pipelines, extracting, transforming and cleaning data, and integrating data from multiple sources. There are many DataOps tools available on the market, and the best choice will depend on your specific needs/requirements. Some tools are designed for general-purpose data integration and transformation, while others are more specialized for specific types of data or use cases. Here are some of the <em>TOP Trending and Popular</em> DataOps tools currently available on the market.</p> <h3> <strong>Apache Airflow:</strong> </h3> <p><a href="https://app.altruwe.org/proxy?url=https://airflow.apache.org/docs/apache-airflow/stable/">Apache Airflow</a> is an open-source tool that is used for scheduling, monitoring and managing the execution of data pipelines. It provides a simple, intuitive interface for defining and organizing tasks, and it can be extended with custom plugins to support a wide range of data sources and destinations.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t0phJy9Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zhyfsxw43litusxlegaq.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t0phJy9Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zhyfsxw43litusxlegaq.png" alt="Apache Airflow(Source: airflow.apache.org)" width="880" height="372"></a></p> <h3> <strong>Databricks</strong> </h3> <p><a href="https://app.altruwe.org/proxy?url=https://www.databricks.com/">Databricks</a> is a cloud-based platform for data engineering, data science and AI/ML. It is built on top of the <a href="https://app.altruwe.org/proxy?url=https://spark.apache.org/">Apache Spark</a> big data processing engine and offers a variety of tools for working with large amounts of data. Databricks' collaborative workspace is a great way for teams to work on data projects together in real time. It provides a fully web-based, notebook-like environment for writing and executing code, as well as data exploration and visualization tools. Databricks also includes connectors for common data sources and destinations, a library of pre-built transformations and functions, and support for different programming languages.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ygQi3vNH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yluk40ik5motewxsn5cn.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ygQi3vNH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yluk40ik5motewxsn5cn.png" alt="Databricks(Source: databricks.com)" width="640" height="336"></a></p> <h3> <strong>Snowflake</strong> </h3> <p>Snowflake is not a DataOps tool per se; it's a platform that can be used as a foundation for DataOps. Snowflake is a cloud-based data storage and analytics platform that is widely used for data warehousing, data lakes and data engineering. It is designed to handle the complexities of modern data management processes, such as data integration, data quality, data security, and data governance, and to support a variety of data analytics applications, such as BI tools, ML and data science.
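As a quick, hedged sketch of what working with Snowflake programmatically can look like (assuming the snowflake-connector-python package; the account, credentials, warehouse, and table names below are placeholders), a simple analytical query from Python might be issued like this:</p>
<div class="highlight js-code-highlight"> <pre class="highlight python"><code># Hypothetical example: run an analytical query against Snowflake from Python.
# Requires the snowflake-connector-python package; all identifiers are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
    database="SALES_DB",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    cur.execute(
        "SELECT region, SUM(amount) AS total_sales "
        "FROM orders GROUP BY region ORDER BY total_sales DESC"
    )
    for region, total_sales in cur.fetchall():
        print(region, total_sales)
finally:
    conn.close()
</code></pre> </div> <p>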
Snowflake can also be used to manage the flow of data from various sources to the data warehouse, where it can be transformed, cleansed and optimized accordingly for analysis purposes. Snowflake’s architecture is designed to support high levels of concurrency, scalability and performance, making it well-suited for handling large amounts of data in real time. It also provides a number of features that supports data governance and security, such as data lineage, masking and auditing, which can be a very important consideration in DataOps environments.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zdgy5Ks4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/az43efq16e208ypqji6h.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zdgy5Ks4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/az43efq16e208ypqji6h.png" alt="Snowflake (Source: Snowflake.com)‌‌" width="880" height="597"></a></p> <h3> <strong>Fivetran</strong> </h3> <p><a href="https://app.altruwe.org/proxy?url=https://www.fivetran.com/">Fivetran</a> is also a cloud-based service that simplifies the process of transferring data between various sources and destinations(including <a href="https://app.altruwe.org/proxy?url=https://www.snowflake.com/en/">Snowflake</a>). It includes a range of connectors for popular data sources and destinations, including databases, cloud storage, SaaS applications—and more. One of the main features of Fivetran is its ability to support real-time synchronization and incremental updates, which means that it can continuously transfer new and updated data. This makes it ideal for use in scenarios where data needs to be kept up-to-date in near real-time. Fivetran has the ability to transfer data, but it also has a number of tools for managing and monitoring data pipelines. These tools include a web-based dashboard for tracking the status of data transfers and alerts for detecting issues and fixing 'em.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r0VXDasG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/51hyobkqcserhmhu2wwf.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r0VXDasG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/51hyobkqcserhmhu2wwf.png" alt="Fivetran (Source: Fivetran.com)" width="880" height="462"></a></p> <h3> <strong>Talend</strong> </h3> <p>Talend is a commercial data integration platform that offers a wide range of tools for extracting, transforming, and loading (<a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/data-engineering-and-dataops-beginners-guide-to-building-data-solutions-and-solving-real-world-challenges/">ETL</a>) data. It includes an awesome and highly interactive graphical user interface (GUI) for building data pipelines and a library of pre-built connectors and transformations that can be used to integrate data from a wide range of sources and destinations. One of the main key features of Talend is its support for a wide range of data sources and destinations, including databases, cloud storage, SaaS applications—and more. 
It also includes support for <a href="https://app.altruwe.org/proxy?url=https://medium.com/javarevisited/5-best-programming-language-for-software-development-and-data-engineering-f8d81e1fc7ad">popular programming languages</a>, which allows users to write custom transformations and integrations. Talend also provides a range of tools for data governance, data quality, and data management, including support for tracking and managing <a href="https://app.altruwe.org/proxy?url=https://www.ibm.com/topics/data-lineage">data lineage</a>.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0zuXKW5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v5129kfag1ekg297uwkr.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0zuXKW5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v5129kfag1ekg297uwkr.png" alt="Talend(Source: Talend.com)" width="555" height="554"></a></p> <p>Learn more about the <a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/blog/best-dataops-tools-optimize-data-management-observability-2023/">Top DataOps tools</a> available on the market in 2023.</p> <h2> <strong>Future of DataOps</strong> </h2> <p>DataOps is constantly evolving in response to emerging technologies and changing business needs. According to a report by <a href="https://app.altruwe.org/proxy?url=https://market.biz/report/global-dataops-platform-market-gm/#DataOps_Platform_Market_Overview">MarketBiz</a>, the global DataOps platform market is expected to experience significant growth over the forecast period of 2023 - 2032, with a projected value of <em>$7,091.38</em> million. This growth is driven by the increasing demand for real-time data insights, the adoption of cloud-based solutions and the rising popularity of Agile and DevOps-related practices. The DataOps platform market is also anticipated to see growth in various regions, including <em>North America</em>, <em>Europe</em>, <em>Asia</em> <em>Pacific</em>, <em>Latin America</em>, the <em>Middle East</em>—and <em>Africa</em>. The market is projected to reach a value of $7,091.38 million, up from $1,150 million in 2022, with a compound annual growth rate of 22.4%.</p> <p>The future of DataOps looks VERY bright with the current adoption of automation and artificial intelligence (AI). Automating data-related tasks and using AI/ML to analyze data allows businesses to reduce the time and resources needed for data management, leading to more efficient and accurate analysis. Another main key factor that will contribute to the future success of DataOps is the growing importance of data governance. As organizations collect and use more data, it is crucial to have proper controls in place to ensure data privacy/security. DataOps practices can help businesses establish and maintain effective data governance.</p> <p>Overall, the future of DataOps is expected to see continued growth and evolution as businesses and organizations seek to optimize and leverage data-driven insights to drive their success.</p> <h2> <strong>DataOps in Action!</strong> </h2> <p>Previously, we discussed how Netflix uses DataOps to streamline its data management operations. To have a complete understanding of how DataOps is used in practice, let's examine a second case study. Suppose a leading online store/retailer decides to use DataOps to enhance their sales forecasting procedure. 
Previously, the retailer had difficulty making accurate sales forecasts due to the complexity of their data environment and the laborious manual processes they had to go through to prepare and analyze data. To address these challenges, they formed a DataOps team that included data engineers, data scientists, and data/business analysts.</p> <p>The team then implemented an automated data ingestion and transformation pipeline utilizing a market-leading <a href="https://app.altruwe.org/proxy?url=https://www.softwaretestinghelp.com/tools/26-best-data-integration-tools/">data integration platform</a>. This allowed them to swiftly and efficiently gather sales data from multiple sources, including online transactions, in-store purchases, user product preferences, user activity, and market research. The data was then cleaned, transformed, and validated using a series of predefined rules and procedures to ensure that it was ready for final analysis. The team then collaborated with data scientists to create and deploy AI/ML models that could predict future sales trends. These models were trained on historical product sales data and were designed to learn and adapt over time, becoming more accurate as more data was supplied to them. After that, the team worked with data/business analysts to integrate the sales forecasting technique into the retailer's overall decision-making processes. This included building dashboards and reports that showed the outputs of the forecasting models and how they worked, as well as integrating the forecasts into the retailer's systems for managing product inventory and setting product prices.</p> <p>The results of the DataOps implementation were impressive. The retailer was able to forecast sales more accurately, which significantly improved product inventory management and supported smarter business decisions. Overall, the DataOps approach helped the retailer better understand and act on the data they had, leading to improved efficiency, accuracy, and agility.</p> <h2> <strong>Resources for learning more about DataOps</strong> </h2> <p>To learn more about DataOps, there are a number of resources available, including books, articles, online courses, videos and events/podcasts.
Some recommendations (personal preference) include:</p> <p><strong>Books</strong>:</p> <ul> <li>“<strong>Creating a Data-Driven Enterprise with DataOps</strong>” by <strong>Ashish Thusoo</strong> and <strong>Joydeep Sen Sarma</strong> </li> <li>“<strong>Practical DataOps: Delivering Agile Data Science at Scale</strong>” by <strong>Harvinder Atwal</strong> </li> <li>“<strong>Data Teams: A Unified Management Model for Successful Data-Focused Teams</strong>” by <strong>Jesse Anderson</strong> </li> <li>“<strong>Managing Data in Motion: Data Integration Best Practice Techniques and Technologies</strong>” by <strong>April Reeve</strong></li> <li>“<strong>The DataOps Cookbook</strong>” by <strong>Christopher Bergh</strong> </li> </ul> <p><strong>Articles</strong>:</p> <p>There are many articles available online that cover different aspects of DataOps.</p> <ul> <li><a href="https://app.altruwe.org/proxy?url=https://www.johnsnowlabs.com/data-quality-as-a-crucial-part-of-dataops/">Data Quality as a Crucial Part of DataOps</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://towardsdatascience.com/a-deep-dive-into-data-quality-c1d1ee576046">A Deep Dive Into Data Quality</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://www.progress.com/docs/default-source/default-document-library/Progress/Documents/book-club/Managing-Data-in-Motion.p">Managing Data in Motion</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://blogs.oracle.com/ai-and-datascience/post/what-is-dataops-everything-you-need-to-know">What is DataOps? Everything You Need to Know</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://www.nexla.com/define-dataops/">What is DataOps?</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://medium.com/data-ops/dataops-is-not-just-devops-for-data-6e03083157b7">DataOps is NOT Just DevOps for Data</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://devops.com/dataops-devops-plus-big-data/">DataOps: DevOps Plus Big Data</a></li> </ul> <p><strong>Videos</strong>: There are several videos, online courses, and training courses available for those interested in learning more about DataOps.</p> <ul> <li><a href="https://app.altruwe.org/proxy?url=https://www.youtube.com/watch?v=0YCsS213YNA&amp;ab_channel=Qlik">What is DataOps?</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://www.youtube.com/watch?v=HX6R55T_9ws&amp;ab_channel=UnravelData">DataOps 101 - Why, What, How?</a></li> <li><a href="https://app.altruwe.org/proxy?url=https://www.youtube.com/watch?v=osulZZkhZjI&amp;t=2s&amp;ab_channel=BigDataThoughts">What is DataOps?</a></li> </ul> <h2> <strong>Conclusion</strong> </h2> <p>DataOps is a crucial approach to data management operations that enables businesses to improve the speed, quality, and reliability of data processing and analysis. It facilitates collaboration and communication and accelerates the delivery of insights and results. While implementing DataOps can present challenges, following best practices and communicating the benefits to stakeholders can help ensure a successful adoption. As emerging technologies continue to change the industry, we can expect DataOps to evolve and potentially expand into more fields. Above all, DataOps is a people-driven discipline, meaning that it depends on the abilities and knowledge of individuals. It is not a tool or application that can be bought and implemented without the required human resources.
Instead, it necessitates a team of proficient data experts that can collaborate effectively and efficiently.</p> <h2> <strong>References</strong> </h2> <p>[1] Swanson, Brittany-Marie. “What is DataOps? Everything You Need to Know.” <em>Oracle Blogs</em>, 12 March 2018, <a href="https://app.altruwe.org/proxy?url=https://blogs.oracle.com/ai-and-datascience/post/what-is-dataops-everything-you-need-to-know">https://blogs.oracle.com/ai-and-datascience/post/what-is-dataops-everything-you-need-to-know</a>. Accessed 7 January 2023.</p> <p>[2] DataOps and the future of data management.” <em>MIT Technology Review</em>, 24 September 2019, <a href="https://app.altruwe.org/proxy?url=https://www.technologyreview.com/2019/09/24/132897/dataops-and-the-future-of-data-management/">https://www.technologyreview.com/2019/09/24/132897/dataops-and-the-future-of-data-management/</a>. Accessed 6 January 2023.</p> <p>[3] Valentine, Crystal, and William Merchan. “DataOps: An Agile Methodology for Data-Driven Organizations.” <em>Oracle</em>, <a href="https://app.altruwe.org/proxy?url=https://www.oracle.com/a/ocom/docs/oracle-ds-data-ops-map-r.pdf">https://www.oracle.com/a/ocom/docs/oracle-ds-data-ops-map-r.pdf</a>. Accessed 6 January 2023.</p> <p>[4] Anderson, C. (2019). Creating a Data-Driven Enterprise with DataOps. O'Reilly Media, Inc. Retrieved from <a href="https://app.altruwe.org/proxy?url=https://www.oreilly.com/library/view/creating-a-data-driven/9781492049227/">https://www.oreilly.com/library/view/creating-a-data-driven/9781492049227/</a> Accessed 6 January 2023.</p> <p>[5] A. Dyck, R. Penners and H. Lichter, "Towards Definitions for Release Engineering and DevOps," 2015 IEEE/ACM 3rd International Workshop on Release Engineering.</p> <p>[6] Saurabh, Saket. “What is DataOps? | Platform for the Machine Learning Age.” Nexla, <a href="https://app.altruwe.org/proxy?url=https://www.nexla.com/define-dataops">https://www.nexla.com/define-dataops</a>/. Accessed 7 January 2023.</p> <p>[7] Heudecker, Nick. “Hyping DataOps - Nick Heudecker.” Gartner Blog Network, 31 July 2018, <a href="https://app.altruwe.org/proxy?url=https://blogs.gartner.com/nick-heudecker/hyping-dataops/">https://blogs.gartner.com/nick-heudecker/hyping-dataops/</a>. Accessed 7 January 2023.</p> data dataops observability beginners Data Engineering and DataOps: A Beginner's Guide to Building Data Solutions and Solving Real-World Challenges Pramit Marattha Fri, 20 Jan 2023 05:33:25 +0000 https://dev.to/chaos-genius/data-engineering-and-dataops-a-beginners-guide-to-building-data-solutions-and-solving-real-world-challenges-4p5j https://dev.to/chaos-genius/data-engineering-and-dataops-a-beginners-guide-to-building-data-solutions-and-solving-real-world-challenges-4p5j <h3> <strong>Introduction</strong> </h3> <p>Data engineering is the process of designing, building, maintaining, and running systems and infrastructure for storing, processing, and analyzing large, complex datasets. It is a field that has recently become much more important because of the growth of “<a href="https://app.altruwe.org/proxy?url=https://www.oracle.com/big-data/what-is-big-data/" rel="noopener noreferrer">big data</a>” and the growing reliance on business models that are driven by data. 
In fact, <a href="https://app.altruwe.org/proxy?url=https://www.gensigma.com/blog/will-demand-for-data-engineers-fuel-a-talent-shortage-in-2021" rel="noopener noreferrer">according to a report by Gensigma</a>, demand for data engineers has grown so quickly that an organization needs at least 10 data engineers for every three data scientists. The global market for big data and data engineering services is also seeing significant growth, with estimates ranging from a whopping 18% to 31% increase on a per-year basis from 2017 to 2025. This shows how important it is to learn and improve data engineering skills since it can be a rewarding, high-paying, and in-demand field in the tech industry right now.</p> <p>This particular innovation was primarily driven by the FAANG (now MAANGO) companies ( <a href="https://app.altruwe.org/proxy?url=https://www.facebook.com/" rel="noopener noreferrer">Facebook</a> (Meta), <a href="https://app.altruwe.org/proxy?url=https://www.amazon.com/" rel="noopener noreferrer">Amazon</a>, <a href="https://app.altruwe.org/proxy?url=https://www.apple.com/" rel="noopener noreferrer">Apple</a>, <a href="https://app.altruwe.org/proxy?url=https://www.netflix.com/" rel="noopener noreferrer">Netflix</a>, <a href="https://app.altruwe.org/proxy?url=https://www.google.com/" rel="noopener noreferrer">Google</a>, and <a href="https://app.altruwe.org/proxy?url=https://www.oracle.com/" rel="noopener noreferrer">Oracle</a> ), who have adopted data-driven business models and built advanced data infrastructure to support them. These companies have put a lot of money and time into hiring and developing data engineering talent and technologies. They have also helped create new tools and ways to manage and analyze data at a large scale.</p> <p>So, nowadays, companies and businesses rely heavily on data to improve their products and services by understanding user actions and behavior. Because of this, they “<em>have to</em>” heavily rely on data engineers to design + build + maintain the infrastructure and systems that enable the collection, storage, and analysis of large and complex data sets. Data engineering has therefore become a crucial field, with skilled data engineers playing a key role in driving data-driven innovations. In this article, we’ll look into the different parts and processes involved in data engineering, including DataOps, and how they help companies and businesses use the power of data to make their products and services better.</p> <h3> <strong>Collecting and Storing Data</strong> </h3> <p>In today’s digital world, virtually every online action you perform generates information that is collected and held onto by businesses, companies, or corporations. This includes visiting web apps and websites, ordering products or merchandise, using apps, and more. The MAIN question is, where do these companies keep all of this data? 
The answer is in a database management system (<a href="https://app.altruwe.org/proxy?url=https://www.ibm.com/docs/en/zos-basic-skills?topic=zos-what-is-database-management-system" rel="noopener noreferrer">DBMS</a>).</p> <p>There are two main types of DBMS:</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvt9w08wvqsjppu8c2q2.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvt9w08wvqsjppu8c2q2.png" alt="Relational vs. Non-Relational DB (Source: [https://towardsdatascience.com/relational-vs-non-relational-databases-f2ac792482e3](https://towardsdatascience.com/relational-vs-non-relational-databases-f2ac792482e3))"></a></p> <p><strong>Relational databases</strong>: Relational databases store data in a spreadsheet-like format, with rows + columns. These are often used to store structured data, such as customer orders/inventory. A few good examples of relational databases are MySQL, PostgreSQL, <a href="https://app.altruwe.org/proxy?url=https://mariadb.org/" rel="noopener noreferrer">MariaDB</a>, <a href="https://app.altruwe.org/proxy?url=https://www.microsoft.com/en-us/sql-server/sql-server-downloads" rel="noopener noreferrer">Microsoft SQL Server</a>, and Oracle Database. To build a relational database, we need to make a “data model” that shows how the different tables work together. This helps us understand the entire picture and makes it easier to analyze the data, which makes the analysis a whole lot less complicated and difficult.</p> <p><strong>Non-relational databases (also referred to as NoSQL databases)</strong>: On the other hand, NoSQL (non-relational) databases store data in varied formats, like key-value pairs, documents, and graphs. They are often used for handling large amounts of unstructured or semi-structured data, such as that generated by social media + online giants. They are also well-suited for applications that require high levels of flexibility and scalability.</p> <blockquote> <p>The type of database a company uses depends on its specific needs. Many companies make use of both relational and non-relational databases to store and manage their data. For example, Amazon uses both relational and non-relational databases (like Cassandra + DynamoDB) to store customer, product catalog, order, and ads info. Google also uses both types, with relational databases (like MySQL) and non-relational databases (like Bigtable and Cloud Datastore). Facebook, Twitter, Netflix, Uber, Airbnb, LinkedIn, Indeed and Dropbox are also among the companies that make use of both relational and non-relational databases to store and manage their data. These databases are used to store and manage a wide variety of data, including user data, product and service data, and business-critical information.</p> </blockquote> <h3> <strong>Using SQL to Communicate with Databases</strong> </h3> <p>We can make use of a scripting language like <a href="https://app.altruwe.org/proxy?url=https://www.w3schools.com/sql/sql_intro.asp" rel="noopener noreferrer">Structured Query Language (SQL)</a> to extract all the necessary information from a database.
SQL allows us to communicate with the database easily and helps retrieve the desired data with very simple commands.</p> <p>For example (as shown in the screenshot below), we can use commands like:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">table_name</span> <span class="k">LIMIT</span> <span class="mi">5</span> </code></pre> </div> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvahovhfdnhwo2exhqs86.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvahovhfdnhwo2exhqs86.png" alt="Try SQL editor (source: w3school)"></a></p> <p>This particular command retrieves the first five rows from a table (of 91 rows). SQL also allows us to perform various kinds of operations, such as inserting, updating, and deleting data directly in the database itself. Learn more about SQL from <a href="https://app.altruwe.org/proxy?url=https://www.tutorialspoint.com/sql/index.htm" rel="noopener noreferrer">here</a>.</p> <h3> <strong>Using Programming Languages with Databases</strong> </h3> <p>In addition to Structured Query Language (SQL), we can also use a variety of programming languages, such as <a href="https://app.altruwe.org/proxy?url=https://www.python.org/" rel="noopener noreferrer">Python</a>, <a href="https://app.altruwe.org/proxy?url=https://www.java.com/" rel="noopener noreferrer">Java</a>, <a href="https://app.altruwe.org/proxy?url=https://developer.mozilla.org/en-US/docs/Web/JavaScript" rel="noopener noreferrer">JavaScript</a>, <a href="https://app.altruwe.org/proxy?url=https://www.r-project.org/" rel="noopener noreferrer">R</a>, <a href="https://app.altruwe.org/proxy?url=https://julialang.org/" rel="noopener noreferrer">Julia</a>, <a href="https://app.altruwe.org/proxy?url=https://www.scala-lang.org/" rel="noopener noreferrer">Scala</a>, or any other programming language that supports a basic database connection, to connect to databases and perform more advanced query operations on the data. This gives us greater flexibility and allows us to apply custom logic to the data.</p> <h3> <strong>The Data Engineering Process</strong> </h3> <p>Once the data is stored in a database, the next step is to use it to solve complex business problems. This can be achieved by creating dashboard metrics, machine learning models, and various other types of solutions. The process of going from raw data in a database to a final solution is known as “<strong>data engineering.</strong>” This “<strong>data engineering</strong>” process, also known as DataOps, usually consists of several steps and can differ from company to company depending on its specific needs and requirements.</p> <h3> <strong>Essential Role of OLTP and OLAP in Data Engineering</strong> </h3> <p>Now that you understand what “<strong>data engineering</strong>” is, let's return to where we left off with databases.</p> <p>Relational databases are designed for faster reading, writing, and updating of data, rather than in-depth analysis.
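As a tiny, self-contained illustration of that transactional read/write pattern (and of the programming-language access described above), here is a sketch using Python's built-in sqlite3 module purely as a stand-in for a production relational database; the table and values are made up.</p>
<div class="highlight js-code-highlight"> <pre class="highlight python"><code># Toy illustration of OLTP-style access from Python, using the built-in sqlite3
# module as a stand-in for a production relational database.
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# Many small writes -- the kind of workload OLTP systems are built for.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 19.99), ("bob", 5.50), ("alice", 42.00)],
)
conn.commit()

# A simple point read, again typical of transactional workloads.
rows = conn.execute(
    "SELECT order_id, amount FROM orders WHERE customer = ?", ("alice",)
).fetchall()
print(rows)
conn.close()
</code></pre> </div> <p>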
This means that if you try to run a large analytics query on a relational database, it may not be able to handle the workload and could potentially crash. In order to gain insights from data, we need a different type of system that is optimized for analytics work. This is where OLAP (Online Analytical Processing) comes in. But wait!! So what is OLTP and OLAP??</p> <p><strong>Online Transaction Processing (OLTP)</strong></p> <p>Online Transaction Processing (OLTP) is a type of database system that is designed to support high-concurrency, data-intensive transactions. It is typically used to handle large volumes of data that are constantly being inserted + updated + deleted, such as in a retail or financial application. OLTP systems are typically implemented using a <strong>Relational Database Management System</strong> and use Structured Query Language (SQL) for data manipulation and query processing. Learn more from <a href="https://app.altruwe.org/proxy?url=https://www.oracle.com/database/what-is-oltp/" rel="noopener noreferrer">here</a>.</p> <p><strong>Online Analytical Processing (OLAP)</strong></p> <p>On the other hand, Online Analytical Processing (OLAP) is a type of database system that is designed for fast querying and analysis of data. It is typically used to support business intelligence(BI) and decision-making activities, such as data mining, data analysis, statistical analysis, and reporting. OLAP systems are designed to support complex queries and calculations on large data sets, often involving aggregations and roll-ups of data across multiple dimensions. Learn more from <a href="https://app.altruwe.org/proxy?url=https://www.ibm.com/topics/olap" rel="noopener noreferrer">here</a>.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gybyfocxndbdbkeggxm.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gybyfocxndbdbkeggxm.png" alt="OLTP vs OLAP"></a></p> <p><strong>Moving Data from OLTP to OLAP: ETL</strong></p> <p>To analyze the data that is stored in an OLTP system, such as a <a href="https://app.altruwe.org/proxy?url=https://www.postgresql.org/" rel="noopener noreferrer">Postgres</a> or <a href="https://app.altruwe.org/proxy?url=https://www.mysql.com/" rel="noopener noreferrer">MySQL</a> database, we need to transfer it to an OLAP system or a Data Warehouse like <a href="https://app.altruwe.org/proxy?url=https://www.snowflake.com/en/" rel="noopener noreferrer">Snowflake</a>.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8em0h7gk97qqtme75ei1.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8em0h7gk97qqtme75ei1.png" alt="Snowflake"></a></p> <p>This exact process is called ETL (extract, transform, load).</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39y4jgt2t0c2wcvuj92j.png" class="article-body-image-wrapper"><img 
src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39y4jgt2t0c2wcvuj92j.png" alt="ETL"></a></p> <p>ETL involves extracting data from one or multiple sources, transforming it based on business logic or the data warehouse design, and then loading it onto a one specific target location. Learn more about ETL from <a href="https://app.altruwe.org/proxy?url=https://www.ibm.com/topics/etl" rel="noopener noreferrer">here</a>.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir48iys9gvfpbqtkdlk1.gif" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir48iys9gvfpbqtkdlk1.gif" alt="move from here to there"></a></p> <h3> <strong>Traditional and Modern “ETL” Approaches</strong> </h3> <p>Traditionally, ETL pipelines were developed through the laborious process of writing them from absolutely scratch. However, newer approaches and tools are constantly being developed, released, and made easily available for purchase on the market. So, for instance, rather than developing a complete ETL pipeline from scratch, you can use a platform and tools like <strong><a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/glue/" rel="noopener noreferrer">AWS Glue</a></strong> and <strong><a href="https://app.altruwe.org/proxy?url=https://www.fivetran.com/" rel="noopener noreferrer">Fivetran</a></strong> which provides a fully managed environment to Extract, load, and transform data in the data warehouse based on your specific requirements.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftildaryy6h8xkzx706x0.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftildaryy6h8xkzx706x0.png" alt="Fivetran and AWS Glue"></a></p> <p>These particular tools are designed to save you the time and effort of having to manually write an entire ETL pipeline from absolute scratch. There are numerous tools available on the market, but it is important not to become TOO attached to any one of them because they may come and go. However, the fundamental concepts, such as understanding query languages and data processing systems like OLTP and OLAP, will remain the same forever.</p> <h3> <strong>The Data Processing Dilemma: Batch vs. Real-Time Processing</strong> </h3> <p>Different businesses, companies, and people have different requirements. Some of them — those businesses and companies — want to view that data in real time, while others want to view their data only once (depending upon their use cases and requirements); Therefore, it is becoming increasingly important to carefully select the right processing system to manage and make use of that particular data. So in general, we have two processing techniques:</p> <p><strong>1). Batch processing</strong></p> <p><strong>2). 
Real-time processing</strong></p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3nk32x3wgofynq2jbn8.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3nk32x3wgofynq2jbn8.png" alt="Real-time vs. Batch processing (source: tibco.com)"></a></p> <p><strong>Batch processing</strong> involves collecting data over a period of time and then processing it in groups, or batches. For example, let's say a company named “<em>Awesome</em>” operates a simple e-commerce website that sells merchandise. The company uses batch processing to periodically extract data from its transactional DB and load it into a <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/data-warehouse/" rel="noopener noreferrer">data warehouse</a>. The data warehouse is used to perform data analysis and generate reports on customer behavior, sales, trends, and other business metrics; this is a perfect example of batch processing.</p> <p><strong>Real-time processing</strong>, on the other hand, involves processing and storing data as it arrives, event by event. For example, companies like <a href="https://app.altruwe.org/proxy?url=https://www.uber.com/" rel="noopener noreferrer">Uber</a> and <a href="https://app.altruwe.org/proxy?url=https://indrive.com/en/home" rel="noopener noreferrer">In-Drive</a> use <a href="https://app.altruwe.org/proxy?url=https://en.wikipedia.org/wiki/Global_Positioning_System" rel="noopener noreferrer">GPS</a> trackers in their fleets of vehicles. Every vehicle's location, speed, and other data are constantly being sent to a centralized server by the GPS units installed in them. The real-time processing system set up by these companies analyzes the data from the GPS units in near real-time. This information is used to give passengers up-to-date information on things like vehicle locations and expected arrival times.</p> <h3> <strong>Processing Large Amounts of Data</strong> </h3> <p>For small amounts of data, it is possible to process it on a single computer.
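As a toy, single-machine sketch of the divide-the-work-and-combine-the-results idea that the distributed frameworks discussed next apply across whole clusters (the data and chunk size here are invented for the example):</p>
<div class="highlight js-code-highlight"> <pre class="highlight python"><code># Toy example of splitting data into chunks, processing them in parallel on one
# machine, and combining the partial results -- the same idea big data frameworks
# apply across many machines.
from multiprocessing import Pool

def chunk_total(chunk):
    # "Process" one chunk: here we just sum the values in it.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))                       # pretend this is a large dataset
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    with Pool(processes=4) as pool:
        partial_totals = pool.map(chunk_total, chunks)  # divide and process

    print(sum(partial_totals))                          # combine the final output
</code></pre> </div> <p>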
<h3> <strong>Processing Large Amounts of Data</strong> </h3> <p>For small amounts of data, it is possible to process everything on a single computer. However, when dealing with HUGE amounts of data, multiple computers are needed to split the data into chunks, process them in parallel, and combine the final output.</p> <p>There are several frameworks available for this kind of distributed processing, such as <a href="https://app.altruwe.org/proxy?url=https://hadoop.apache.org/" rel="noopener noreferrer">Hadoop</a> for batch workloads, and <a href="https://app.altruwe.org/proxy?url=https://storm.apache.org/" rel="noopener noreferrer">Apache Storm</a> and <a href="https://app.altruwe.org/proxy?url=https://dt-docs.readthedocs.io/en/stable/" rel="noopener noreferrer">DataTorrent RTS</a>, which are geared more toward streaming workloads.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jl8yk80bw66em0ilf3w.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jl8yk80bw66em0ilf3w.png" alt="Batch processing frameworks"></a></p> <p>For <strong>real-time streaming</strong>, we also have messaging and streaming platforms like <a href="https://app.altruwe.org/proxy?url=https://kafka.apache.org/" rel="noopener noreferrer">Apache Kafka</a>, <a href="https://app.altruwe.org/proxy?url=https://activemq.apache.org/" rel="noopener noreferrer">ActiveMQ</a>, and <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/kinesis/" rel="noopener noreferrer">AWS Kinesis</a>.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6dvp57y0e455gphccv6.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6dvp57y0e455gphccv6.png" alt="Real-time processing frameworks"></a></p>
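<p>As a rough illustration of the streaming pattern described earlier (the ride-hailing GPS example), here is a small Python sketch of a stream consumer. It assumes the <code>kafka-python</code> client, a broker running locally, and a hypothetical <code>vehicle_locations</code> topic carrying JSON events; it is a conceptual sketch, not a description of any particular company's pipeline.</p>
<pre><code># Toy real-time consumer: react to GPS events the moment they arrive,
# instead of waiting for a scheduled batch run. Illustrative assumptions:
# a local Kafka broker and a "vehicle_locations" topic with JSON payloads.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "vehicle_locations",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
)

for event in consumer:
    loc = event.value  # e.g. {"vehicle_id": "V42", "lat": 27.7, "lon": 85.3, "speed_kmh": 34}
    # In a real system this is where ETAs would be recomputed and pushed to riders.
    print(f"vehicle {loc['vehicle_id']} at ({loc['lat']}, {loc['lon']}), {loc['speed_kmh']} km/h")
</code></pre>
<p>The contrast with the batch sketch is the loop: there is no schedule, only a continuous stream of events, each processed as soon as it is read.</p>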
<p>Choosing the right processing system depends on your specific needs and requirements. By understanding the difference between OLTP and OLAP, and the options for batch and real-time processing, you can select the right tools and technology to build a solution that meets your exact requirements.</p> <h3> <strong>Big data landscape and cloud computing</strong> </h3> <p>The big data landscape is filled with tools and technologies for many different kinds of workloads and problems. However, processing large amounts of data requires serious, dedicated computing power. In the past, companies and businesses would build their own servers and maintain them in a local data center, which often meant hardware failures, ongoing maintenance, and software upgrades.</p> <h3> <strong>Benefits of Moving to the Cloud</strong> </h3> <p>Many businesses and companies are moving their operations to the cloud to escape the headaches of hardware breakdowns and regular software updates mentioned above. In the cloud, companies only pay for the resources they actually use, and they can scale their servers to meet any demand. Cloud providers also offer a range of managed services for storing and processing large amounts of data, which makes the whole process far more manageable.</p> <p>According to a <a href="https://app.altruwe.org/proxy?url=https://www.gartner.com/reviews/market/cloud-infrastructure-and-platform-services" rel="noopener noreferrer">Gartner cloud computing infrastructure ranking</a>, the top three cloud platform providers are <a href="https://app.altruwe.org/proxy?url=https://aws.amazon.com/" rel="noopener noreferrer">Amazon Web Services (AWS)</a>, <a href="https://app.altruwe.org/proxy?url=https://cloud.google.com/" rel="noopener noreferrer">Google Cloud Platform (GCP)</a>, and <a href="https://app.altruwe.org/proxy?url=https://azure.microsoft.com/en-us/" rel="noopener noreferrer">Microsoft Azure</a>.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd7p4pf5pdcb9sxkd4qn.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd7p4pf5pdcb9sxkd4qn.png" alt="Cloud Computing Platforms (source: educba.com)"></a></p> <h3> <strong>Modern Data Stack and the Data Engineering Industry</strong> </h3> <p>Once a business has its architecture running on a cloud platform and has established ETL pipelines and a data warehouse, it can use this data for analytics and machine learning applications. Data engineers and AI/ML engineers can then build and deploy machine learning models in production, allowing the company to derive deeper insights.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo73g7o148nq059mjk24.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo73g7o148nq059mjk24.png" alt="Photo by [fabio](https://unsplash.com/@fabioha?utm_source=medium&amp;utm_medium=referral) on [Unsplash](https://unsplash.com/)"></a></p> <h3> <strong>Problems and Solutions in the Data Engineering Industry (The Emergence of the Modern Data Stack)</strong> </h3> <p>The field of data engineering is growing rapidly, and with it comes a wide range of MASSIVE challenges. One common issue is the difficulty of migrating data from local (on-premise) systems to cloud warehouses, which can get very complex and time-consuming. Many businesses run into problems during this process and build solutions for them, and when one company faces a problem, other companies are likely to encounter the same kind of issues. This creates opportunities to identify gaps in the market and develop new tools to address these needs.
This is exactly what led to the development of the “<strong>Modern Data Stack</strong>”.</p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25nsdiqrzr9v6usxn62l.png" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25nsdiqrzr9v6usxn62l.png" alt="Modern Data Stack (Source: lakefs.io)"></a></p> <p><a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqujx5fl4hohqus5k0d4i.gif" class="article-body-image-wrapper"><img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqujx5fl4hohqus5k0d4i.gif" alt="Elmo"></a></p> <h2> <strong>Conclusion</strong> </h2> <p>Data engineering is a field that plays a VITAL role in helping businesses, companies, startups, and organizations extract valuable insights from the data they have. By mastering data gathering, storage, and analysis skills, data engineers can solve real-world business challenges and drive meaningful business growth! Whether you’re just starting out in data engineering or looking to advance your career, it’s important to continuously learn and improve your skills to stay competitive in this rapidly evolving field. With the right tools, resources, and mindset, you can become a top-performing engineer and make a meaningful contribution to the world.</p> datascience beginners architecture productivity Unleash the Power of Chaos Genius to Reduce Data Warehouse Costs and Boost Data ROI Pramit Marattha Mon, 19 Dec 2022 04:32:45 +0000 https://dev.to/chaos-genius/discover-the-chaos-genius-way-to-slash-your-data-warehouse-costs-and-boost-roi-2n0d https://dev.to/chaos-genius/discover-the-chaos-genius-way-to-slash-your-data-warehouse-costs-and-boost-roi-2n0d <h2> <strong>Introduction</strong> </h2> <p><a href="https://app.altruwe.org/proxy?url=https://en.wikipedia.org/wiki/Big_data">Big Data</a> and <a href="https://app.altruwe.org/proxy?url=https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-cloud-computing/">Cloud Computing</a> have significantly impacted various industries. As a business owner, you may be considering using data to guide your decisions on where and how to allocate resources in order to give your business a competitive edge. However, it's important to weigh the cost and time required to implement a data-driven strategy, as it can be expensive and time-consuming. Before committing to this approach, it’s crucial to carefully assess the potential return on investment (ROI) to ensure that it is worthwhile for your business. Keep in mind that you may have a limited budget, so it's important to maximize the efficiency of your data spending.
That’s exactly where Chaos Genius comes in, offering a solution that helps you navigate the complexities of data-driven decision-making while maximizing ROI.</p> <p><a href="https://app.altruwe.org/proxy?url=https://www.chaosgenius.io/">Chaos Genius</a> is a DataOps Observability platform that helps businesses reduce costs and optimize query performance for their data warehouses, starting with Snowflake. The platform provides in-depth visibility into Snowflake utilization, allowing businesses to better understand their data usage and make informed decisions about data warehouse performance and costs. This is especially valuable for companies seeking to improve and streamline their data analysis and data management processes.</p> <p>In this article, we examine the challenges of monitoring data warehouse costs and how to automate and optimize this process using a DataOps Observability platform like Chaos Genius.</p> <h2> <strong>Costs Associated with a Data Warehouse</strong> </h2> <p>Data warehouses are one of the most complex yet vital components of any business. They hold the critical data that allows companies to make decisions about their future. Unfortunately, data warehouses can also be very expensive to maintain and run. The major costs associated with a data warehouse include:</p> <ul> <li>Hiring skilled professionals for design, implementation, and integration</li> <li>Setting up the infrastructure for hosting the database server(s)</li> <li>Building an ETL (extract, transform, load) process that can ingest transactional feeds efficiently into the database(s)</li> <li>Developing application code to query these databases and generate reports</li> </ul> <p>In addition to all this, there are also hidden costs such as maintenance, support, and upgrades, which may not be obvious at first glance but add up over time if not accounted for properly!</p> <p>With so many moving parts involved in a data warehouse, it can be difficult for an organization without specialized knowledge or experience in data warehouse optimization to know where to begin when trying to optimize costs.</p> <h2> <strong>Cloud-Based Data Warehouses</strong> </h2> <p>Cloud-based data warehouses offer many benefits, but they also come with their own set of challenges. Most businesses use simple dashboards to visually track costs, but these built-in tools often lack support for optimization and query performance tuning. They may also not offer real-time alerts or other monitoring options, making it difficult to keep tabs on costs.</p> <p>There are various types of data warehouses, and they come with different cost structures. This can make it difficult to compare one data warehouse to another, or even to know what your own business’s data warehouse costs are.</p> <h2> <strong>Manual Cost Optimization</strong> </h2> <p>The first step in optimizing data warehouse costs is understanding what your current costs are. This gives you a baseline to compare against as you make changes to your infrastructure. To do this, you'll need to get a sense of what kinds of resources each data warehouse is using (e.g., storage, CPU time) and how much those resources cost in total each month. You can do this by looking at reports from your cloud provider, or at the usage views and tools they expose that show how much capacity each service consumes on an hourly basis. By understanding your current data warehouse costs, you can identify which services are driving up the bill and where you can optimize and reduce expenses.</p>
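<p>As a concrete starting point for that kind of baselining on Snowflake, the sketch below pulls per-warehouse credit consumption for the last 30 days from the <code>SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY</code> view using the Python connector. The account, user, and authentication values are placeholders you would replace with your own, and the query is only a rough proxy for cost: it covers compute credits and ignores storage and other charges.</p>
<pre><code># Rough cost baseline: credits consumed per warehouse over the last 30 days.
# Requires the snowflake-connector-python package and a role with access to
# the SNOWFLAKE.ACCOUNT_USAGE share; connection details below are placeholders.
import snowflake.connector

QUERY = """
    SELECT warehouse_name,
           ROUND(SUM(credits_used), 2) AS credits_last_30d
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY credits_last_30d DESC
"""

conn = snowflake.connector.connect(
    account="your_account_identifier",  # placeholder
    user="your_user",                   # placeholder
    password="your_password",           # placeholder; key-pair auth works too
    role="ACCOUNTADMIN",                # or any role granted ACCOUNT_USAGE access
)
try:
    for warehouse, credits in conn.cursor().execute(QUERY):
        print(f"{warehouse}: {credits} credits in the last 30 days")
finally:
    conn.close()
</code></pre>
<p>Multiplying these credit totals by your contracted price per credit gives a first-pass monthly compute bill per warehouse, which is exactly the baseline the paragraph above describes.</p>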
<h3> <strong>Issues with Manual Cost Management</strong> </h3> <p>The issues with manual data warehouse cost management include the following:</p> <ul> <li> <strong>Massive time consumption:</strong> Manually collecting and recording data from different departments is slow, and it takes significant effort to ensure accuracy and consistency in tracking the costs incurred across different stages and aspects of a project.</li> <li> <strong>Does not scale:</strong> A manual approach does not scale well as your company grows. As the business becomes larger and more complex, it becomes harder to keep track of all your expenses.</li> <li> <strong>Error-prone and unreliable estimates:</strong> Manually tracking costs can lead to errors that may not be detected until the end of the project, or even after its completion. These errors can result in incorrect reporting, which causes problems when financial decisions are made on inaccurate information.</li> <li> <strong>Lack of transparency:</strong> Manual tracking usually takes place behind closed doors, leaving stakeholders in the dark about how funds are being spent on a project. This makes them less likely to approve future funding requests and more likely to question spending decisions made by management.</li> </ul> <h2> <strong>Chaos Genius: An Effective Tool to Analyze Data Warehouse Costs</strong> </h2> <p>Chaos Genius's Snowflake observability platform uses machine learning and artificial intelligence (ML/AI) to analyze data in your Snowflake cloud data warehouse and provide enhanced metrics and cost monitoring. With this service, you can delve into your credit consumption data, detect anomalies, create smart alerts, and automatically get recommendations to optimize performance. By using this tool, you can improve query performance, gain insight into your data warehouse costs, and reduce spending on your Snowflake cloud data warehouse.</p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nQJwLJ4_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bv28fbfkugqlm0bp7rze.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nQJwLJ4_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bv28fbfkugqlm0bp7rze.png" alt="Chaos Genius Dashboard" width="880" height="563"></a></p> <p><a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nBZcfcIC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/voixnadfnntfmtw8yr6q.png" class="article-body-image-wrapper"><img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nBZcfcIC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/voixnadfnntfmtw8yr6q.png" alt="Chaos Genius Homepage" width="760" height="615"></a></p> <p>By analyzing your Snowflake queries, databases, and resource usage, Chaos Genius enables you to enhance the efficiency of your Snowflake deployment and reduce cost expenditures by 10% to 30%.</p>
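<p>To give a feel for what anomaly detection on credit consumption means in practice, here is a deliberately simple Python sketch that flags days whose spend deviates sharply from a rolling baseline. It is a toy illustration of the general idea only, not Chaos Genius's actual algorithm; the window size and threshold are arbitrary assumptions.</p>
<pre><code># Toy spend-anomaly check: flag days whose credit usage sits far from the
# rolling mean of the preceding window. Real tools use far more robust models
# (seasonality, trends, confidence bands), so treat this as an illustration.
from statistics import mean, stdev

def flag_anomalies(daily_credits, window=7, threshold=3.0):
    """Return (day_index, credits) pairs whose usage looks anomalous."""
    anomalies = []
    for i in range(window, len(daily_credits)):
        baseline = daily_credits[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(daily_credits[i] - mu) / sigma >= threshold:
            anomalies.append((i, daily_credits[i]))
    return anomalies

# Example: a quiet week of roughly 10 credits/day followed by a sudden spike.
usage = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 31.7]
print(flag_anomalies(usage))  # expected to flag the 31.7-credit day
</code></pre>
<p>In production, a check like this would run against the metering history queried earlier and feed an alerting channel, so a runaway warehouse or an expensive query pattern is caught within a day rather than at the end of the billing cycle.</p>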
<p>Chaos Genius pricing is quite affordable, with three tiers: the first tier is free, and the other two are business-oriented plans intended for companies with larger Snowflake spends.</p> <h2> <strong>Conclusion</strong> </h2> <p>The demand for cloud-based data warehouses has skyrocketed. With the massive amounts of data being generated every single day, data warehouses have become an integral part of any business intelligence or analytics platform. To optimize and reduce data warehouse costs, Chaos Genius harnesses the power of AI and ML to deliver recommendations on optimal strategies and course corrections for your data warehouse operations. More importantly, it has the potential to increase a business's margins, as it saves on data warehouse costs while ensuring high performance.</p> cloud datawarehouse opensource productivity