Warning: Attempt to read property "ID" on null in /nas/content/live/stemmadocs/wp-content/plugins/eazydocs/templates/onepage/default-layout.php on line 15

Warning: Attempt to read property "ID" on null in /nas/content/live/stemmadocs/wp-content/plugins/eazydocs/templates/onepage/default-layout.php on line 84

Using the Stemma Bot in Slack

Stemma’s goal is to capture information about data no matter where it happens. Slack is increasingly where conversations about data exist, and Stemma has a super simple integration that allows you to pin Slack conversations to tables.

Here’s how it works. If someone asks a great question, or posts a great answer about a piece of data in Slack, you can open a thread and tag the Stemma bot:748

You can now go to the Table Detail page in the Stemma UI to see the linked conversations:1186

Where can you use the Slack bot?

You can use it in public or private channels. You just need to invite the bot into the channel. It cannot be used in direct messages.

Some other things to keep in mind about the integration:

  • You must respond in a thread
  • You need to give a precise definition. If a table is only present in one schema, “@Stemma table_name” will work just fine. But if it’s present in multiple schemas, you’ll need to use “@Stemma schema.table_name”

If you have any feedback or questions about the Slack integration, please reach out to [email protected]!

Getting Started

What is Stemma?

Stemma is a data catalog built for users of the modern data stack. It uses automation and workflow integration to help data owners get documentation into the catalog, while making it easy for end users to discover and trust data.

Tools for the Data Team to Document Data

Auto-generated metadata handles basic stewardship so no asset starts blank. Data that is automatically generated includes:

Slack integration to capture context where and when it is expressed

WYSIWYG editor to easily annotate assets with text, images, and existing docs

Tools for Analysts to Find Data

Track data lineage

Explore and contribute to the Glossary

Browse tables:

  • By source type
  • By tags

Search and filter for columns

Tools To Keep Data Owners in Contact with Users

How to be Successful with Stemma

Kickstart documentation with automated data stewardship

Manage data changes with advanced lineage tools

Enhance existing user workflows by integrating the tools you already use


Integrating Your Applications with Stemma: What We Need from You

What we need to get you up and running depends on the applications you will be integrating with Stemma, but in general you will need to provide the information and do the tasks for Slack, SSO, and your dashboard and storage applications.

Instructions are on the following pages.

Information Stemma Extracts Automatically

This page shows what information you can expect to see in Stemma after you have integrated all of your applications.

Does Stemma Read My Data?

In general, Stemma reads metadata (queries, who ran them, names of columns or charts, etc.), not the data itself.

  • For dashboards, Stemma reads only metadata and doesn’t have access to numbers, or read charts or reports within the dashboarding system.
  • For tables the same is true, with one exception: if you choose to use the column statistics feature, Stemma reads the data to obtain that information.

For more information:

Next Step

Once you have integrated your applications into Stemma and everything is tested and working, you need to plan the rollout to your users.

Athena Integration

Stemma’s Athena integration supports: extracting metadata information (tables, columns, etc)

Proceed was follows to create an Athena connection to Stemma.

Create a user

To integrate with Stemma we recommend creating a new user in IAM that has access specifically scoped to the actions and resources required. The following script enables the required access:

Terraform

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  # You are only creating IAM which are global, you may
  # change the region but it should not impact this script
  region = "us-east-1"
}

resource "aws_iam_user" "stemma_read_user" {
  name = "stemma-read-user"
}

resource "aws_iam_user_policy" "stemma_read_policy" {
  name        = "stemma-read-policy"
  user = aws_iam_user.stemma_read_user.name

  policy      = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "athena:GetDatabase",
          "athena:ListDataCatalogs",
          "athena:GetDataCatalog",
          "athena:ListDatabases",
          "athena:ListTableMetadata",
          "glue:GetDatabases",
          "glue:GetTables"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}

Generate credentials

After creating the user, generate an AWS Access Key and Secret for the user.

Provide credentials to Stemma

Now provide this information to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose AWS as the Connection type, provide a Connection Name, choose Access Key as the Authentication type, provide the AWS Access Key and AWS Access Key Secret you have just created, and specify the AWS Region:
  2. When you are finished, click Save Connection.

Testing the access

Optionally, you can validate the access to the catalog, databases and table that Stemma will have with the following code snippet; make sure to replace ACCESS_KEY and ACCESS_KEY_SECRET with the values for the Stemma Athena user:

import json
import boto3
from botocore.config import Config


client = boto3.client(
    "athena",
    aws_access_key_id="{ACCESS_KEY}",
    aws_secret_access_key="{ACCESS_KEY_SECRET}",
    config=Config(region_name="us-east-1"),
)
all_tables = []
for catalog in [cat['CatalogName'] for cat in client.list_data_catalogs()['DataCatalogsSummary']]:
    for database in [database["Name"] for database in client.list_databases(CatalogName=catalog)['DatabaseList']]:
        for table_name in [tbl['Name'] for tbl in client.list_table_metadata(CatalogName=catalog, DatabaseName=database)['TableMetadataList']]:
            all_tables.append(f'{catalog}.{database}.{table_name}')
            
print('All tables available:')
print(json.dumps(all_tables, indent=2))

Azure Active Directory Integration

Create an Azure Application

To create an application, follow these steps:

  1. Sign in to Azure portal.
  2. Click Azure Active Directory.
  3. On the Azure Active Directory page, click Enterprise applications.
  4. Click New application on the top menu.
  1. Click Create your own application.
  1. Enter a name for your application, we suggest “Stemma Data Catalog“.
  2. Select Register an application to integrate with Azure AD (App you’re developing).
  1. Click Create. This should redirect you to Register an application.
  2. Under Supported account types, select Accounts in this organizational directory only (Single tenant).
  3. Under Redirect URI (optional), select Web. If you are running a hosted version of Stemma, the redirect URI should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name. In case of on-premise deployment, please consult your Stemma representative for this.
  1. Click Register.

Create a client secret

To create a client secret, follow these steps:

  1. Go to Azure Active Directory and click App registrations.You must go to the top level in your Azure portal and then select Azure Active Directory. You can then select App registrations.
  1. Select your newly added application. Make a note of the Application ID.
  1. Click Certificates & secrets on the left-hand side and click New client secret.
  1. Provide a description and an expiration length that follows your security organization’s guidelines.
  2. Click Add.
  1. Copy the Value corresponding to the newly-created client secret and client secret expiration.
    This is the Relying Party OAuth Client Secret. This secret value is never displayed again after you leave this page.

Set up application permissions

You need to set up two application permissions in your Azure portal so that you can use all of the Azure AD synchronization options in Stemma. You need to set up the following permissions:

  • Azure Active Directory Graph Directory.Read.All
  • Microsoft Graph Directory.Read.All

To set up permissions, do as follows:

  1. Go to Azure Active Directory and click App registrations.You must go to the top level in your Azure portal and then select Azure Active Directory. You can then select App registrations.
  1. Select your newly added application.
  2. Click API permissions on the left-hand side and click Add a permission.
  1. Click APIs my organization uses and click Windows Azure Active Directory.
  1. Select the Azure Active Directory Graph > Directory.Read.All permission.
    a. Click Application permissions.
    b. Under Directory, click Directory.Read.All.
    c. Click Add permissions.
  1. Select the Microsoft Graph > Directory.Read.All permission.
    a. Click Application permissions.
    b. Under Request API Permission, click Microsoft Graph.
    c. Under What type of permissions does your application require?, click Application permissions.
    d. Under Directory, click Directory.Read.All.
    e. Click Add permissions.
  1. Click Grant admin consent for and then click Yes. You should see a message saying that you’ve granted consent for each of your permissions.

What We Need From You

Provide Stemma with the following information by email or Slack:

  • Application ID
  • Client Secret
  • Discovery Document Endpoint (Well Known OpenID Configurations)
  • Tenant ID – The ID of the tenant you created the application in.

BigQuery Integration

Stemma extracts table and column metadata including database, schema, table name, table description, column name, and column description from a Bigquery database.

What We Need From You

You will need to create a service account for reading metadata and grant it “BigQuery Metadata Viewer” access to all of your datasets. This can all be done via the BigQUery UI.

Next, collect the information Stemma needs, as follows:

  • Authorization/Access Scopes:
    Stemma uses the following scopes to extract metadata and usage information.
    — <https://www.googleapis.com/auth/bigquery.readonly>
    — <https://www.googleapis.com/auth/cloud-platform>
  • JSON Key File: Proceed as follows to create the JSON key file for a service account.

Create a service account (via Cloud Console):

  1. In the Cloud Console, go to the Create service account page.
  2. Select a project.
  3. In the Service account name field, enter a name. The Cloud Console fills in the Service account ID field based on this name.
  4. In the Service account description field, enter a description. For example, `Service account for Stemma
    BigQuery Integration
  5. Click Create and continue.
  6. Click the Select a role field, and make sure to select the following:
    — BigQuery Data Viewer (roles/bigquery.dataViewer) – to extract the metadata, watermarks and usage information. Popular tables will be generated using this information.
    — BigQuery User (roles/bigquery.user) – Needed to map the user’s information with the Tables i.e., Frequent Users, Table Owners, etc.
    — Private Logs Viewer (roles/logging.privateLogViewer) – Needed for enriching table usage.
  7. Click Continue, and click Done to finish creating the service account.

Do not close your browser window. You will use it in the next step.

Create a service account Key:

  1. In the Cloud Console, click the email address for the service account that you created.
  2. Click Keys, click Add key, then click Create new key.
  3. Click Create. A JSON key file is downloaded to your computer.
  4. Click Close.

Provide the information to Stemma

To provide this information to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Google Cloud as the Connection type, provide a Connection Name, and paste in the Service Account Key from the JSON file you have just downloaded.
  2. When you are finished, click Save Connection.

dbt integration

Stemma integrates with your dbt Cloud or Core projects and is able to ingest and keep up to date all of the metadata in the table below.

 dbtStemma
TablesTables
Table lineageTable lineage
ColumnsColumns
Table definitionsProgrammatic descriptions (dbt descriptions section)
Column definitionsColumn definitions

dbt Cloud

Integrating with dbt Cloud requires minimal information and access to the dbt Cloud account of your organization and an administrative access to Stemma.

The following 3 properties are needed from dbt Cloud:

  • Account ID
    • Usually visible within the browser URL on the account settings page: https://cloud.getdbt.com/next/settings/accounts/<account-id-number> or within any any dbt Cloud project: https://cloud.getdbt.com/next/deploy/<account-id-number>/projects/<project-id>.
  • Service Token
    • Create a Service Token with the appropriate permissions:
      • For organizations on the Team Plan, the account wide permissions required are Metadata and Read-Only.
      • For organizations on the Entreprise Plan, the necessary permission set is Account Viewer and Job Viewer for all projects that should have metadata ingestion.
  • Host
    • Unless organization specific customization has taken place, the host is likely cloud.getdbt.com

With these credentials in hand, all that’s left is adding the connection information to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection.
  2. In the pop-up window, choose dbt as the Connection type, provide a Connection Name, and fill in the API Token, Host and Account ID collected from dbt Cloud.
  3. When you are finished, click Save Connection

dbt Cloud integration may take up to 24 hours for initial ingestion.



dbt Core

To support ingesting metadata from dbt artifacts for dbt Core workflows, we require you to generate them daily and upload them to an AWS S3 cloud storage bucket provisioned by Stemma.

Stemma’s dbt integration relies on the following pieces of information:

  • The database name (e.g., “snowflake”, “postgresql”)
  • manifest.json and catalog.json (produced via dbt docs generate)
  • Optional: The base url for your dbt github repository (e.g., “https://github.com/{customer}/dbt-{customer}/tree/main
    “). This allows us to includes github links to your dbt models on Stemma’s table details page.

We recommend testing out the integration first by simply sending the two JSON files to your Stemma contact via Slack or email.

For an ongoing integration, we’ll provide you with the following information for the daily delivery of the dbt artifacts.

  • AWS Access Key pair
  • S3 Bucket (e.g, “s3://{customer}-stemma-integrations”)

The S3 prefix path will be “dbt/{date}” where “{date}” is the current days date in UTC in the format “YYYY-MM-DD”. For example, for a file delivery on 04/01/2022 we would expect to see the following files:

  • “s3://customerxyz-stemma-integrations/dbt/2022-04-01/manifest.json”
  • “s3://customerxyz-stemma-integrations/dbt/2022-04-01/catalog.json”

Delta Lake Integration

Proceed as follows to create a Delta Lake connection to Stemma.

Step 1: Generate a Personal Access Token

Follow these instructions to generate the Delta Lake personal access token that Stemma will need to extract your data into the catalog.

Step 2: Provide the Personal Access Token and Hostname to Stemma

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Delta Lake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows:
  • Provide the Personal Access Token you created in Step 1.
  • Provide the fully-qualified domain name of the Delta Lake host; for example mydlhost.mydomain.com.
  1. When you are finished, click Save Connection.

Google OIDC Integration

This document walks you through the steps to create and configure a Google OIDC application that Stemma can use to authenticate users.

Note: You must be a Google administrator to do this.

Steps

  1. Log in to Google Developer Console
  2. On the left, select Credentials
  3. Select + CREATE CREDENTIALS at the top

1664

  1. Select “OAuth Client ID”

948

  1. Enter the following information:
  • Application type: Web application
  • Name: You can choose the name; we suggest “Stemma OIDC” as a descriptive example
  • Authorized JavaScript origins: This must be “https://your-company.stemma.ai
    • Replace your-company with the correct value from your Stemma domain
    • If you have a custom domain name for Stemma, use that value here
  • Authorized redirect URIs: Provide the following URLs; make sure you replace your-company with the proper value:

1034

  1. Select Create at the bottom.

758950

  1. Download the Client ID and Client Secret JSON file and provide them to Stemma by email or Slack.

Hive Integration

What we need from you

Stemma needs certain information and credentials to extract Hive metadata into the catalog. Contact Stemma by email or Slack, and provide the following:

  • Metastore Host: Hostname is the IP address of the Hive server to which you are connecting.
  • Username and Password: You will need to provide a username and password for Stemma to use to access the Hive schema.
  • Metastore Port: Server port used for accessing metadata about hive tables and partitions. The default Hive metastore port is 9083.
  • List of Databases: Stemma whitelists databases, and so you will need to provide a list of the databases we will be importing.

Metadata extracted

The metadata Stemma extracts includes:

  • TBLS – stores basic information about Hive tables, views, and index tables.
  • DBS – stores the basic information of all databases in Hive.
  • PARTITION_KEYS – the field information of the table storage partition.
  • TABLE_PARAMS – stores the attribute information of the table/view.
  • SDS – saves the basic information of file storage, such as INPUT_FORMAT, OUTPUT_FORMAT, whether it is compressed or not.
  • COLUMNS_V2 – stores the field information corresponding to the table.
  • PARTITIONS – stores the basic information of table partitions.

JIRA Integration

This document walks you through the steps you need to take to integrate Stemma with your JIRA installation. You must be a JIRA Admin to do this.

Steps

Step 1 – Collect information about your JIRA environment

Collect the following information to provide to Stemma (see Step 3):

  • The URL for your JIRA environment (e.g., <https://jira.net>)
  • The project ID for the JIRA project you would like Stemma to create tickets in
    See this page for guidance from Atlassian on how to obtain the project ID for a JIRA project
  • If using JIRA Server or Cloud, the ID for the JIRA Bug issue type
    • Note: Skip this step if using JIRA on-premise
    • See this page for guidance from Atlassian on how to obtain the ID for the Bug issue type
  • Optional: The name of one or more labels you would like set on tickets created by Stemma

Step 2 – Create a service account

Stemma will need an email and API token to authenticate to your JIRA environment. Our recommendation is to create a JIRA service account so that the authentication is not tied to an actual user’s account.
a. Go to Administration > User management > Users
b. Click Create User and fill out the required information
c. Create an API token from the service account

  • For JIRA cloud see this page
  • For JIRA server see this page
  • For JIRA on-premise this feature is not currently supported so you will provide the password instead

d. Grant permission to the service account

  • The service account will minimally need access to create and view tickets in the JIRA project you intend to use
  • You can either grant permissions directly to the service account or you can grant the service account a role

Stemma needs:

  • The email address of the service account
  • The API token or password

See Step 3.

Step 3 – Provide the information to Stemma

Provide the information from Steps 1 and 2 to Stemma by email or Slack.

Linking JIRA Issues to Stemma

To link a JIRA issue to a table in the Stemma catalog, you need to include a Table Key tag in the issue description. The tag format is Table Key: {SOURCE}://{DATABASE}.{SCHEMA}/{TABLE} [PLEASE DO NOT REMOVE].

Looker Integration

This document walks through the steps required to create a service account in your Looker, which will be connected to your Stemma Instance for dashboard metadata extraction.

Stemma uses the Looker APIs (via Looker Python SDK) to extract the metadata of Folders, Dashboard, Dashboard Elements and Explore Queries, and data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Looker. Read Looker’s docs: https://docs.looker.com/reference/api-and-integration

Stemma takes leverage of the API3 keys (Client ID & Client Secret) to make a secure connection between your Stemma and Looker instances. Below you can find out the steps required to create the new API3 keys for this integration.

What is extracted?

Stemma extracts information about Folders, Dashboard, Dashboard Elements, Explore Queries, and data sources associated with them.

Stemma will not extract Dashboards that are deleted, hidden or in a user’s personal folder.

Steps

Step 1: Make sure you have a Looker Admin role

Looker → Stemma integration requires an Admin role to fetch the correct data and build the relationships between the entities like Table Lineage, Dashboard Owners, etc.

So, as a first step, make sure you have an Admin role.

Step 2: Generate API3 keys

The next step is to create an API3 Key. From the Admin menu, navigate to the Users page.

Find the user you want to use to generate the API3 key, and click the “Edit” button under the Actions column.

Once you are on the “Edit User” page, find the API3 Keys settings, and click “Edit Keys

Click “New API3 Key” to generate a new Client ID and Client Secret.

Step 3: Provide the details to Stemma

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Looker as the Connection type, provide a Connection Name, and then fill out the fields in the dialog box as follows:
    • Base URL
      Example: <https://stemma.cloud.looker.com/>
    • Client ID
      The one you generated in Step 2
    • Client Secret
      The one you generated in Step 2
  2. When you are finished, click Save Connection.

Mode Integration

The Stemma Mode integration accesses the Mode Discovery API. This API is distinct from regular Mode API: it is much more performant but only contains a subset of the available endpoints. This means Stemma is able to access only metadata about reports, not the report runs/results. Stemma also accesses the regular REST API to augment the information retrieved from the Discovery API. In all, there are four values required by Stemma:

  1. Your organization name in Mode
  2. The access token to access the REST API
  3. The password token to access the REST API
  4. A signature token to access the Discovery API

Optionally, if you would like Stemma to only extract metadata from specific Mode spaces you may also provide Stemma a list of the spaces to include.

Creating the Access Token & Password Token

You can learn about how to create this access here (https://mode.com/developer/api-reference/authentication/).

The access token and password token allow Stemma to augment the information within the Discovery API with the regular REST API to unlock two additional capabilities:

  1. The ability to link your dashboards to the underlying tables in your warehouse which are used to serve the dashboards.
    • Stemma will use the general Mode API for datasources to retrieve your data source information to find out the name of the database (you may use, PUBLIC, mktg, etc.) as well as the type of data source (e.g. Snowflake, Redshift, etc.) in order to apply the proper SQL parsing semantics.
  2. The ability to rank your dashboards by the number of times the dashboard has been viewed in the past 90 days.
    • Stemma will use the general Mode API for reports API to retrieve an accurate view count.

Creating the Signature Token

To authenticate with the Mode Discovery API a signature token must be created. This signature token can be created by running a curl command using the account’s REST API access token and password secret. The following code is a snippet from Mode that has been slightly adjusted to create a signature token that Stemma can use. Make sure to replace {organization} , {your-access-token} and {your-access-secret} with the appropriate values (organization can be found in the URL when you visit Mode, e.g. https://app.mode.com/<organization>/spaces). Visit the Mode documentation for information on creating the api-key and api-secret.

curl --location --request POST 'https://app.mode.com/batch/{organization}/signature_tokens' \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--user '{access-token}':'{your-access-secret}' \
--data-raw '{
    "signature_token": {
        "name": "stemma-api-token",
        "expires_at": "2024-10-02T21:00:00+0000",
        "auth_scope": { "authentication_for" : "batch-api", "authorization_type": "read-only" }
    }
}';

Please note, if this does not work it likely means you do not have permissions. Please contact Mode support with help creating a read-only signature token.

APIs used

APIs are listed in the table below. In the table below, Data Extracted is the data we actually store from the API (and may change as we add functionality). Risks highlights any potentially-sensitive information in the API (whether or not we actually store it).

Please note, that the “Report Stats” endpoint is not enabled by default in mode, so functionality coming out of it may not be available, until the endpoint is enabled.

NameDocsData ExtractedRisks
Chartshttps://mode.com/developer/discovery-api/analytics/charts/Chart name, type, query idLow: contains employee email addresses
Reportshttps://mode.com/developer/discovery-api/analytics/reports/Execution times, dashboard names, dashboard descriptions, creation/last modified times, owner email addresses,Low: contains employee email addresses
Report Statshttps://mode.com/developer/discovery-api/analytics/report-stats/Usage countsLow[1]
Querieshttps://mode.com/developer/discovery-api/analytics/queries/Name, Raw SQLModerate[2,3]
Membershipshttps://mode.com/developer/discovery-api/analytics/members/Mode usernamesLow: contains employee email addresses
Reports: detailedhttps://mode.com/developer/api-reference/analytics/reports/Dashboard view countsLow: contains metadata about the report itself, not the underlying executions of the report
Data Sourceshttps://mode.com/developer/api-reference/management/data-sources/Mode data source name and data source type.Low: contains access to information about the data source. Stemma should separately have access to this information but this API access is required to link a mode dashboard to a specific database.

Notes

Here are some additional notes about possibly sensitive information we may receive from Mode

  1. Dashboard execution times are non sensitive on their own, although it is possible to infer NPMI from who is running which dashboard queries at precise times. Stemma only displays aggregated forms of this information, but stores it disaggregated.
  2. Query names may contain NPMI (e.g. a query titled “EMEA Expansion Potential”, when the company hasn’t announced its intention to expand into EMEA).
  3. Raw SQL queries may contain sensitive information (PII), typically in the form of filter clauses. This is generally not allowed under most security policies especially under Open Collections, however, it is a possibility. By default, these SQL statements will be shown to users in Stemma.

Configuring Your Stemma Connection

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, optionally add a name for the connection, and choose Mode from the Connection Type drop-down menu.
  1. In the dialog window, provide the signature token and Mode organization name, and optionally the access token and password token..
  2. When you are finished, click Save Connection.

Okta Integration

This document walks you through the steps to create and configure an Okta application that Stemma can use to authenticate users.


Note: You must be an Okta administrator to do this.


Steps

  1. Log into Okta admin portal
  2. In the left frame, select Applications :
  1. Select Create App Integration:
  1. Select the options for OIDC – OpenID Connect and Web Application.
  1. In the New Web App Integration, enter the following values:
    • App Integration Name: Your app name, we suggest Stemma Data Catalog
    • Grant Type: select the following –
      • Authorization code
      • Client Credentials (Client acting on behalf of itself)
      • Refresh token
    • Sign-in redirect URI:
      • Use the fully qualified hostname and add the endpoint:
        • /auth
        • If you are running a hosted version of Stemma, this should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name
    • Sign-out redirect URI
      This should be the base Stemma URL. Example:
      • https://<customer>.stemma.ai
    • Trusted Origin:
      • If you are self-hosting Okta, provide the Base URI for your Okta Sign domain, otherwise, leave this blank
    • Assignments:
      • Select the assignment access you would like within your organization
  1. Stemma will require access by certain APIs to retrieve user information from your organization’s directory.
  • Navigate to Okta API Scopes :
  • Make sure to grant permission for the following scopes:
    • okta.users.read
    • okta.users.read.self
  1. Provide the following values to Stemma by email or Slack so that authentication can be built into your Stemma deployment:
    • Okta Discovery endpoint (well-known configs)
    • Client ID
    • Client Secret

752

Redshift Integration

What we need from you

Stemma needs certain information and credentials to extract Redshift metadata into the catalog. To provide these, proceed as follows.

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Redshift as the Connection type and then provide the following information in the dialog box:
    • A username and password for Stemma to use to access the INFORMATION_SCHEMA
    • The hostname of the Amazon Redshift cluster. You can find it in the Properties tab of your Redshift Cluster Details page. Under the Connection details section, copy the Endpoint up to but not including the colon.
    • A port number. The default port number for Redshift is 5439. From your Redshift Cluster Details page, you can find the port number under the Database configurations section of the Properties tab.
    • A list of databases: Stemma whitelists databases, so you will need to provide a list of the databases we will be importing.

Access

The Stemma Redshift integration requires access to two pieces of information within the Redshift install:

  • information_schema metadata
  • Query logs

information_schema

The specific tables and views we access include:

  • PG_GET_LATE_BINDING_VIEW_COLS
  • svv_external_columns
  • COLUMNS

Because of the way permissions work in Redshift, these tables are only populated with data to which the user has SELECT access. If you don’t want to use column stats, and therefore want to restrict access for this credential, Stemma will give you the SQL code required to create a materialized view. You create that materialized view as a superuser, and grant our credential the ability to read and refresh that materialized view.

Query logs

Stemma Redshift requires query logs. We can extract these in two ways: push or pull.

Push

The recommended integration involves configuring Redshift to dump user activity logs into S3.

There are multiple methods to achieve this, either using CloudTrail or directly from Redshift. the exact right way differs between organizations (many already have audit logging configured for other purposes).

Pull

Stemma can extract logs from STL_QUERY. However, because this option requires the credentials be a superuser, it is not the recommended integration method.

Column stats – Optional

The Stemma integration user must have full SELECT access to any tables or views for which you want to see column stats.

Slack Integration

This document walks through the steps required to install a Stemma Slack application in your workspace.

Prerequisite

Stemma does not publish the Stemma Slack application to the Slack Marketplace. To install this application for your workspace, you need to create a new application using an application manifest file.

If you have an application manifest file, make sure you have it handy before you begin. If you don’t have a manifest file, you can copy and paste the following into a file:

_metadata:
  major_version: 1
  minor_version: 1
display_information:
  name: Stemma
  description: Help you enrich your metadata directly via Slack
  background_color: "#0a1930"
features:
  bot_user:
    display_name: Stemma
    always_online: true
  slash_commands:
    - command: /stemma
      url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/slash/help
      ## Replace {SUBDOMAIN} with your Stemma subdomain.
      description: Learn how StemmaBot works
      usage_hint: help
      should_escape: false
oauth_config:
  scopes:
    bot:
      - app_mentions:read
      - channels:history
      - channels:read
      - chat:write
      - commands
      - groups:history
      - groups:read
      - links:read
      - links:write
      - users:read
      - users:read.email
settings:
  event_subscriptions:
    request_url: https://{SUBDOMAIN}.stemma.ai/stemma/slack/events
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
    bot_events:
      - app_mention
      - link_shared
      - member_joined_channel
      - message.channels
      - message.groups
      - team_join
  interactivity:
    is_enabled: true
    request_url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/interactivity
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
  org_deploy_enabled: false
  socket_mode_enabled: false
  token_rotation_enabled: false

Save the file with any prefix and in any location you like, but make sure:

  • The filename has the suffix .yaml.
  • You replace {SUBDOMAIN} with your Stemma subdomain.

Steps

Step 1: Verify your application manifest file

Make sure that at least the following properties in the application manifest file match your Stemma endpoint:

  • features → slash_commands → url:
  • settings → event_subscriptions → request_url
  • settings → interactivity → request_url

Step 2: Create and install your Stemma Slack application

Create a new application using the application manifest file.

a. Navigate to: https://api.slack.com/apps, and click Create an App.

b. You will be prompted to select a method to create a Slack Application. Select From an app manifest, and choose the workspace in which you want to install the Stemma Slack application.1088

c. Paste in your manifest file. Make sure you select the YAML format, and paste in the contents of the file (see Prerequisite above).1088

d. Click Next and verify the scopes and events the Stemma Bot is requesting. Finally, click Create to install the application to your workspace.

Step 3: Provide Signing Secret and Bot Token to Stemma

To secure the communication between Slack and Stemma, you need to provide Stemma with the Signing Secret and Bot User OAuth Token of your newly installed Stemma Slack application:

  • Signing Secret:
    • You can find the Signing Secret on the Basic Information page, in the App Credentials section.

2144

  • Bot User OAuth Token:
    • You can find this token on the OAuth & Permissions page, in the OAuth Tokens for Your Workspace section.

2364

Now provide the Signing Secret and Bot User OAuth Token to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Slack as the Connection type, provide a Connection Name, and then in the dialog box enter the Bot Token and Signing Secret.
  2. When you are finished, click Save Connection.

Step 4: Receive confirmation from Stemma

Stemma will use the information you provide to integrate your Slack workspace with your Stemma Instance. Once you get the confirmation, you should be able to type /Stemma help in Slack to see usage help for the Stemma Slack application.

Step 5: Make sure to add an app icon 😊

Under the Stemma (Slack Application) basic information Tab, add the Stemma icon in the “Display Information” section. Users will be much more likely to use the bot if it’s easy to identify.1024

(Nice to have) Please invite Stemma team member(s) to a shared test channel

At this point, you have the Stemma Slack application installed within your Slack workspace. In order to verify if things are working just fine with no errors, you can create a temporary shared channel and invite a couple of members from the Stemma team to that channel.

For information about using Stemma with Slack, see Using the Stemma Bot in Slack.

Snowflake Integration

What we need from you

Stemma needs certain information and credentials to extract Snowflake metadata into the catalog. To provide these, proceed as follows.

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Snowflake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows:
    • Credentials
      We support authentication with a username and either a password or a private key.
      • Username and Password: You need to provide a username and password for Stemma to use to access the INFORMATION_SCHEMA as described above.
      • Username and Private Key: If you set up authentication via key pairs you will need to provide Stemma with a private key to use when accessing INFORMATION_SCHEMA as described above. If your private key is encrypted, you will also need to provide Stemma with the passphrase.
    • Snowflake Account Name: If you use the snowflake web interface, the URL you use should look something like abc.snowflakecomputing.com. Everything that appears before Snowflake computing is your account name.
    • Warehouse: The name of the warehouse (also called Virtual Warehouse) Stemma will connect to.
    • Role: The role Stemma will use once connected. If Stemma is extracting query logs, this role will need to have MONITOR access on the warehouse.
  2. When you are finished, click Save Connection.

Access

The Stemma Snowflake integration requires read access to all tables that will be included in the catalog, plus query logs.

Metadata

The specific tables and views we access include:

  • TABLES
  • COLUMNS

Due to the permissions model of Snowflake, these tables are only populated with data to which the role has SELECT access. Metadata will be extracted one database at a time since each Snowflake database has its own INFORMATION_SCHEMA.

Query logs

Stemma extract logs from the QUERY_HISTORY table function in the INFORMATION_SCHEMA. Logs are used to inform lineage, usage, common usage patterns and more. Due to the permissions model of Snowflake, this table is only populated with queries executed in virtual warehouses that the role has MONITOR access on.

Access History

Stemma uses ACCESS_HISTORY to enrich lineage derived from query logs. To grant Stemma access, Stemma’s user role will need to have the IMPORTED PRIVILEGES:

grant imported privileges on database snowflake to role stemma_role;

If Stemma cannot read Access History, lineage will still work however accuracy will be reduced.

Note: If you try giving the Stemma role access only to the required view, without giving access to the SNOWFLAKE database and ACCOUNT_USAGE schema before that, Snowflake will probably reject your request with the “Grant not executed: Insufficient privileges.” error.

Tableau Integration

Usage information

The Stemma Tableau integration will sum view counts for all workbooks in a particular dashboard over all time.

Installation Instructions

This document walks through the steps required to create a service account in your Tableau, which will be connected to your Stemma Instance for metadata extraction.

Stemma uses the Tableau Metadata API (GraphQL) to extract the metadata of projects, workbooks, dashboards, and data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Tableau.
Read Tableau’s docs: https://help.tableau.com/current/api/metadata_api/en-us/index.html

Connection Type:

We support both Online and Extract Connections with no differences.

Steps

Step 1: Decide your authentication strategy

The Tableau Metadata API is authenticated using a Personal Access Token (PAT) attached to a specific user within your organization. You will always share with us the PAT, not the username/password of the user itself. The PAT has the same permissions as the user to which it is attached.

Most users simply will simply create a PAT attached to their admin account. This is the easiest to set up and simplest to manage longer term, and we recommend it. For this option, skip to Step 4.

However, because this means the PAT has write access to the instance and Stemma doesn’t use write access at all, this doesn’t strictly follow principle of least privilege. In cases where restricting such access is critical, we recommend creating a read-only service account, and then you’ll generate a PAT for that limited account. Continue to Step 2 for this option.

Step 2: Create a new Viewer user

If you decide to create a new user for this integration (otherwise skip to Step 4 to use your current user to generate the Personal Access Token), navigate to the Users menu from the side navigation.

Click on “Add Users” and select “Add Users by Email“. A popup will appear with a Form to create new Users.

Select the “Tableau” radio button, and write down the email address you want. You will not share this email address with Stemma, and use this to sign in to the Tableau instance in order to create a PAT.

Email Tip: If you use Google email, [your_current_user][email protected][your-domain] will allow you to receive emails on your current email address and satisfy the uniqueness requirement.

Select “Viewer” as the Site role, which will give this user minimum access to the Tableau, and click “Add Users” to finalize the user creation.

Step 3: Set up permissions for the user

The next step is to make sure the user has permission to read your externally connected data sources. This will be used to link the workbooks with the tables in Stemma. By default, a “Viewer” user will not have access to any data source.

Click on the “External Assets” from the side navigation, click on “Select All“. A new dropdown will appear. Click on the “Actions” dropdown and select “Permissions“.

A Permissions dialog will appear under the “Permission Rules” section. By default, you will be on the “Database” tab.

Click “+ Add Group/User Rule” and select the user you created in Step 2 of this document, set the Template “View” and Save the rule.

Click on the “Tables” tab, and repeat the steps to add the rule for the user with the “View” template.

Stemma uses information about the upstream data sources in order to automatically create data lineage from your dashboard to your upstream table. In order to enable this, Stemma needs access to read the data source metadata. To enable the Stemma user permissions required by the Tableau Metadata API, you will need to provide “View” and “Connect“” access to each data source and workbook that Stemma will read from.

Step 4: Create a Personal Access Token

Log in to your Tableau instance with the user you want to use for Stemma integration. If you created a new user for this integration, make sure you log in as that user.

Select My Account Settings from the dropdown of your profile.

Scroll down to the “Personal Access Tokens” section, enter the token name “stemmaViewer” and click “Create new token”.

A new pop-up will appear with the personal access token and secret. Make sure you copy these somewhere safe as they will not be displayed again.

Step 5: Provide the details to Stemma

a. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:

b. In the pop-up window, choose Tableau as the Connection type, provide a Connection Name, and then in the dialog box fill out the fields as follows:

  • Base URL
    Example: <https://sample.online.tableau.com>
  • Access Token Name (see Step 4).
  • Access Token Secret (see Step 4).
  • Tableau Site Name

c. When you are finished, click Save Connection.

Optional:

You can opt to exclude or include some items by providing the following to Stemma by email or Slack:

  • Name of Projects to Exclude: If you want to exclude any sample projects or test projects from Stemma, provide the names of those projects.
  • External Assets Types to Include (Connection type):This is uncommon, but if you want to include and link only certain types of external assets with your Tableau workbooks in Stemma, provide the names of those assets you want to include and link. Note that this may limit the user experience of Stemma, and is not recommended.

Connecting to AWS S3

Stemma supports two methods of transferring files:.

  1. Stemma creates an S3 bucket: you provide Stemma the ARN for a role/group that should have access to read to and write from the bucket. You then provide the role/group to the users or resources that need access to the bucket. Stemma recommends this method.
  2. You create the S3 bucket: Stemma provides you the ARN of the role that the Stemma app uses. You will need to add the following policy to the bucket to allow Stemma access. Make sure you replace <BUCKET_NAME> and <ARN_FROM_STEMMA> with the appropriate values.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "StemmaBucketAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<BUCKET_NAME>/*",
                "arn:aws:s3:::<BUCKET_NAME>"
            ],
            "Condition": {
                "ArnLike": {
                    "aws:PrincipalArn": "<ARN_FROM_STEMMA>"
                }
            }
        }
    ]
}

Guidelines for a Successful Stemma Rollout

Every Stemma rollout is different, depending on your organization’s particular needs, your data storage and management applications, and the people who use them in various ways and with varying priorities. So the following are guidelines, not rules, but they are based on the accumulated experience of Stemma’s founders and engineers, and are a good place for you to start.

Criteria for Success

The two criteria that matter in measuring the success of a data catalog are

  • adoption; and
  • customer satisfaction

If your rollout achieves both of these in a high degree, you can declare victory in the knowledge that you have improved the lives– and in particular the productivity, efficiency, and awareness– of your users and are contributing significantly to your organization’s success.

The following broad steps are designed to help you get there.

Steps for Rollout and Adoption

Step 1. Start with a persona and its use-cases

There are many user personas and use cases for a data catalog. Here’s a simplified view of the most common ones. It’s less important which persona you start with than that you choose a specific group of alpha users.512

Step 2. Launch in phases

Phase 1: Identify a small set of tables to get alpha user feedback on.

This set can be your organization’s most commonly used tables (often referred to as “core” tables) or one domain within the company, such as marketing, growth or finance, etc. Core tables are often the best choice, because they have the most impact and there’s often a central data team which is responsible for maintaining them.

Phase 2: Populate MVP metadata on these tables.

  • Stemma can ingest the majority of this metadata automatically (see What Stemma Enables You To Do) leaving you and your users to curate only certain important items such as ownership, table descriptions and table status
    .* Where you must, to glean tribal knowledge, it may also help to do a “docs jam session” with a group of data producers and consumers. (You might even offer a reward such as a gift card for those who enter the most documentation!)

Phase 3: Alpha launch to 5-20 alpha users.

  1. From the persona you decided on earlier, choose highly vocal users and those with the most tribal knowledge. These users will become Stemma’s avid supporters when you launch to a broader audience. (Note:Make sure these alpha users have the access they need to do work in Stemma (see SSO integration under What We Need From You) and that Stemma user names and authorities map properly to users of integrated applications such as Snowflake
  2. Incorporate feedback and iterate. Look particularly for feedback that indicates opportunities for productive metadata propagation and cross-referencing (for example, someone might say something like, “Oh, we already have this metadata in this spreadsheet — we should pull that in here, too.” )

Tasks your alpha users should become comfortable doing include:

  • Search – Search for tables and dashboards using a Page-Rank style search

720

  • See table details:
    • Description and status:
    • Issues (JIRAs)
    • Last updated
    • Data owner(s):
336

441

  • See column descriptions:

1296

  • Perform table actions:
    • Edit table status, description, issues, owners
    • Add/edit/remove tags
336

144

  • Show items commonly used with tables:

1008

  • See table lineage
  • Search for dashboards and see dashboard details:
    • Owners
    • Created time
    • Last updated time, etc.

308

  • See queries run by the dashboard, etc:

720

Phase 4: Beta launch to all users of the prioritized persona.

  1. Focus your beta launch on your chosen persona (data consumers, for example). Don’t blur the focus of your launch by opening it up to all personas. Those other personas will come on board at GA, after your alpha and beta users have thoroughly learned and exercised the product and can serve as its advocates and in-house experts.
  2. Graduate to GA if you can meet success targets.

Step 3. Achieve widespread adoption

Best practices for achieving the greatest adoption include:

  • Update Slack channel headers where people ask each other questions. Stemma’s Slack integration links these conversations to the catalog automatically.
  • Embed Stemma training in new hire training. Tagging data sets by domain (marketing, growth, etc.) can help new hires quickly onboard to their domains. Showcase Sremma as an entry point into your existing training, for example by having all technical new hires instrument a metric during onboarding.
  • Create linkages with other products. Create links between various data tools. For example, auto-populate a link between an Airflow DAG that populates a table and the table page in Stemma (and vice-versa). (See Sharing additional metadata). Another productive linkage is between the table page in the data catalog and a link to the code that is used to generate the table.
  • Showcase the catalog at a group or company meeting. Deliver a short five-minute demo at a meeting that includes your target users. Educate, answer questions, and thank your alpha users — this creates more awareness and provides an opportunity for everyone to learn.

Step 4. Measure success

1. Adoption

  • Track WAUs (Weekly Active Users).
    • Start with Weekly Active Users (WAUs) rather than Daily Active Users or Monthly Active Users. Common usage frequency is weekly, not daily or monthly.‍
      • Target Penetration rate: 80%. A great penetration rate is 80% of WAUs within your target persona.

2. Customer Satisfaction (CSAT)

  • Measure CSAT periodically.
    • Send out a survey every three to six months asking users to rate their experience with Stemma.

Other metrics that organizations often consider include documentation quality, search quality, etc. But it’s best to stick to the above core metrics at the outset. As your Stemma implementation matures and ingests more metadata over time, you can begin to broaden your criteria.

Migrate from existing Amundsen

Migrating from Amundsen involves three main steps:

  1. Migrating existing metadata from Amundsen to Stemma
  2. Ingesting metadata continuously
  3. Cut over

1 – Migrating existing metadata from Amundsen to Stemma

In this step, you take all the existing metadata from Amundsen and load it into Stemma. The metadata imported includes, but is not limited to:

  • Table and column descriptions
  • Tags
  • Ownership information

To import this metadata, upload a dump of your Amundsen neo4j metadata into Stemma:

  1. Bring the Amundsen storage offline.
  2. Use dump -> load, available on all neo4j installations, to dump and upload your data.

After import, your Stemma instance will look very similar to your existing Amundsen instance:512

But your data won’t be periodically updated until you enable periodic updates; see the next step.

2 – Ingesting metadata continuously

You need to configure Stemma to ingest metadata updates on an ongoing basis. Once this configuration is done, you won’t need Python databuilder jobs to ingest data.

Information ingested in this step includes:

  • Table and column names
  • Linked issues (JIRA tickets)
  • Frequent users
  • Lineage, if it exists

IIn Stemma’s Admin interface, provide credentials to your data sources to allow Stemma access to extract metadata on an ongoing basis:

374

Stemma will now do the work to “upsert” metadata updates, as they occur, into the data you uploaded from Amundsen in the previous step.

3 – Cut Over

At this point, Amundsen and Stemma are running side-by-side, and Stemma recommends sharing access to Stemma with your power users and getting their feedback to ensure all of their use-cases are supported. Usually one week of overlap to obtain feedback and an additional week to incorporate feedback, if applicable, is sufficient.

When all looks good, simply redirect the URL of your internal Amundsen to the Stemma URL.

If you have any feedback or questions about moving from Amundsen to Stemma integration, please reach out to [email protected]!

Using Stemma

Here you’ll find instructions on how to use all the features in Stemma.

Using the Stemma Bot in Slack

Stemma’s goal is to capture information about data no matter where it happens. Slack is increasingly where conversations about data exist, and Stemma has a super simple integration that allows you to pin Slack conversations to tables.

Here’s how it works. If someone asks a great question, or posts a great answer about a piece of data in Slack, you can open a thread and tag the Stemma bot:748

You can now go to the Table Detail page in the Stemma UI to see the linked conversations:1186

Where can you use the Slack bot?

You can use it in public or private channels. You just need to invite the bot into the channel. It cannot be used in direct messages.

Some other things to keep in mind about the integration:

  • You must respond in a thread
  • You need to give a precise definition. If a table is only present in one schema, “@Stemma table_name” will work just fine. But if it’s present in multiple schemas, you’ll need to use “@Stemma schema.table_name”

If you have any feedback or questions about the Slack integration, please reach out to [email protected]!

Messaging Users

Stemma provides a quick and easy way to send messages to specific groups of users on the basis of their relationship to a table (e.g. are they an owner in Stemma, have they queried the table in the prior three months, etc.). If you need to announce a future change, for example, or notify users when significant events occur, such as when a table has been deprecated, you can use Stemma messaging to ensure that the affected users will receive the notification.

🚧 Before you begin – enable email and Slack notifications

If you want to enable Slack notifications, make sure that you have completed the Slack Integration

Notification sources

There are two forms of notifications within Stemma:

  1. Table status changes: you can send a message to affected users when the status is changed, for example from Certified to Deprecated.
  2. Custom messages: You can create your own message and choose which groups of users and teams to send it to.

Table status changes

You can notify the owners of a table when the status changes.

📘 Note

This feature sends a notification to all owners of the resource. If you want to notify other selected user groups, you can use the Message functionality. Learn more about user groups.

Custom messages

You can compose a custom message to users and teams related to a table by clicking the Message button on the top-right corner of the Table detail page.

Select the user groups you want to notify and type your message in the box.

Notification types

There are three types of notifications you can send in Stemma, depending on the receivers and your personal preferences. If a member of a related group is a user, they can receive either email notifications or Slack private messages. If a member of a related group is a team (Slack channel), they will receive a message in Slack.

Slack notifications

After you have completed our Slack Integration, you can send and receive messages from the Stemma bot.

Messages in channels

If a member of a related group is a team, they will receive a message from the Stemma bot in the corresponding Slack channel.

📘 Note

Make sure you add the bot to all channels that need to receive messages.

Slack private messages

You can choose to receive notifications from Stemma as Slack private messages. To receive messages from the bot, you need to add the Stemma application to your Slack configuration.

Click **Browse apps** in the **Apps** section located on the bottom-left side of the screen
Click Browse apps in the Apps section located on the bottom-left side of the screen
Select the **Stemma** app from the list.
Select the Stemma app from the list.

📘 Note

To modify your preferred notification method, contact your Stemma representative.

User groups

Each group contains a list of users and teams (Slack channels). In Stemma there are four different types of user groups related to a table:

  • Owners of the table
  • Owners of dashboards using the table
  • Owners of derived tables
  • Frequent users of the table

Stemma will only show the options that apply to the given asset. For example, if a user is sending a custom message for a table that does not have any downstream lineage there will not be an option for Owners of derived tables.

Using Tableau with Stemma

Stemma works with Tableau to help answer common questions data analysts have, such as:

Does this dashboard already exist?

Stemma indexes Tableau dashboards along with their underlying tables, providing a one-stop search for all data assets. Stemma also tracks how many times a dashboard has been viewed, instantly highlighting which is most commonly used.1400

Is the right data being used?

Stemma automatically correlates your Tableau dashboards to the corresponding table asset by reconciling the table names extracted from the Tableau Metadata API with the existing table metadata Stemma has captured from your warehouse. This allows you to start your exploration by finding trustworthy reports and, in turn, using the lineage to discover which tables are useful to you.1388

You can also explore the table’s lineage upstream or downstream to get an end-to-end view of where the data came from.1391

How can I use the metrics in this dashboard?

Tableau provides Stemma with the SQL that it creates to generate each sheet, and Stemma indexes this SQL so you can easily find the code. You can copy and paste the code into your IDE to jump-start the SQL exploration.1600

Best Practices for Data Ownership

👍 The three types of ownership that work well

It works well to break down ownership of the most important date into three distinct roles:

  • Delivery owner — Ensures that this particular data gets delivered on time, on a promised delivery SLA. Usually, this is a data engineer or analytics engineer responsible for developing and maintaining the pipeline that produces this data.
  • Domain owner — What does this particular value in a field (or column) mean? When does this particular event get triggered? Usually, this is a product engineer who created the event or the analysts and data scientists who use this data most frequently and understand the physical reality that the data represents.
  • Policy owner — Ensures that this data gets used according to the classification and terms associated with it. Sometimes you acquire data from a source that should not be used for a certain category of use-cases. For example, YouTube is permitted to show ads to kids, but not permitted to personalize them. Therefore, the personalization data can’t be used if the subject is a child. The person making these calls is usually not an engineer or data scientist, but someone on the Policy or Privacy team at the company.

👎 Types of ownership that often don’t work as well

  • Overall Quality owner: In practice it’s often hard to find people able and willing to take ownership of the end-to-end quality of a data set. This is because data engineers don’t consider themselves owners of the data that’s produced from the upstream application and don’t want to be responsible for hunting down a website bug that impacts the data in the warehouse. Product engineers, on the other hand, don’t have enough context about how data gets joined and transformed downstream to own the final derived data artifact. This may change as decentralized data management (or data mesh) is deployed more broadly.
  • Shared ownership: This often does not work in practice, though it can work well if the owner group shares a good understanding of the different types of ownership and the group can efficiently redirect each question to the right person.

🏥 The data engineer’s role

In practice, the data engineer plays the role that a triage nurse would play in an Emergency Room. When an issue arises (like a patient arriving in the ER), the data engineer triages to see what’s going on. Sometimes it’s a problem the engineer can fix), so they fix it and resolve the issue (like the nurse treating an injury on the spot). In other cases, the engineer redirect the issue to the appropriate owner (as in a referral to a doctor or other health practitioner). In short, the issue determines who the owner should be.

Using Stemma

Use Stemma to assign and document ownership. Divide ownership as follows:612

Stemma makes it easy to assign table ownership broadly: see Assigning owners in bulk

Assigning Ownership of Tables

You can assign ownership of a single table to individuals or teams. You can also assign owners to a set of tables in bulk, adding all the owners in a single transaction.

Note: You can assign owners to tables in Stemma, but not to dashboards. Stemma ingests ownership of dashboards from the underlying BI tool(s) and you need to make updates and modifications in those tools.

The importance of ownership

Ownership is one of the fundamental pillars of a data catalog. It defines responsibility by individual users and teams, adding tremendous value to the data in the catalog, including:

  • Delegation of duties: Which users are responsible for keeping this table properly documented?
  • Compliance with processes and procedures: Has the right team updated the documentation for this table before the newest pull request is merged?
  • Reporting and analytics: Which teams have the highest percentage of their assets documented? The lowest?

For advice on assigning ownership, see Best Practices for Table Ownership

Assigning ownership to an individual table

You can assign ownership of a table either to individuals or teams:

  • Individuals are the people in your company who use Stemma. You can find individuals by their email address or a Slack handle.
  • Teams are represented in Stemma by Slack channels.

To assign one or more owners to an individual table, choose the table in the Stemma UI and click on Add Owner at the bottom of the left panel.

But Stemma also allows you to bulk-assign table ownership across the catalog. You can do this in a single transaction, immediately allowing your users to identify the assets that belong to them.

Assigning owners in bulk

You may think about ownership as a hierarchy, or your teams may typically own all of the tables within a single database or schema, or perhaps your analysts use Dbt to create their own workflows and create tables with specific naming conventions (e.g. marketing_salesmarketing_prospects). Regardless of how your company thinks about grouping assets, Stemma can help you to bulk-assign owners, using advanced search to find any of the assets you want to update and assigning the owners with a single command.
To assign ownership of tables in bulk, proceed as follows:

  1. From the home page of the Stemma UI, navigate to Advanced Search.
  2. At the top of the left panel, make sure only the Tables resource is checked (you can’t bulk-assign owners to dashboards).
  3. Optionally use the filters in the left panel to limit the search results to the set of tables to which you want to bulk assign ownership:
    • Stemma suggests you limit your search results using only the filters on the left; while you can also use the search bar at the top, doing so makes it more likely that tables that you don’t want included slip into the results
  4. Click on Bulk Edit at the bottom right of the page and choose Add Owners.
  5. In the dialog box, enter or select an email address or Slack channel and choose Add.
  6. To add more owners, repeat Step 5 for each new owner.
  7. When you are finished, click Add Owners and then Done.

Stemma automatically pulls in the teams you’ve defined as well as the list of users in your organization:

Note: When you assign ownership in bulk, this does not trigger automatic messages to the owners and users; that only happens when you change to assign ownership individually.

Displaying Data Lineage in the Stemma UI

Tables

  1. Navigate to the details page for the table.
  2. Click on Lineage graph in the menu bar at the top of the page.
  3. You will see something like this:

From here you can:

Adjust the view: zoom in, zoom out, center, and fit to the page:

then choose Table Details to Table Lineage

See the table’s details page in the Stemma UI and a new lineage graph for each upstream and downstream table: click on the arrow and then choose Table Details or Table Lineage:

See and highlight the lineage of each table:

Collapse tables into groups: first hover over the logo:

hover over the logo

Then click the stack:

click the stack

To expand the groups again:

First click the down arrow
Then click the stack

Search, for an exact match only, on table properties. (In this example, statewide_testing will match, but not  statewide testing):

Columns

  1. Navigate to the details page for the table.
  2. Click on Columns in the menu bar at the top of the page.
  3. Click on the up-arrow next to the column whose lineage you want to see.
    Stemma shows the top five upstream and downstream columns.

Dashboards

  1. Navigate to the details page for the dashboard.
  2. Click on Lineage graph in the menu bar at the top of the page.
  3. You will see something like this:

From here you can:

Adjust the view: zoom in, zoom out, center, and fit to the page.

See the tables used by the dashboard and the columns used from each table:

See and highlight the lineage of each table:

Show the table’s lineage in an individual graph and the table’s details page in the Stemma UI; click on the arrow and then choose Table Details or Table Lineage:

Search, for an exact match only, on table properties (for example, statewide_cases will match, but not  statewide cases).

Collapse tables into groups and expand the groups again (see Tables).

Using Data Lineage Graphs

Use Data Stemma’s Lineage Graphs to:

Some examples

  • An analytics engineer builds a table using Dbt; that table becomes popular and several dashboards are built on top of it. It becomes well trusted, and the logic becomes a standard for calculating certain metrics. But the table is missing several important entity IDs (foreign key columns) which makes it difficult to join to other tables. The analytics engineer and a data engineer work together to complete a “micro migration” to a new, more performant and reusable table.
  • A machine-learning model is used to predict churn for a customer. Each day after the model runs, users that are predicted to churn are identified in a table in Snowflake and then sent an email to re-engage them. The product team is launching the second major product and wants to measure churn on a per-product basis. The engineering team updates their data pipeline to add a new column to the output table that identifies the product that churn is being predicted for.
  • An analyst created several views to consolidate the common logic that the team were using to query tables in their dashboards. But the performance of the views has deteriorated as the amount of data has grown. The analyst and the engineer work together to create materialized versions and to update all of the dashboards using the prior views.

Use Data Lineage in Stemma to Identify Affected Assets

The challenge with such changes is that it is hard to accurately curate a list of all assets that will be affected. If you cannot be certain of what will need to be changed or what could break, you cannot do risk analysis or even make a time estimate. Stemma enables you to do the analysis and create the appropriate messaging.

The key question for this type of change is “What tables and dashboards will I need to update if I rename this view?”

Stemma provides a lineage graph for each table and dashboard. From here you can see who and what is utilizing your resources downstream, and how popular your assets are. Data owners can use this in conjunction with their own knowledge to build roadmaps and plan how data should evolve.

Use Stemma Messaging to Communicate Expected Changes

Everyone affected needs to know what is going to change and when. After building your list of affected assets and their users, use Stemma messaging to provide advance notice and instructions.

How Stemma Discovers Data Lineage

When you execute SQL against your warehouse, the database captures the raw SQL and saves it in a historical format that includes the time the SQL command was executed and the user who executed it. Stemma reads these SQL logs and parses all of the commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page.

In the case of table and column lineage, any change to a table– via an insert, update or delete from another table– will create the appropriate table and column lineage tracking.

Note

When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.

See also: How Stemma Captures Data: SQL Parsing Overview and Caveats

Sharing additional metadata via CSV files

Along with a direct connection to various data sources, Stemma can ingest data using extracts in various file formats. One of the most commonly used file formats is CSV.

Below are some of the CSV Formats Stemma ingests today (along with the sample files), with the column names and values we expect in each file. Note that you need the column header literals (for example,db_name, etc.) as well as the values.

Github Links

Used to provide a link to a Github source from Stemma’s Table detail page.

db_nameclusterschematable_namesourcesource_type
hivegoldtest_schematest_table1https://github.com/amundsen-io/amundsen/github

db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
source: the Github URL for the repository
source_type: static value, “github”

Airflow

Used to generate a link between a table and an Airflow task. Once the data is ingested, you will see an “Airflow” button on the Stemma detail page for the table.

task_iddag_idexec_dateapplication_url_templatedb_nameschematable_namecluster
hive.test_schema.test_table1event_test2018-05-31T00:00:00https://airflow_host.net/admin/airflow/tree?dag_id=SUPER_AWESOME_DAGhivetest_schematest_table1gold

task_id:: the Airflow ID for the task
dag_id: the Airflow ID for the DAG
exec_date: the most recent execution timestamp for the task
application_url_template: the URL for the DAG in your Airflow instance
db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
cluster: the name of the database within the data source.

Dagster

Used to link a Dagstr op and/or job to a table in Stemma. Once the data is ingested, you will see an “Dagstr” button on the Stemma detail page for the table.

application_url_templatedb_nameclusterschematable_name
https://dagster_host.net/dagster/tree?job=dagster_jobsnowflakeprodtest_schematest_table1

application_url_template: the URL for the DAG in your dragster instance
db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.

Dbt Cloud

Used to link a Dbt Cloud job to a table in Stemma. Once the data is ingested, you will see a “Dbt” button on the Stemma detail page for the table.

application_url_templatedb_nameclusterschematable_name
https://dbtcloud_host.net/dbtcloud/tree?job=dbtcloud_jobsnowflakeprodtest_schematest_table1

application_url_template: the URL for the DAG in your Dbt Cloud
db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.

Table Tags

Stemma can create and assign tags via a CSV file, with the following format:

db_nameclusterschematable_nametags
hivegoldtest_schematest_table1“tag1,tag2”
hivegoldtest_schema_2test_table2“tag1,tag3”

db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
tags: the tags to be created. Tags are additive: existing tags on tables will not be modified.

Table Description

db_nameclusterschematable_namedescription
hivegoldtest_schematest_table1Description for table, optionally markdown.

db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
description: the table description. Note that descriptions are destructive: if a description already exists, it will be overwritten.

Table Owners

db_nameclusterschematable_nameowners
hivegoldtest_schematest_table1[email protected],[email protected]
hivegoldtest_schema_2test_table2“#test-channel,#test-channel-2”

db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
owners: email addresses or slack channels. Owners are additive: existing owners of tables will not be modified.

Column Descriptions

db_nameclusterschematable_namecol_namedescription
hivegoldtest_schematest_tabletest_columnThis is an example column description.

db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
col-name: the column described.
description: the column description. Note that column descriptions are destructive: if an existing description is present, it will be overwritten.

Finding Important Information about Tables and Columns

Stemma provides out-of-the-box integration with applications such as Snowflakedbt, and dashboarding systems such as Looker to capture rich metadata from each of these systems automatically. For example, in the case of Snowflake, Stemma reads both the information_schema and the query logs, and automatically generates important information such as when a table was last updated, table-to-table lineage, table-to-dashboard lineage, frequent queriers of the table, common join and filter conditions, etc.

The Stemma UI makes it easy for you to discover this important information about your tables and columns, and also provides a convenient way to add and modify curated information.

The table details page provides following information:

Table attributes

Asset status:

This can be CertifiedDeprecatedIntermediate, or No Status:

336

Description:

A brief summary of the contents of the table:

338

See Advanced Description Editing for more information.

Issues:

JIRA issues logged against this table:

441

This section explains how to link a JIRA issue to the Stemma catalog.

Owners:

The person or people responsible for maintaining this table:

441

Click on an owner to see other assets owned by that user, as well as their bookmarks and frequently used tables. You can also email the user via this profile page.

See also Assigning Bulk Ownership of Tables.

Last Updated:

The date and time of the last update to the table (this can be different from the most recent date covered by the data itself):426

Date Range:

The date range covered by the data in the table.

173

Note that this need not be synchronous with the Last updated date and time, which simply reflects the date of the most recent modification to the table.

Tags:

Descriptive tags for cross-referencing, such as covidcases, etc.:

187

Frequent Users:

People who have frequently used this table:

288

Click on a user to see assets owned by that user, as well as their bookmarks and other frequently used tables. You can also email the user via this profile page. See also How Does Stemma Calculate “Top 5 Columns”, “Frequent Users”, etc?

Commonly Used With

Links to tables often used with this one (see Using Commonly Used With). See also How Does Stemma Calculate “Top 5 Columns”, “Frequent Users”, etc?

288

Table lineage graph:

See Displaying Data Lineage in the Stemma UI :

Dashboards:

Dashboards that use the table:864

See also Finding Important Information about Dashboards.

Upstream, Downstream:

Upstream and downstream tables:

:576576

See also Displaying Data Lineage in the Stemma UI

Slack Conversations:

Slack threads that mention this table:864

Column attributes

Column description:

Brief description of the column contents:864

See Advanced Description Editing for more information.

Data type:

Text or Number:

750

Column Statistics:

For example, minimum and maximum values over a period:720

Shared descriptions:

See Sharing Column Descroptions.

Column usage and lineage:

The most-used upstream and downstream columns, as measured by the number of queries, dashboard views, catalog views, etc.:

288

See also Tracking Data Lineage and How Does Stemma Calculate “Top 5 Columns”, “Frequent Users”, etc?.

Curated versus automated metadata

Where possible, Stemma captures the information directly from the source. This table shows which attributes are populated from the source and which you can edit:.

Table MetadataAutomatically captured?Users can edit?
Table description.
Asset status.
Owners✅ (dbt)
Tags✅ (dbt)
Jira issues
Last updated time
Data date range
Commonly used with tables
Frequent users
Upstream & downstream table lineage
Dashboards that use the table
Slack threads with table mentions

Notes

Editing curated items

To add or edit a curated item, hover you cursor over over it until you see a pencil; then click on the pencil and add or edit text in the resulting box.389389

Using Commonly Used With

Clicking on a link below this heading opens a dialog window. This window provides details of the usage; for example:1008

Advanced Description Editing

The Stemma UI provides a WYSIWYG editor for descriptions of tables and columns.

Using the editor

  1. To begin editing, navigate to the table or column whose description you want to edit, and click on the pencil icon next to Description:

This opens up the text box for editing:

  1. Add or change the description.
    • Descriptions can include links and images as well as text.
    • You can choose font styles as indicated by the buttons.
    • Use the pull-down menu to toggle the text format from the default (Normal) to common formatting elements such as headings, lists, quotes, and code blocks.
      • When you choose Code Block, you can choose the type of code from the secondary pull-down menu (the default is JavaScript).
    • You can add hyperlinks: enter and highlight the text to display, then click the 🔗button and enter the URL (and edit it if necessary) in the text box next the button.
    • You can also add links to other tables.
    • The editor also supports Markdown syntax, as well as any shortcuts your system supports (such as ⌘-b for bold on a Mac)
  2. When you are finished, click Update to save your changes.

Linking to other tables from the description

This is useful, for example, if you want to give other users more information about the relationship between the tables.

To link to another table from the description, do the following:

  1. In the Stemma UI, navigate to the table whose description you want to edit.
  2. Click on the pencil icon next to Description.
  1. Enter the new desription providing the related table’s<schema_name>.<table_name> anywhere in the text.
  1. Click Update to save your changes.

The new description will contain the link to the related table:

Sharing column descriptions

You can share a column description with all columns of the same name across the catalog. To do this:

  1. In the Stemma UI, navigate to the table whose column description you want to share.
  2. To add or modify a description, click on the down-arrow to the left of the entry for the column:
  1. Hover over the word Description and click on the pen that appears:
  1. Add or update the description and click on Update. Stemma automatically discovers all columns with the same name, and the description propagates to all linked columns. To see which columns are linked, and add and remove links, click on the three dots to the right of the column entry:

Click on the Edit linked columns pop-up. In the dialog window, you can:search for and add columns to link, and remove existing links.

When you have finished, click Done.

Finding Important Information about Dashboards

Stemma provides out-of-the-box integration with dashboarding systems such as Looker to capture rich metadata from these systems automatically.

The Stemma UI makes it easy for you to discover this important information about your dashboards, and provides a convenient way to add and modify curated information.

Status:

This can be Certified, Deprecated, Intermediate, or No Status:

288

Description:

A brief summary of the contents of the dashboard:

360

Owners:

The person or people responsible for maintaining this table:

288

This information is extracted automatically from the source and (unlike table ownership) can’t be edited in Stemma.

Tags:

Descriptive tags for cross-referencing, such as Covid, cases, etc.:

288

Created:

The date and time the dashboard was created:

360

Last successful run:

Date and time the dashboard was last run successfully:

360

Last Updated:

Date and time of the last update to the dashboard:

360

Last run:

Date and time the dashboard was last run (successfully or not):

360

Total View Count:

Total number of views over the life of the dashboard:

360

Tables:

Tables used by the dashboard:720

Charts:

Charts available from the dashboard:

360

Queries:

Queries run from the dashboard. Click on a query name to see the code:432

Open folder:

Open the dashboard’s folder in the underlying application (you may need to log in to the application).

Open dashboard:

Open the dashboard in the underlying application (you may need to log in to the application).

Using Admin Functions and Reports

API Keys

Allows you to add and delete keys for the Graphql API.

Connections

Allows you to add, modify and delete connections to Stemma. For instructions on adding connections, see Integrating Your Applications with Stemma: What We Need from You. See also How Do I Update an Existing Connection?

Reports

These reports provide an account of your organization’s interactions with the Stemma catalog, including curating work that needs to be done, as well as usage.They can be especially useful when you are rolling our Stemma to your users and scoping the work remaining to do.

  • Customer Summary
    Shows the number of data sources, tables, and dashboards that are integrated
    into the Stemma catalog.
  • Monthly product usage
    Shows the number of users and searches by month.
  • Documentation summary
    Shows the percentage of columns, dashboards, and tables missing documentation and/or ownership.
  • Assets without docs
    Shows the specific tables and dashboards that lack documentation. Click on the drop-down to switch between the two types of asset

864

  • Assets without ownership
    Shows the specific tables and dashboards that lack ownership (click on the drop-down to switch between the two types of asset).

288

How Stemma Captures Data: SQL Parsing Overview and Caveats

What Stemma captures

When you execute SQL against your warehouse, the database captures the raw SQL that was executed and saves it in a historical format that includes the time that the SQL command was executed as well as the user who executed it. Stemma reads these SQL logs and parses all of the SQL commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page:

  • Table and column lineage: any change to a table– via an insert, update or delete from another table– will create the appropriate table and column lineage tracking.
  • User table usage: Stemma uses select statements to associate users to the table(s) being queried; this provides the Frequent Users information on the Table Details page, search rankings, and the user’s profile page.
  • Commonly Used With: any joins that occur within the SQL, whether on the top-level query or as part of a sub-query, are captured and presented on the Table Details page to help other users understand how two or more tables are commonly joined.

Caveats:

  • When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.
  • When user IDs in the database do not match the user portion of the user’s email address as recorded in Stemma, Stemma cannot associate the Stemma user to the user in the query logs. This prevents Stemma from creating any user-table usage information.
  • When system accounts are used to query the data, Stemma is often not able to see which user is executing the query. This is a common practice with BI tools when multiple team members share the same connection to a database and that connection uses a system account.
    • However, if you are using Mode, Stemma can parse the comments appended by Mode (which do contain the user ID) to find the user and associate that user to the query.
  • Commonly Used With does not capture joins where the left or right table is a dynamically generated select statement, for example: select * from table_a a join (select * from table_b b1 join table_c c1 on b1.id = c1.id) b on a.id = b.id.
  • Stemma currently supports Snowflake out of the box. Stemma can also parse RedshiftAthena, Postgres and other databases, but does not provide turnkey support for those databases at present.

Planning Changes to Tables or Columns

When you are planning to make a backward-incompatible change, you need to proceed in a way that minimizes the impact and gives downstream stakeholders as much warning as possible. Examples of backward-incompatible changes are:

  • Changing a schema (type changed, name changed)
  • Changing values (data fields will no longer be populated, or will be populated differently)

You need to understand and communicate the impact before you make such a change. This is usually done in the following steps:

  1. Understand the impact: understand who uses or queries this data.
  2. Provide heads-up notification: notify downstream owners and users that this data is going to change, and that they need to update their assets (downstream tables or dashboards) accordingly. Provide a time window after which the change will go into effect.
  3. (ideally): Understand the impact again: see if downstream users have changed their assets; how many still need to be nudged, etc.?
  4. Make the actual change and follow up to make sure that it’s complete.

Stemma can help you make the change as painless as possible. Use the UI to assess the effect on downstream users and on other assets. The following sections provide help:

If you are deprecating a table:

  • Change the description to warn users about the impending change.
  • Change the status toDeprecated.

Now warn all owners and frequent users about the change, and give them a deadline to make their adjustments.

Note

If you decide to change the name of a table to indicate that it’s deprecated (e.g., to something like deprecated_tablename), that will cause it, by default, to be removed from Stemma after ten days.

Using the External GraphQL API

Creating and Installing an API Key for the External Graphql API

1. Create an API key in the Stemma UI:

a. Navigate to the Admin panel and choose Add API Key:

432

b. In the pop-up window, enter a name for your API key and click Submit. You can choose any name, but Stemma recommends using a name that describes the intended use of the key (e.g., github action integration):432

Stemma generates an API key that appears in the window. Save this key somewhere secure. It is not recoverable.

2. Install the Altair Graphql Client as one of these:

3. Add your API key to Altair as an Authorization header.

864

It must be in the form Bearer <api key> with a space between the word Bearerand your API key.

4. Enter your URL and HTTP method:

  • URL = /graphql
    E.g., https://preview.stemma.ai/graphql
  • HTTP method = POST

Now click Docs in the upper right of the Altair UI; you should see the graphql schema on the right hand side:864

Integrating Your Applications with Stemma: What We Need from You

What we need to get you up and running depends on the applications you will be integrating with Stemma, but in general you will need to provide the information and do the tasks for Slack, SSO, and your dashboard and storage applications.

Instructions are on the following pages.

Information Stemma Extracts Automatically

This page shows what information you can expect to see in Stemma after you have integrated all of your applications.

Does Stemma Read My Data?

In general, Stemma reads metadata (queries, who ran them, names of columns or charts, etc.), not the data itself.

  • For dashboards, Stemma reads only metadata and doesn’t have access to numbers, or read charts or reports within the dashboarding system.
  • For tables the same is true, with one exception: if you choose to use the column statistics feature, Stemma reads the data to obtain that information.

For more information:

Next Step

Once you have integrated your applications into Stemma and everything is tested and working, you need to plan the rollout to your users.

Athena Integration

Stemma’s Athena integration supports: extracting metadata information (tables, columns, etc)

Proceed was follows to create an Athena connection to Stemma.

Create a user

To integrate with Stemma we recommend creating a new user in IAM that has access specifically scoped to the actions and resources required. The following script enables the required access:

Terraform

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  # You are only creating IAM which are global, you may
  # change the region but it should not impact this script
  region = "us-east-1"
}

resource "aws_iam_user" "stemma_read_user" {
  name = "stemma-read-user"
}

resource "aws_iam_user_policy" "stemma_read_policy" {
  name        = "stemma-read-policy"
  user = aws_iam_user.stemma_read_user.name

  policy      = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "athena:GetDatabase",
          "athena:ListDataCatalogs",
          "athena:GetDataCatalog",
          "athena:ListDatabases",
          "athena:ListTableMetadata",
          "glue:GetDatabases",
          "glue:GetTables"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}

Generate credentials

After creating the user, generate an AWS Access Key and Secret for the user.

Provide credentials to Stemma

Now provide this information to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose AWS as the Connection type, provide a Connection Name, choose Access Key as the Authentication type, provide the AWS Access Key and AWS Access Key Secret you have just created, and specify the AWS Region:
  2. When you are finished, click Save Connection.

Testing the access

Optionally, you can validate the access to the catalog, databases and table that Stemma will have with the following code snippet; make sure to replace ACCESS_KEY and ACCESS_KEY_SECRET with the values for the Stemma Athena user:

import json
import boto3
from botocore.config import Config


client = boto3.client(
    "athena",
    aws_access_key_id="{ACCESS_KEY}",
    aws_secret_access_key="{ACCESS_KEY_SECRET}",
    config=Config(region_name="us-east-1"),
)
all_tables = []
for catalog in [cat['CatalogName'] for cat in client.list_data_catalogs()['DataCatalogsSummary']]:
    for database in [database["Name"] for database in client.list_databases(CatalogName=catalog)['DatabaseList']]:
        for table_name in [tbl['Name'] for tbl in client.list_table_metadata(CatalogName=catalog, DatabaseName=database)['TableMetadataList']]:
            all_tables.append(f'{catalog}.{database}.{table_name}')
            
print('All tables available:')
print(json.dumps(all_tables, indent=2))

Azure Active Directory Integration

Create an Azure Application

To create an application, follow these steps:

  1. Sign in to Azure portal.
  2. Click Azure Active Directory.
  3. On the Azure Active Directory page, click Enterprise applications.
  4. Click New application on the top menu.
  1. Click Create your own application.
  1. Enter a name for your application, we suggest “Stemma Data Catalog“.
  2. Select Register an application to integrate with Azure AD (App you’re developing).
  1. Click Create. This should redirect you to Register an application.
  2. Under Supported account types, select Accounts in this organizational directory only (Single tenant).
  3. Under Redirect URI (optional), select Web. If you are running a hosted version of Stemma, the redirect URI should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name. In case of on-premise deployment, please consult your Stemma representative for this.
  1. Click Register.

Create a client secret

To create a client secret, follow these steps:

  1. Go to Azure Active Directory and click App registrations.You must go to the top level in your Azure portal and then select Azure Active Directory. You can then select App registrations.
  1. Select your newly added application. Make a note of the Application ID.
  1. Click Certificates & secrets on the left-hand side and click New client secret.
  1. Provide a description and an expiration length that follows your security organization’s guidelines.
  2. Click Add.
  1. Copy the Value corresponding to the newly-created client secret and client secret expiration.
    This is the Relying Party OAuth Client Secret. This secret value is never displayed again after you leave this page.

Set up application permissions

You need to set up two application permissions in your Azure portal so that you can use all of the Azure AD synchronization options in Stemma. You need to set up the following permissions:

  • Azure Active Directory Graph Directory.Read.All
  • Microsoft Graph Directory.Read.All

To set up permissions, do as follows:

  1. Go to Azure Active Directory and click App registrations.You must go to the top level in your Azure portal and then select Azure Active Directory. You can then select App registrations.
  1. Select your newly added application.
  2. Click API permissions on the left-hand side and click Add a permission.
  1. Click APIs my organization uses and click Windows Azure Active Directory.
  1. Select the Azure Active Directory Graph > Directory.Read.All permission.
    a. Click Application permissions.
    b. Under Directory, click Directory.Read.All.
    c. Click Add permissions.
  1. Select the Microsoft Graph > Directory.Read.All permission.
    a. Click Application permissions.
    b. Under Request API Permission, click Microsoft Graph.
    c. Under What type of permissions does your application require?, click Application permissions.
    d. Under Directory, click Directory.Read.All.
    e. Click Add permissions.
  1. Click Grant admin consent for and then click Yes. You should see a message saying that you’ve granted consent for each of your permissions.

What We Need From You

Provide Stemma with the following information by email or Slack:

  • Application ID
  • Client Secret
  • Discovery Document Endpoint (Well Known OpenID Configurations)
  • Tenant ID – The ID of the tenant you created the application in.

BigQuery Integration

Stemma extracts table and column metadata including database, schema, table name, table description, column name, and column description from a Bigquery database.

What We Need From You

You will need to create a service account for reading metadata and grant it “BigQuery Metadata Viewer” access to all of your datasets. This can all be done via the BigQUery UI.

Next, collect the information Stemma needs, as follows:

  • Authorization/Access Scopes:
    Stemma uses the following scopes to extract metadata and usage information.
    — <https://www.googleapis.com/auth/bigquery.readonly>
    — <https://www.googleapis.com/auth/cloud-platform>
  • JSON Key File: Proceed as follows to create the JSON key file for a service account.

Create a service account (via Cloud Console):

  1. In the Cloud Console, go to the Create service account page.
  2. Select a project.
  3. In the Service account name field, enter a name. The Cloud Console fills in the Service account ID field based on this name.
  4. In the Service account description field, enter a description. For example, `Service account for Stemma
    BigQuery Integration
  5. Click Create and continue.
  6. Click the Select a role field, and make sure to select the following:
    — BigQuery Data Viewer (roles/bigquery.dataViewer) – to extract the metadata, watermarks and usage information. Popular tables will be generated using this information.
    — BigQuery User (roles/bigquery.user) – Needed to map the user’s information with the Tables i.e., Frequent Users, Table Owners, etc.
    — Private Logs Viewer (roles/logging.privateLogViewer) – Needed for enriching table usage.
  7. Click Continue, and click Done to finish creating the service account.

Do not close your browser window. You will use it in the next step.

Create a service account Key:

  1. In the Cloud Console, click the email address for the service account that you created.
  2. Click Keys, click Add key, then click Create new key.
  3. Click Create. A JSON key file is downloaded to your computer.
  4. Click Close.

Provide the information to Stemma

To provide this information to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Google Cloud as the Connection type, provide a Connection Name, and paste in the Service Account Key from the JSON file you have just downloaded.
  2. When you are finished, click Save Connection.

dbt integration

Stemma integrates with your dbt Cloud or Core projects and is able to ingest and keep up to date all of the metadata in the table below.

 dbtStemma
TablesTables
Table lineageTable lineage
ColumnsColumns
Table definitionsProgrammatic descriptions (dbt descriptions section)
Column definitionsColumn definitions

dbt Cloud

Integrating with dbt Cloud requires minimal information and access to the dbt Cloud account of your organization and an administrative access to Stemma.

The following 3 properties are needed from dbt Cloud:

  • Account ID
    • Usually visible within the browser URL on the account settings page: https://cloud.getdbt.com/next/settings/accounts/<account-id-number> or within any any dbt Cloud project: https://cloud.getdbt.com/next/deploy/<account-id-number>/projects/<project-id>.
  • Service Token
    • Create a Service Token with the appropriate permissions:
      • For organizations on the Team Plan, the account wide permissions required are Metadata and Read-Only.
      • For organizations on the Entreprise Plan, the necessary permission set is Account Viewer and Job Viewer for all projects that should have metadata ingestion.
  • Host
    • Unless organization specific customization has taken place, the host is likely cloud.getdbt.com

With these credentials in hand, all that’s left is adding the connection information to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection.
  2. In the pop-up window, choose dbt as the Connection type, provide a Connection Name, and fill in the API Token, Host and Account ID collected from dbt Cloud.
  3. When you are finished, click Save Connection

dbt Cloud integration may take up to 24 hours for initial ingestion.



dbt Core

To support ingesting metadata from dbt artifacts for dbt Core workflows, we require you to generate them daily and upload them to an AWS S3 cloud storage bucket provisioned by Stemma.

Stemma’s dbt integration relies on the following pieces of information:

  • The database name (e.g., “snowflake”, “postgresql”)
  • manifest.json and catalog.json (produced via dbt docs generate)
  • Optional: The base url for your dbt github repository (e.g., “https://github.com/{customer}/dbt-{customer}/tree/main
    “). This allows us to includes github links to your dbt models on Stemma’s table details page.

We recommend testing out the integration first by simply sending the two JSON files to your Stemma contact via Slack or email.

For an ongoing integration, we’ll provide you with the following information for the daily delivery of the dbt artifacts.

  • AWS Access Key pair
  • S3 Bucket (e.g, “s3://{customer}-stemma-integrations”)

The S3 prefix path will be “dbt/{date}” where “{date}” is the current days date in UTC in the format “YYYY-MM-DD”. For example, for a file delivery on 04/01/2022 we would expect to see the following files:

  • “s3://customerxyz-stemma-integrations/dbt/2022-04-01/manifest.json”
  • “s3://customerxyz-stemma-integrations/dbt/2022-04-01/catalog.json”

Delta Lake Integration

Proceed as follows to create a Delta Lake connection to Stemma.

Step 1: Generate a Personal Access Token

Follow these instructions to generate the Delta Lake personal access token that Stemma will need to extract your data into the catalog.

Step 2: Provide the Personal Access Token and Hostname to Stemma

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Delta Lake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows:
  • Provide the Personal Access Token you created in Step 1.
  • Provide the fully-qualified domain name of the Delta Lake host; for example mydlhost.mydomain.com.
  1. When you are finished, click Save Connection.

Google OIDC Integration

This document walks you through the steps to create and configure a Google OIDC application that Stemma can use to authenticate users.

Note: You must be a Google administrator to do this.

Steps

  1. Log in to Google Developer Console
  2. On the left, select Credentials
  3. Select + CREATE CREDENTIALS at the top

1664

  1. Select “OAuth Client ID”

948

  1. Enter the following information:
  • Application type: Web application
  • Name: You can choose the name; we suggest “Stemma OIDC” as a descriptive example
  • Authorized JavaScript origins: This must be “https://your-company.stemma.ai
    • Replace your-company with the correct value from your Stemma domain
    • If you have a custom domain name for Stemma, use that value here
  • Authorized redirect URIs: Provide the following URLs; make sure you replace your-company with the proper value:

1034

  1. Select Create at the bottom.

758950

  1. Download the Client ID and Client Secret JSON file and provide them to Stemma by email or Slack.

Hive Integration

What we need from you

Stemma needs certain information and credentials to extract Hive metadata into the catalog. Contact Stemma by email or Slack, and provide the following:

  • Metastore Host: Hostname is the IP address of the Hive server to which you are connecting.
  • Username and Password: You will need to provide a username and password for Stemma to use to access the Hive schema.
  • Metastore Port: Server port used for accessing metadata about hive tables and partitions. The default Hive metastore port is 9083.
  • List of Databases: Stemma whitelists databases, and so you will need to provide a list of the databases we will be importing.

Metadata extracted

The metadata Stemma extracts includes:

  • TBLS – stores basic information about Hive tables, views, and index tables.
  • DBS – stores the basic information of all databases in Hive.
  • PARTITION_KEYS – the field information of the table storage partition.
  • TABLE_PARAMS – stores the attribute information of the table/view.
  • SDS – saves the basic information of file storage, such as INPUT_FORMAT, OUTPUT_FORMAT, whether it is compressed or not.
  • COLUMNS_V2 – stores the field information corresponding to the table.
  • PARTITIONS – stores the basic information of table partitions.

JIRA Integration

This document walks you through the steps you need to take to integrate Stemma with your JIRA installation. You must be a JIRA Admin to do this.

Steps

Step 1 – Collect information about your JIRA environment

Collect the following information to provide to Stemma (see Step 3):

  • The URL for your JIRA environment (e.g., <https://jira.net>)
  • The project ID for the JIRA project you would like Stemma to create tickets in
    See this page for guidance from Atlassian on how to obtain the project ID for a JIRA project
  • If using JIRA Server or Cloud, the ID for the JIRA Bug issue type
    • Note: Skip this step if using JIRA on-premise
    • See this page for guidance from Atlassian on how to obtain the ID for the Bug issue type
  • Optional: The name of one or more labels you would like set on tickets created by Stemma

Step 2 – Create a service account

Stemma will need an email and API token to authenticate to your JIRA environment. Our recommendation is to create a JIRA service account so that the authentication is not tied to an actual user’s account.
a. Go to Administration > User management > Users
b. Click Create User and fill out the required information
c. Create an API token from the service account

  • For JIRA cloud see this page
  • For JIRA server see this page
  • For JIRA on-premise this feature is not currently supported so you will provide the password instead

d. Grant permission to the service account

  • The service account will minimally need access to create and view tickets in the JIRA project you intend to use
  • You can either grant permissions directly to the service account or you can grant the service account a role

Stemma needs:

  • The email address of the service account
  • The API token or password

See Step 3.

Step 3 – Provide the information to Stemma

Provide the information from Steps 1 and 2 to Stemma by email or Slack.

Linking JIRA Issues to Stemma

To link a JIRA issue to a table in the Stemma catalog, you need to include a Table Key tag in the issue description. The tag format is Table Key: {SOURCE}://{DATABASE}.{SCHEMA}/{TABLE} [PLEASE DO NOT REMOVE].

Looker Integration

This document walks through the steps required to create a service account in your Looker, which will be connected to your Stemma Instance for dashboard metadata extraction.

Stemma uses the Looker APIs (via Looker Python SDK) to extract the metadata of Folders, Dashboard, Dashboard Elements and Explore Queries, and data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Looker. Read Looker’s docs: https://docs.looker.com/reference/api-and-integration

Stemma takes leverage of the API3 keys (Client ID & Client Secret) to make a secure connection between your Stemma and Looker instances. Below you can find out the steps required to create the new API3 keys for this integration.

What is extracted?

Stemma extracts information about Folders, Dashboard, Dashboard Elements, Explore Queries, and data sources associated with them.

Stemma will not extract Dashboards that are deleted, hidden or in a user’s personal folder.

Steps

Step 1: Make sure you have a Looker Admin role

Looker → Stemma integration requires an Admin role to fetch the correct data and build the relationships between the entities like Table Lineage, Dashboard Owners, etc.

So, as a first step, make sure you have an Admin role.

Step 2: Generate API3 keys

The next step is to create an API3 Key. From the Admin menu, navigate to the Users page.

Find the user you want to use to generate the API3 key, and click the “Edit” button under the Actions column.

Once you are on the “Edit User” page, find the API3 Keys settings, and click “Edit Keys

Click “New API3 Key” to generate a new Client ID and Client Secret.

Step 3: Provide the details to Stemma

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Looker as the Connection type, provide a Connection Name, and then fill out the fields in the dialog box as follows:
    • Base URL
      Example: <https://stemma.cloud.looker.com/>
    • Client ID
      The one you generated in Step 2
    • Client Secret
      The one you generated in Step 2
  2. When you are finished, click Save Connection.

Mode Integration

The Stemma Mode integration accesses the Mode Discovery API. This API is distinct from regular Mode API: it is much more performant but only contains a subset of the available endpoints. This means Stemma is able to access only metadata about reports, not the report runs/results. Stemma also accesses the regular REST API to augment the information retrieved from the Discovery API. In all, there are four values required by Stemma:

  1. Your organization name in Mode
  2. The access token to access the REST API
  3. The password token to access the REST API
  4. A signature token to access the Discovery API

Optionally, if you would like Stemma to only extract metadata from specific Mode spaces you may also provide Stemma a list of the spaces to include.

Creating the Access Token & Password Token

You can learn about how to create this access here (https://mode.com/developer/api-reference/authentication/).

The access token and password token allow Stemma to augment the information within the Discovery API with the regular REST API to unlock two additional capabilities:

  1. The ability to link your dashboards to the underlying tables in your warehouse which are used to serve the dashboards.
    • Stemma will use the general Mode API for datasources to retrieve your data source information to find out the name of the database (you may use, PUBLIC, mktg, etc.) as well as the type of data source (e.g. Snowflake, Redshift, etc.) in order to apply the proper SQL parsing semantics.
  2. The ability to rank your dashboards by the number of times the dashboard has been viewed in the past 90 days.
    • Stemma will use the general Mode API for reports API to retrieve an accurate view count.

Creating the Signature Token

To authenticate with the Mode Discovery API a signature token must be created. This signature token can be created by running a curl command using the account’s REST API access token and password secret. The following code is a snippet from Mode that has been slightly adjusted to create a signature token that Stemma can use. Make sure to replace {organization} , {your-access-token} and {your-access-secret} with the appropriate values (organization can be found in the URL when you visit Mode, e.g. https://app.mode.com/<organization>/spaces). Visit the Mode documentation for information on creating the api-key and api-secret.

curl --location --request POST 'https://app.mode.com/batch/{organization}/signature_tokens' \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--user '{access-token}':'{your-access-secret}' \
--data-raw '{
    "signature_token": {
        "name": "stemma-api-token",
        "expires_at": "2024-10-02T21:00:00+0000",
        "auth_scope": { "authentication_for" : "batch-api", "authorization_type": "read-only" }
    }
}';

Please note, if this does not work it likely means you do not have permissions. Please contact Mode support with help creating a read-only signature token.

APIs used

APIs are listed in the table below. In the table below, Data Extracted is the data we actually store from the API (and may change as we add functionality). Risks highlights any potentially-sensitive information in the API (whether or not we actually store it).

Please note, that the “Report Stats” endpoint is not enabled by default in mode, so functionality coming out of it may not be available, until the endpoint is enabled.

NameDocsData ExtractedRisks
Chartshttps://mode.com/developer/discovery-api/analytics/charts/Chart name, type, query idLow: contains employee email addresses
Reportshttps://mode.com/developer/discovery-api/analytics/reports/Execution times, dashboard names, dashboard descriptions, creation/last modified times, owner email addresses,Low: contains employee email addresses
Report Statshttps://mode.com/developer/discovery-api/analytics/report-stats/Usage countsLow[1]
Querieshttps://mode.com/developer/discovery-api/analytics/queries/Name, Raw SQLModerate[2,3]
Membershipshttps://mode.com/developer/discovery-api/analytics/members/Mode usernamesLow: contains employee email addresses
Reports: detailedhttps://mode.com/developer/api-reference/analytics/reports/Dashboard view countsLow: contains metadata about the report itself, not the underlying executions of the report
Data Sourceshttps://mode.com/developer/api-reference/management/data-sources/Mode data source name and data source type.Low: contains access to information about the data source. Stemma should separately have access to this information but this API access is required to link a mode dashboard to a specific database.

Notes

Here are some additional notes about possibly sensitive information we may receive from Mode

  1. Dashboard execution times are non sensitive on their own, although it is possible to infer NPMI from who is running which dashboard queries at precise times. Stemma only displays aggregated forms of this information, but stores it disaggregated.
  2. Query names may contain NPMI (e.g. a query titled “EMEA Expansion Potential”, when the company hasn’t announced its intention to expand into EMEA).
  3. Raw SQL queries may contain sensitive information (PII), typically in the form of filter clauses. This is generally not allowed under most security policies especially under Open Collections, however, it is a possibility. By default, these SQL statements will be shown to users in Stemma.

Configuring Your Stemma Connection

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, optionally add a name for the connection, and choose Mode from the Connection Type drop-down menu.
  1. In the dialog window, provide the signature token and Mode organization name, and optionally the access token and password token..
  2. When you are finished, click Save Connection.

Okta Integration

This document walks you through the steps to create and configure an Okta application that Stemma can use to authenticate users.


Note: You must be an Okta administrator to do this.


Steps

  1. Log into Okta admin portal
  2. In the left frame, select Applications :
  1. Select Create App Integration:
  1. Select the options for OIDC – OpenID Connect and Web Application.
  1. In the New Web App Integration, enter the following values:
    • App Integration Name: Your app name, we suggest Stemma Data Catalog
    • Grant Type: select the following –
      • Authorization code
      • Client Credentials (Client acting on behalf of itself)
      • Refresh token
    • Sign-in redirect URI:
      • Use the fully qualified hostname and add the endpoint:
        • /auth
        • If you are running a hosted version of Stemma, this should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name
    • Sign-out redirect URI
      This should be the base Stemma URL. Example:
      • https://<customer>.stemma.ai
    • Trusted Origin:
      • If you are self-hosting Okta, provide the Base URI for your Okta Sign domain, otherwise, leave this blank
    • Assignments:
      • Select the assignment access you would like within your organization
  1. Stemma will require access by certain APIs to retrieve user information from your organization’s directory.
  • Navigate to Okta API Scopes :
  • Make sure to grant permission for the following scopes:
    • okta.users.read
    • okta.users.read.self
  1. Provide the following values to Stemma by email or Slack so that authentication can be built into your Stemma deployment:
    • Okta Discovery endpoint (well-known configs)
    • Client ID
    • Client Secret

752

Redshift Integration

What we need from you

Stemma needs certain information and credentials to extract Redshift metadata into the catalog. To provide these, proceed as follows.

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Redshift as the Connection type and then provide the following information in the dialog box:
    • A username and password for Stemma to use to access the INFORMATION_SCHEMA
    • The hostname of the Amazon Redshift cluster. You can find it in the Properties tab of your Redshift Cluster Details page. Under the Connection details section, copy the Endpoint up to but not including the colon.
    • A port number. The default port number for Redshift is 5439. From your Redshift Cluster Details page, you can find the port number under the Database configurations section of the Properties tab.
    • A list of databases: Stemma whitelists databases, so you will need to provide a list of the databases we will be importing.

Access

The Stemma Redshift integration requires access to two pieces of information within the Redshift install:

  • information_schema metadata
  • Query logs

information_schema

The specific tables and views we access include:

  • PG_GET_LATE_BINDING_VIEW_COLS
  • svv_external_columns
  • COLUMNS

Because of the way permissions work in Redshift, these tables are only populated with data to which the user has SELECT access. If you don’t want to use column stats, and therefore want to restrict access for this credential, Stemma will give you the SQL code required to create a materialized view. You create that materialized view as a superuser, and grant our credential the ability to read and refresh that materialized view.

Query logs

Stemma Redshift requires query logs. We can extract these in two ways: push or pull.

Push

The recommended integration involves configuring Redshift to dump user activity logs into S3.

There are multiple methods to achieve this, either using CloudTrail or directly from Redshift. the exact right way differs between organizations (many already have audit logging configured for other purposes).

Pull

Stemma can extract logs from STL_QUERY. However, because this option requires the credentials be a superuser, it is not the recommended integration method.

Column stats – Optional

The Stemma integration user must have full SELECT access to any tables or views for which you want to see column stats.

Slack Integration

This document walks through the steps required to install a Stemma Slack application in your workspace.

Prerequisite

Stemma does not publish the Stemma Slack application to the Slack Marketplace. To install this application for your workspace, you need to create a new application using an application manifest file.

If you have an application manifest file, make sure you have it handy before you begin. If you don’t have a manifest file, you can copy and paste the following into a file:

_metadata:
  major_version: 1
  minor_version: 1
display_information:
  name: Stemma
  description: Help you enrich your metadata directly via Slack
  background_color: "#0a1930"
features:
  bot_user:
    display_name: Stemma
    always_online: true
  slash_commands:
    - command: /stemma
      url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/slash/help
      ## Replace {SUBDOMAIN} with your Stemma subdomain.
      description: Learn how StemmaBot works
      usage_hint: help
      should_escape: false
oauth_config:
  scopes:
    bot:
      - app_mentions:read
      - channels:history
      - channels:read
      - chat:write
      - commands
      - groups:history
      - groups:read
      - links:read
      - links:write
      - users:read
      - users:read.email
settings:
  event_subscriptions:
    request_url: https://{SUBDOMAIN}.stemma.ai/stemma/slack/events
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
    bot_events:
      - app_mention
      - link_shared
      - member_joined_channel
      - message.channels
      - message.groups
      - team_join
  interactivity:
    is_enabled: true
    request_url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/interactivity
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
  org_deploy_enabled: false
  socket_mode_enabled: false
  token_rotation_enabled: false

Save the file with any prefix and in any location you like, but make sure:

  • The filename has the suffix .yaml.
  • You replace {SUBDOMAIN} with your Stemma subdomain.

Steps

Step 1: Verify your application manifest file

Make sure that at least the following properties in the application manifest file match your Stemma endpoint:

  • features → slash_commands → url:
  • settings → event_subscriptions → request_url
  • settings → interactivity → request_url

Step 2: Create and install your Stemma Slack application

Create a new application using the application manifest file.

a. Navigate to: https://api.slack.com/apps, and click Create an App.

b. You will be prompted to select a method to create a Slack Application. Select From an app manifest, and choose the workspace in which you want to install the Stemma Slack application.1088

c. Paste in your manifest file. Make sure you select the YAML format, and paste in the contents of the file (see Prerequisite above).1088

d. Click Next and verify the scopes and events the Stemma Bot is requesting. Finally, click Create to install the application to your workspace.

Step 3: Provide Signing Secret and Bot Token to Stemma

To secure the communication between Slack and Stemma, you need to provide Stemma with the Signing Secret and Bot User OAuth Token of your newly installed Stemma Slack application:

  • Signing Secret:
    • You can find the Signing Secret on the Basic Information page, in the App Credentials section.

2144

  • Bot User OAuth Token:
    • You can find this token on the OAuth & Permissions page, in the OAuth Tokens for Your Workspace section.

2364

Now provide the Signing Secret and Bot User OAuth Token to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Slack as the Connection type, provide a Connection Name, and then in the dialog box enter the Bot Token and Signing Secret.
  2. When you are finished, click Save Connection.

Step 4: Receive confirmation from Stemma

Stemma will use the information you provide to integrate your Slack workspace with your Stemma Instance. Once you get the confirmation, you should be able to type /Stemma help in Slack to see usage help for the Stemma Slack application.

Step 5: Make sure to add an app icon 😊

Under the Stemma (Slack Application) basic information Tab, add the Stemma icon in the “Display Information” section. Users will be much more likely to use the bot if it’s easy to identify.1024

(Nice to have) Please invite Stemma team member(s) to a shared test channel

At this point, you have the Stemma Slack application installed within your Slack workspace. In order to verify if things are working just fine with no errors, you can create a temporary shared channel and invite a couple of members from the Stemma team to that channel.

For information about using Stemma with Slack, see Using the Stemma Bot in Slack.

Snowflake Integration

What we need from you

Stemma needs certain information and credentials to extract Snowflake metadata into the catalog. To provide these, proceed as follows.

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Snowflake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows:
    • Credentials
      We support authentication with a username and either a password or a private key.
      • Username and Password: You need to provide a username and password for Stemma to use to access the INFORMATION_SCHEMA as described above.
      • Username and Private Key: If you set up authentication via key pairs you will need to provide Stemma with a private key to use when accessing INFORMATION_SCHEMA as described above. If your private key is encrypted, you will also need to provide Stemma with the passphrase.
    • Snowflake Account Name: If you use the snowflake web interface, the URL you use should look something like abc.snowflakecomputing.com. Everything that appears before Snowflake computing is your account name.
    • Warehouse: The name of the warehouse (also called Virtual Warehouse) Stemma will connect to.
    • Role: The role Stemma will use once connected. If Stemma is extracting query logs, this role will need to have MONITOR access on the warehouse.
  2. When you are finished, click Save Connection.

Access

The Stemma Snowflake integration requires read access to all tables that will be included in the catalog, plus query logs.

Metadata

The specific tables and views we access include:

  • TABLES
  • COLUMNS

Due to the permissions model of Snowflake, these tables are only populated with data to which the role has SELECT access. Metadata will be extracted one database at a time since each Snowflake database has its own INFORMATION_SCHEMA.

Query logs

Stemma extract logs from the QUERY_HISTORY table function in the INFORMATION_SCHEMA. Logs are used to inform lineage, usage, common usage patterns and more. Due to the permissions model of Snowflake, this table is only populated with queries executed in virtual warehouses that the role has MONITOR access on.

Access History

Stemma uses ACCESS_HISTORY to enrich lineage derived from query logs. To grant Stemma access, Stemma’s user role will need to have the IMPORTED PRIVILEGES:

grant imported privileges on database snowflake to role stemma_role;

If Stemma cannot read Access History, lineage will still work however accuracy will be reduced.

Note: If you try giving the Stemma role access only to the required view, without giving access to the SNOWFLAKE database and ACCOUNT_USAGE schema before that, Snowflake will probably reject your request with the “Grant not executed: Insufficient privileges.” error.

Tableau Integration

Usage information

The Stemma Tableau integration will sum view counts for all workbooks in a particular dashboard over all time.

Installation Instructions

This document walks through the steps required to create a service account in your Tableau, which will be connected to your Stemma Instance for metadata extraction.

Stemma uses the Tableau Metadata API (GraphQL) to extract the metadata of projects, workbooks, dashboards, and data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Tableau.
Read Tableau’s docs: https://help.tableau.com/current/api/metadata_api/en-us/index.html

Connection Type:

We support both Online and Extract Connections with no differences.

Steps

Step 1: Decide your authentication strategy

The Tableau Metadata API is authenticated using a Personal Access Token (PAT) attached to a specific user within your organization. You will always share with us the PAT, not the username/password of the user itself. The PAT has the same permissions as the user to which it is attached.

Most users simply will simply create a PAT attached to their admin account. This is the easiest to set up and simplest to manage longer term, and we recommend it. For this option, skip to Step 4.

However, because this means the PAT has write access to the instance and Stemma doesn’t use write access at all, this doesn’t strictly follow principle of least privilege. In cases where restricting such access is critical, we recommend creating a read-only service account, and then you’ll generate a PAT for that limited account. Continue to Step 2 for this option.

Step 2: Create a new Viewer user

If you decide to create a new user for this integration (otherwise skip to Step 4 to use your current user to generate the Personal Access Token), navigate to the Users menu from the side navigation.

Click on “Add Users” and select “Add Users by Email“. A popup will appear with a Form to create new Users.

Select the “Tableau” radio button, and write down the email address you want. You will not share this email address with Stemma, and use this to sign in to the Tableau instance in order to create a PAT.

Email Tip: If you use Google email, [your_current_user][email protected][your-domain] will allow you to receive emails on your current email address and satisfy the uniqueness requirement.

Select “Viewer” as the Site role, which will give this user minimum access to the Tableau, and click “Add Users” to finalize the user creation.

Step 3: Set up permissions for the user

The next step is to make sure the user has permission to read your externally connected data sources. This will be used to link the workbooks with the tables in Stemma. By default, a “Viewer” user will not have access to any data source.

Click on the “External Assets” from the side navigation, click on “Select All“. A new dropdown will appear. Click on the “Actions” dropdown and select “Permissions“.

A Permissions dialog will appear under the “Permission Rules” section. By default, you will be on the “Database” tab.

Click “+ Add Group/User Rule” and select the user you created in Step 2 of this document, set the Template “View” and Save the rule.

Click on the “Tables” tab, and repeat the steps to add the rule for the user with the “View” template.

Stemma uses information about the upstream data sources in order to automatically create data lineage from your dashboard to your upstream table. In order to enable this, Stemma needs access to read the data source metadata. To enable the Stemma user permissions required by the Tableau Metadata API, you will need to provide “View” and “Connect“” access to each data source and workbook that Stemma will read from.

Step 4: Create a Personal Access Token

Log in to your Tableau instance with the user you want to use for Stemma integration. If you created a new user for this integration, make sure you log in as that user.

Select My Account Settings from the dropdown of your profile.

Scroll down to the “Personal Access Tokens” section, enter the token name “stemmaViewer” and click “Create new token”.

A new pop-up will appear with the personal access token and secret. Make sure you copy these somewhere safe as they will not be displayed again.

Step 5: Provide the details to Stemma

a. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:

b. In the pop-up window, choose Tableau as the Connection type, provide a Connection Name, and then in the dialog box fill out the fields as follows:

  • Base URL
    Example: <https://sample.online.tableau.com>
  • Access Token Name (see Step 4).
  • Access Token Secret (see Step 4).
  • Tableau Site Name

c. When you are finished, click Save Connection.

Optional:

You can opt to exclude or include some items by providing the following to Stemma by email or Slack:

  • Name of Projects to Exclude: If you want to exclude any sample projects or test projects from Stemma, provide the names of those projects.
  • External Assets Types to Include (Connection type):This is uncommon, but if you want to include and link only certain types of external assets with your Tableau workbooks in Stemma, provide the names of those assets you want to include and link. Note that this may limit the user experience of Stemma, and is not recommended.

Connecting to AWS S3

Stemma supports two methods of transferring files:.

  1. Stemma creates an S3 bucket: you provide Stemma the ARN for a role/group that should have access to read to and write from the bucket. You then provide the role/group to the users or resources that need access to the bucket. Stemma recommends this method.
  2. You create the S3 bucket: Stemma provides you the ARN of the role that the Stemma app uses. You will need to add the following policy to the bucket to allow Stemma access. Make sure you replace <BUCKET_NAME> and <ARN_FROM_STEMMA> with the appropriate values.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "StemmaBucketAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<BUCKET_NAME>/*",
                "arn:aws:s3:::<BUCKET_NAME>"
            ],
            "Condition": {
                "ArnLike": {
                    "aws:PrincipalArn": "<ARN_FROM_STEMMA>"
                }
            }
        }
    ]
}

Athena Integration

Stemma’s Athena integration supports: extracting metadata information (tables, columns, etc)

Proceed was follows to create an Athena connection to Stemma.

Create a user

To integrate with Stemma we recommend creating a new user in IAM that has access specifically scoped to the actions and resources required. The following script enables the required access:

Terraform

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  # You are only creating IAM which are global, you may
  # change the region but it should not impact this script
  region = "us-east-1"
}

resource "aws_iam_user" "stemma_read_user" {
  name = "stemma-read-user"
}

resource "aws_iam_user_policy" "stemma_read_policy" {
  name        = "stemma-read-policy"
  user = aws_iam_user.stemma_read_user.name

  policy      = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "athena:GetDatabase",
          "athena:ListDataCatalogs",
          "athena:GetDataCatalog",
          "athena:ListDatabases",
          "athena:ListTableMetadata",
          "glue:GetDatabases",
          "glue:GetTables"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}

Generate credentials

After creating the user, generate an AWS Access Key and Secret for the user.

Provide credentials to Stemma

Now provide this information to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose AWS as the Connection type, provide a Connection Name, choose Access Key as the Authentication type, provide the AWS Access Key and AWS Access Key Secret you have just created, and specify the AWS Region:
  2. When you are finished, click Save Connection.

Testing the access

Optionally, you can validate the access to the catalog, databases and table that Stemma will have with the following code snippet; make sure to replace ACCESS_KEY and ACCESS_KEY_SECRET with the values for the Stemma Athena user:

import json
import boto3
from botocore.config import Config


client = boto3.client(
    "athena",
    aws_access_key_id="{ACCESS_KEY}",
    aws_secret_access_key="{ACCESS_KEY_SECRET}",
    config=Config(region_name="us-east-1"),
)
all_tables = []
for catalog in [cat['CatalogName'] for cat in client.list_data_catalogs()['DataCatalogsSummary']]:
    for database in [database["Name"] for database in client.list_databases(CatalogName=catalog)['DatabaseList']]:
        for table_name in [tbl['Name'] for tbl in client.list_table_metadata(CatalogName=catalog, DatabaseName=database)['TableMetadataList']]:
            all_tables.append(f'{catalog}.{database}.{table_name}')
            
print('All tables available:')
print(json.dumps(all_tables, indent=2))

What Information Does Stemma Extract Automatically?

The table below shows what information you can expect to see in Stemma after you have integrated all of your applications.

SourceMetadataUser UsageLineageQuery UsageColumn Stats
Snowflake
Redshift
BigQuery
Postgres
Athena
Delta Lake
  • Metadata: tables and columns, descriptions. Where available, the time that the table was last updated. See also Sharing Additional Metadata.
  • User usage: which users are using which tables in the database.
  • Lineage: how data moves between tables, views, and dashboards. See also Tracking Data Lineage.
  • Query usage: how tables are joined; common patterns for using tables together.
  • Column stats: statistics (mean, min, max, null count, etc.) for columns in a table.

For more information, see:

Can I Choose What Data Stemma Ingests?

If you don’t want Stemma to ingest all the available data in the applications you integrate (for example if you want to ingest just one schenma from a data lake), you can tell Stemma what to ignore: Contact Stemma by email ([email protected]) or talk to your support rep.

If there's a Curated and an Ingested Description, Which Will I See in Stemma?

If Stemma ingests the description of a table or column (for example, from Snowflake) and you subsequently edit or replace that description in Stemma, Stemma users will see that curated description from then on, even if Stemma subsequently re-ingests the original Snowflake description (or any other version of it).

At present, Stemma does not provide a way for you to see both the Stemma description and the ingested description side by side.

What Happens when a Table or Dashboard Is Removed?

Removing a table or dashboard causes it, by default, to be removed from Stemma after ten days. Similarly, if the table or dashboard is renamed or moved, the original entry is removed from Stemma after ten days.

Before removing an asset, you should deprecate it to give stakeholders sufficient warning.

Stemma Docs

Getting Started

What is Stemma?

Stemma is a data catalog built for users of the modern data stack. It uses automation and workflow integration to help data owners get documentation into the catalog, while making it easy for end users to discover and trust data.

Tools for the Data Team to Document Data

Auto-generated metadata handles basic stewardship so no asset starts blank. Data that is automatically generated includes:

Slack integration to capture context where and when it is expressed

WYSIWYG editor to easily annotate assets with text, images, and existing docs

Tools for Analysts to Find Data

Track data lineage

Explore and contribute to the Glossary

Browse tables:

  • By source type
  • By tags

Search and filter for columns

Tools To Keep Data Owners in Contact with Users

How to be Successful with Stemma

Kickstart documentation with automated data stewardship

Manage data changes with advanced lineage tools

Enhance existing user workflows by integrating the tools you already use


Integrating Your Applications with Stemma: What We Need from You

What we need to get you up and running depends on the applications you will be integrating with Stemma, but in general you will need to provide the information and do the tasks for Slack, SSO, and your dashboard and storage applications.

Instructions are on the following pages.

Information Stemma Extracts Automatically

This page shows what information you can expect to see in Stemma after you have integrated all of your applications.

Does Stemma Read My Data?

In general, Stemma reads metadata (queries, who ran them, names of columns or charts, etc.), not the data itself.

  • For dashboards, Stemma reads only metadata and doesn’t have access to numbers, or read charts or reports within the dashboarding system.
  • For tables the same is true, with one exception: if you choose to use the column statistics feature, Stemma reads the data to obtain that information.

For more information:

Next Step

Once you have integrated your applications into Stemma and everything is tested and working, you need to plan the rollout to your users.

Guidelines for a Successful Stemma Rollout

Every Stemma rollout is different, depending on your organization’s particular needs, your data storage and management applications, and the people who use them in various ways and with varying priorities. So the following are guidelines, not rules, but they are based on the accumulated experience of Stemma’s founders and engineers, and are a good place for you to start.

Criteria for Success

The two criteria that matter in measuring the success of a data catalog are

  • adoption; and
  • customer satisfaction

If your rollout achieves both of these in a high degree, you can declare victory in the knowledge that you have improved the lives– and in particular the productivity, efficiency, and awareness– of your users and are contributing significantly to your organization’s success.

The following broad steps are designed to help you get there.

Steps for Rollout and Adoption

Step 1. Start with a persona and its use-cases

There are many user personas and use cases for a data catalog. Here’s a simplified view of the most common ones. It’s less important which persona you start with than that you choose a specific group of alpha users.512

Step 2. Launch in phases

Phase 1: Identify a small set of tables to get alpha user feedback on.

This set can be your organization’s most commonly used tables (often referred to as “core” tables) or one domain within the company, such as marketing, growth or finance, etc. Core tables are often the best choice, because they have the most impact and there’s often a central data team which is responsible for maintaining them.

Phase 2: Populate MVP metadata on these tables.

  • Stemma can ingest the majority of this metadata automatically (see What Stemma Enables You To Do) leaving you and your users to curate only certain important items such as ownership, table descriptions and table status
    .* Where you must, to glean tribal knowledge, it may also help to do a “docs jam session” with a group of data producers and consumers. (You might even offer a reward such as a gift card for those who enter the most documentation!)

Phase 3: Alpha launch to 5-20 alpha users.

  1. From the persona you decided on earlier, choose highly vocal users and those with the most tribal knowledge. These users will become Stemma’s avid supporters when you launch to a broader audience. (Note:Make sure these alpha users have the access they need to do work in Stemma (see SSO integration under What We Need From You) and that Stemma user names and authorities map properly to users of integrated applications such as Snowflake
  2. Incorporate feedback and iterate. Look particularly for feedback that indicates opportunities for productive metadata propagation and cross-referencing (for example, someone might say something like, “Oh, we already have this metadata in this spreadsheet — we should pull that in here, too.” )

Tasks your alpha users should become comfortable doing include:

  • Search – Search for tables and dashboards using a Page-Rank style search

720

  • See table details:
    • Description and status:
    • Issues (JIRAs)
    • Last updated
    • Data owner(s):
336

441

  • See column descriptions:

1296

  • Perform table actions:
    • Edit table status, description, issues, owners
    • Add/edit/remove tags
336

144

  • Show items commonly used with tables:

1008

  • See table lineage
  • Search for dashboards and see dashboard details:
    • Owners
    • Created time
    • Last updated time, etc.

308

  • See queries run by the dashboard, etc:

720

Phase 4: Beta launch to all users of the prioritized persona.

  1. Focus your beta launch on your chosen persona (data consumers, for example). Don’t blur the focus of your launch by opening it up to all personas. Those other personas will come on board at GA, after your alpha and beta users have thoroughly learned and exercised the product and can serve as its advocates and in-house experts.
  2. Graduate to GA if you can meet success targets.

Step 3. Achieve widespread adoption

Best practices for achieving the greatest adoption include:

  • Update Slack channel headers where people ask each other questions. Stemma’s Slack integration links these conversations to the catalog automatically.
  • Embed Stemma training in new hire training. Tagging data sets by domain (marketing, growth, etc.) can help new hires quickly onboard to their domains. Showcase Sremma as an entry point into your existing training, for example by having all technical new hires instrument a metric during onboarding.
  • Create linkages with other products. Create links between various data tools. For example, auto-populate a link between an Airflow DAG that populates a table and the table page in Stemma (and vice-versa). (See Sharing additional metadata). Another productive linkage is between the table page in the data catalog and a link to the code that is used to generate the table.
  • Showcase the catalog at a group or company meeting. Deliver a short five-minute demo at a meeting that includes your target users. Educate, answer questions, and thank your alpha users — this creates more awareness and provides an opportunity for everyone to learn.

Step 4. Measure success

1. Adoption

  • Track WAUs (Weekly Active Users).
    • Start with Weekly Active Users (WAUs) rather than Daily Active Users or Monthly Active Users. Common usage frequency is weekly, not daily or monthly.‍
      • Target Penetration rate: 80%. A great penetration rate is 80% of WAUs within your target persona.

2. Customer Satisfaction (CSAT)

  • Measure CSAT periodically.
    • Send out a survey every three to six months asking users to rate their experience with Stemma.

Other metrics that organizations often consider include documentation quality, search quality, etc. But it’s best to stick to the above core metrics at the outset. As your Stemma implementation matures and ingests more metadata over time, you can begin to broaden your criteria.

Migrate from existing Amundsen

Migrating from Amundsen involves three main steps:

  1. Migrating existing metadata from Amundsen to Stemma
  2. Ingesting metadata continuously
  3. Cut over

1 – Migrating existing metadata from Amundsen to Stemma

In this step, you take all the existing metadata from Amundsen and load it into Stemma. The metadata imported includes, but is not limited to:

  • Table and column descriptions
  • Tags
  • Ownership information

To import this metadata, upload a dump of your Amundsen neo4j metadata into Stemma:

  1. Bring the Amundsen storage offline.
  2. Use dump -> load, available on all neo4j installations, to dump and upload your data.

After import, your Stemma instance will look very similar to your existing Amundsen instance:512

But your data won’t be periodically updated until you enable periodic updates; see the next step.

2 – Ingesting metadata continuously

You need to configure Stemma to ingest metadata updates on an ongoing basis. Once this configuration is done, you won’t need Python databuilder jobs to ingest data.

Information ingested in this step includes:

  • Table and column names
  • Linked issues (JIRA tickets)
  • Frequent users
  • Lineage, if it exists

IIn Stemma’s Admin interface, provide credentials to your data sources to allow Stemma access to extract metadata on an ongoing basis:

374

Stemma will now do the work to “upsert” metadata updates, as they occur, into the data you uploaded from Amundsen in the previous step.

3 – Cut Over

At this point, Amundsen and Stemma are running side-by-side, and Stemma recommends sharing access to Stemma with your power users and getting their feedback to ensure all of their use-cases are supported. Usually one week of overlap to obtain feedback and an additional week to incorporate feedback, if applicable, is sufficient.

When all looks good, simply redirect the URL of your internal Amundsen to the Stemma URL.

If you have any feedback or questions about moving from Amundsen to Stemma integration, please reach out to [email protected]!

Using Stemma

Here you’ll find instructions on how to use all the features in Stemma.

Using the Stemma Bot in Slack

Stemma’s goal is to capture information about data no matter where it happens. Slack is increasingly where conversations about data exist, and Stemma has a super simple integration that allows you to pin Slack conversations to tables.

Here’s how it works. If someone asks a great question, or posts a great answer about a piece of data in Slack, you can open a thread and tag the Stemma bot:748

You can now go to the Table Detail page in the Stemma UI to see the linked conversations:1186

Where can you use the Slack bot?

You can use it in public or private channels. You just need to invite the bot into the channel. It cannot be used in direct messages.

Some other things to keep in mind about the integration:

  • You must respond in a thread
  • You need to give a precise definition. If a table is only present in one schema, “@Stemma table_name” will work just fine. But if it’s present in multiple schemas, you’ll need to use “@Stemma schema.table_name”

If you have any feedback or questions about the Slack integration, please reach out to [email protected]!

Messaging Users

Stemma provides a quick and easy way to send messages to specific groups of users on the basis of their relationship to a table (e.g. are they an owner in Stemma, have they queried the table in the prior three months, etc.). If you need to announce a future change, for example, or notify users when significant events occur, such as when a table has been deprecated, you can use Stemma messaging to ensure that the affected users will receive the notification.

🚧 Before you begin – enable email and Slack notifications

If you want to enable Slack notifications, make sure that you have completed the Slack Integration

Notification sources

There are two forms of notifications within Stemma:

  1. Table status changes: you can send a message to affected users when the status is changed, for example from Certified to Deprecated.
  2. Custom messages: You can create your own message and choose which groups of users and teams to send it to.

Table status changes

You can notify the owners of a table when the status changes.

📘 Note

This feature sends a notification to all owners of the resource. If you want to notify other selected user groups, you can use the Message functionality. Learn more about user groups.

Custom messages

You can compose a custom message to users and teams related to a table by clicking the Message button on the top-right corner of the Table detail page.

Select the user groups you want to notify and type your message in the box.

Notification types

There are three types of notifications you can send in Stemma, depending on the receivers and your personal preferences. If a member of a related group is a user, they can receive either email notifications or Slack private messages. If a member of a related group is a team (Slack channel), they will receive a message in Slack.

Slack notifications

After you have completed our Slack Integration, you can send and receive messages from the Stemma bot.

Messages in channels

If a member of a related group is a team, they will receive a message from the Stemma bot in the corresponding Slack channel.

📘 Note

Make sure you add the bot to all channels that need to receive messages.

Slack private messages

You can choose to receive notifications from Stemma as Slack private messages. To receive messages from the bot, you need to add the Stemma application to your Slack configuration.

Click **Browse apps** in the **Apps** section located on the bottom-left side of the screen
Click Browse apps in the Apps section located on the bottom-left side of the screen
Select the **Stemma** app from the list.
Select the Stemma app from the list.

📘 Note

To modify your preferred notification method, contact your Stemma representative.

User groups

Each group contains a list of users and teams (Slack channels). In Stemma there are four different types of user groups related to a table:

  • Owners of the table
  • Owners of dashboards using the table
  • Owners of derived tables
  • Frequent users of the table

Stemma will only show the options that apply to the given asset. For example, if a user is sending a custom message for a table that does not have any downstream lineage there will not be an option for Owners of derived tables.

Using Tableau with Stemma

Stemma works with Tableau to help answer common questions data analysts have, such as:

Does this dashboard already exist?

Stemma indexes Tableau dashboards along with their underlying tables, providing a one-stop search for all data assets. Stemma also tracks how many times a dashboard has been viewed, instantly highlighting which is most commonly used.1400

Is the right data being used?

Stemma automatically correlates your Tableau dashboards to the corresponding table asset by reconciling the table names extracted from the Tableau Metadata API with the existing table metadata Stemma has captured from your warehouse. This allows you to start your exploration by finding trustworthy reports and, in turn, using the lineage to discover which tables are useful to you.1388

You can also explore the table’s lineage upstream or downstream to get an end-to-end view of where the data came from.1391

How can I use the metrics in this dashboard?

Tableau provides Stemma with the SQL that it creates to generate each sheet, and Stemma indexes this SQL so you can easily find the code. You can copy and paste the code into your IDE to jump-start the SQL exploration.1600

Best Practices for Data Ownership

👍 The three types of ownership that work well

It works well to break down ownership of the most important date into three distinct roles:

  • Delivery owner — Ensures that this particular data gets delivered on time, on a promised delivery SLA. Usually, this is a data engineer or analytics engineer responsible for developing and maintaining the pipeline that produces this data.
  • Domain owner — What does this particular value in a field (or column) mean? When does this particular event get triggered? Usually, this is a product engineer who created the event or the analysts and data scientists who use this data most frequently and understand the physical reality that the data represents.
  • Policy owner — Ensures that this data gets used according to the classification and terms associated with it. Sometimes you acquire data from a source that should not be used for a certain category of use-cases. For example, YouTube is permitted to show ads to kids, but not permitted to personalize them. Therefore, the personalization data can’t be used if the subject is a child. The person making these calls is usually not an engineer or data scientist, but someone on the Policy or Privacy team at the company.

👎 Types of ownership that often don’t work as well

  • Overall Quality owner: In practice it’s often hard to find people able and willing to take ownership of the end-to-end quality of a data set. This is because data engineers don’t consider themselves owners of the data that’s produced from the upstream application and don’t want to be responsible for hunting down a website bug that impacts the data in the warehouse. Product engineers, on the other hand, don’t have enough context about how data gets joined and transformed downstream to own the final derived data artifact. This may change as decentralized data management (or data mesh) is deployed more broadly.
  • Shared ownership: This often does not work in practice, though it can work well if the owner group shares a good understanding of the different types of ownership and the group can efficiently redirect each question to the right person.

🏥 The data engineer’s role

In practice, the data engineer plays the role that a triage nurse would play in an Emergency Room. When an issue arises (like a patient arriving in the ER), the data engineer triages to see what’s going on. Sometimes it’s a problem the engineer can fix), so they fix it and resolve the issue (like the nurse treating an injury on the spot). In other cases, the engineer redirect the issue to the appropriate owner (as in a referral to a doctor or other health practitioner). In short, the issue determines who the owner should be.

Using Stemma

Use Stemma to assign and document ownership. Divide ownership as follows:612

Stemma makes it easy to assign table ownership broadly: see Assigning owners in bulk

Assigning Ownership of Tables

You can assign ownership of a single table to individuals or teams. You can also assign owners to a set of tables in bulk, adding all the owners in a single transaction.

Note: You can assign owners to tables in Stemma, but not to dashboards. Stemma ingests ownership of dashboards from the underlying BI tool(s) and you need to make updates and modifications in those tools.

The importance of ownership

Ownership is one of the fundamental pillars of a data catalog. It defines responsibility by individual users and teams, adding tremendous value to the data in the catalog, including:

  • Delegation of duties: Which users are responsible for keeping this table properly documented?
  • Compliance with processes and procedures: Has the right team updated the documentation for this table before the newest pull request is merged?
  • Reporting and analytics: Which teams have the highest percentage of their assets documented? The lowest?

For advice on assigning ownership, see Best Practices for Table Ownership

Assigning ownership to an individual table

You can assign ownership of a table either to individuals or teams:

  • Individuals are the people in your company who use Stemma. You can find individuals by their email address or a Slack handle.
  • Teams are represented in Stemma by Slack channels.

To assign one or more owners to an individual table, choose the table in the Stemma UI and click on Add Owner at the bottom of the left panel.

But Stemma also allows you to bulk-assign table ownership across the catalog. You can do this in a single transaction, immediately allowing your users to identify the assets that belong to them.

Assigning owners in bulk

You may think about ownership as a hierarchy, or your teams may typically own all of the tables within a single database or schema, or perhaps your analysts use Dbt to create their own workflows and create tables with specific naming conventions (e.g. marketing_salesmarketing_prospects). Regardless of how your company thinks about grouping assets, Stemma can help you to bulk-assign owners, using advanced search to find any of the assets you want to update and assigning the owners with a single command.
To assign ownership of tables in bulk, proceed as follows:

  1. From the home page of the Stemma UI, navigate to Advanced Search.
  2. At the top of the left panel, make sure only the Tables resource is checked (you can’t bulk-assign owners to dashboards).
  3. Optionally use the filters in the left panel to limit the search results to the set of tables to which you want to bulk assign ownership:
    • Stemma suggests you limit your search results using only the filters on the left; while you can also use the search bar at the top, doing so makes it more likely that tables that you don’t want included slip into the results
  4. Click on Bulk Edit at the bottom right of the page and choose Add Owners.
  5. In the dialog box, enter or select an email address or Slack channel and choose Add.
  6. To add more owners, repeat Step 5 for each new owner.
  7. When you are finished, click Add Owners and then Done.

Stemma automatically pulls in the teams you’ve defined as well as the list of users in your organization:

Note: When you assign ownership in bulk, this does not trigger automatic messages to the owners and users; that only happens when you change to assign ownership individually.

Displaying Data Lineage in the Stemma UI

Tables

  1. Navigate to the details page for the table.
  2. Click on Lineage graph in the menu bar at the top of the page.
  3. You will see something like this:

From here you can:

Adjust the view: zoom in, zoom out, center, and fit to the page:

then choose Table Details to Table Lineage

See the table’s details page in the Stemma UI and a new lineage graph for each upstream and downstream table: click on the arrow and then choose Table Details or Table Lineage:

See and highlight the lineage of each table:

Collapse tables into groups: first hover over the logo:

hover over the logo

Then click the stack:

click the stack

To expand the groups again:

First click the down arrow
Then click the stack

Search, for an exact match only, on table properties. (In this example, statewide_testing will match, but not  statewide testing):

Columns

  1. Navigate to the details page for the table.
  2. Click on Columns in the menu bar at the top of the page.
  3. Click on the up-arrow next to the column whose lineage you want to see.
    Stemma shows the top five upstream and downstream columns.

Dashboards

  1. Navigate to the details page for the dashboard.
  2. Click on Lineage graph in the menu bar at the top of the page.
  3. You will see something like this:

From here you can:

Adjust the view: zoom in, zoom out, center, and fit to the page.

See the tables used by the dashboard and the columns used from each table:

See and highlight the lineage of each table:

Show the table’s lineage in an individual graph and the table’s details page in the Stemma UI; click on the arrow and then choose Table Details or Table Lineage:

Search, for an exact match only, on table properties (for example, statewide_cases will match, but not  statewide cases).

Collapse tables into groups and expand the groups again (see Tables).

Using Data Lineage Graphs

Use Data Stemma’s Lineage Graphs to:

Some examples

  • An analytics engineer builds a table using Dbt; that table becomes popular and several dashboards are built on top of it. It becomes well trusted, and the logic becomes a standard for calculating certain metrics. But the table is missing several important entity IDs (foreign key columns) which makes it difficult to join to other tables. The analytics engineer and a data engineer work together to complete a “micro migration” to a new, more performant and reusable table.
  • A machine-learning model is used to predict churn for a customer. Each day after the model runs, users that are predicted to churn are identified in a table in Snowflake and then sent an email to re-engage them. The product team is launching the second major product and wants to measure churn on a per-product basis. The engineering team updates their data pipeline to add a new column to the output table that identifies the product that churn is being predicted for.
  • An analyst created several views to consolidate the common logic that the team were using to query tables in their dashboards. But the performance of the views has deteriorated as the amount of data has grown. The analyst and the engineer work together to create materialized versions and to update all of the dashboards using the prior views.

Use Data Lineage in Stemma to Identify Affected Assets

The challenge with such changes is that it is hard to accurately curate a list of all assets that will be affected. If you cannot be certain of what will need to be changed or what could break, you cannot do risk analysis or even make a time estimate. Stemma enables you to do the analysis and create the appropriate messaging.

The key question for this type of change is “What tables and dashboards will I need to update if I rename this view?”

Stemma provides a lineage graph for each table and dashboard. From here you can see who and what is utilizing your resources downstream, and how popular your assets are. Data owners can use this in conjunction with their own knowledge to build roadmaps and plan how data should evolve.

Use Stemma Messaging to Communicate Expected Changes

Everyone affected needs to know what is going to change and when. After building your list of affected assets and their users, use Stemma messaging to provide advance notice and instructions.

How Stemma Discovers Data Lineage

When you execute SQL against your warehouse, the database captures the raw SQL and saves it in a historical format that includes the time the SQL command was executed and the user who executed it. Stemma reads these SQL logs and parses all of the commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page.

In the case of table and column lineage, any change to a table– via an insert, update or delete from another table– will create the appropriate table and column lineage tracking.

Note

When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.

See also: How Stemma Captures Data: SQL Parsing Overview and Caveats

Sharing additional metadata via CSV files

Along with a direct connection to various data sources, Stemma can ingest data using extracts in various file formats. One of the most commonly used file formats is CSV.

Below are some of the CSV Formats Stemma ingests today (along with the sample files), with the column names and values we expect in each file. Note that you need the column header literals (for example,db_name, etc.) as well as the values.

Github Links

Used to provide a link to a Github source from Stemma’s Table detail page.

db_nameclusterschematable_namesourcesource_type
hivegoldtest_schematest_table1https://github.com/amundsen-io/amundsen/github

db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
source: the Github URL for the repository
source_type: static value, “github”

Airflow

Used to generate a link between a table and an Airflow task. Once the data is ingested, you will see an “Airflow” button on the Stemma detail page for the table.

task_iddag_idexec_dateapplication_url_templatedb_nameschematable_namecluster
hive.test_schema.test_table1event_test2018-05-31T00:00:00https://airflow_host.net/admin/airflow/tree?dag_id=SUPER_AWESOME_DAGhivetest_schematest_table1gold

task_id:: the Airflow ID for the task
dag_id: the Airflow ID for the DAG
exec_date: the most recent execution timestamp for the task
application_url_template: the URL for the DAG in your Airflow instance
db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
cluster: the name of the database within the data source.

Dagster

Used to link a Dagstr op and/or job to a table in Stemma. Once the data is ingested, you will see an “Dagstr” button on the Stemma detail page for the table.

application_url_templatedb_nameclusterschematable_name
https://dagster_host.net/dagster/tree?job=dagster_jobsnowflakeprodtest_schematest_table1

application_url_template: the URL for the DAG in your dragster instance
db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.

Dbt Cloud

Used to link a Dbt Cloud job to a table in Stemma. Once the data is ingested, you will see a “Dbt” button on the Stemma detail page for the table.

application_url_templatedb_nameclusterschematable_name
https://dbtcloud_host.net/dbtcloud/tree?job=dbtcloud_jobsnowflakeprodtest_schematest_table1

application_url_template: the URL for the DAG in your Dbt Cloud
db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.

Table Tags

Stemma can create and assign tags via a CSV file, with the following format:

db_nameclusterschematable_nametags
hivegoldtest_schematest_table1“tag1,tag2”
hivegoldtest_schema_2test_table2“tag1,tag3”

db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
tags: the tags to be created. Tags are additive: existing tags on tables will not be modified.

Table Description

db_nameclusterschematable_namedescription
hivegoldtest_schematest_table1Description for table, optionally markdown.

db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
description: the table description. Note that descriptions are destructive: if a description already exists, it will be overwritten.

Table Owners

db_nameclusterschematable_nameowners
hivegoldtest_schematest_table1[email protected],[email protected]
hivegoldtest_schema_2test_table2“#test-channel,#test-channel-2”

db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
owners: email addresses or slack channels. Owners are additive: existing owners of tables will not be modified.

Column Descriptions

db_nameclusterschematable_namecol_namedescription
hivegoldtest_schematest_tabletest_columnThis is an example column description.

db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
col-name: the column described.
description: the column description. Note that column descriptions are destructive: if an existing description is present, it will be overwritten.

Finding Important Information about Tables and Columns

Stemma provides out-of-the-box integration with applications such as Snowflakedbt, and dashboarding systems such as Looker to capture rich metadata from each of these systems automatically. For example, in the case of Snowflake, Stemma reads both the information_schema and the query logs, and automatically generates important information such as when a table was last updated, table-to-table lineage, table-to-dashboard lineage, frequent queriers of the table, common join and filter conditions, etc.

The Stemma UI makes it easy for you to discover this important information about your tables and columns, and also provides a convenient way to add and modify curated information.

The table details page provides following information:

Table attributes

Asset status:

This can be CertifiedDeprecatedIntermediate, or No Status:

336

Description:

A brief summary of the contents of the table:

338

See Advanced Description Editing for more information.

Issues:

JIRA issues logged against this table:

441

This section explains how to link a JIRA issue to the Stemma catalog.

Owners:

The person or people responsible for maintaining this table:

441

Click on an owner to see other assets owned by that user, as well as their bookmarks and frequently used tables. You can also email the user via this profile page.

See also Assigning Bulk Ownership of Tables.

Last Updated:

The date and time of the last update to the table (this can be different from the most recent date covered by the data itself):426

Date Range:

The date range covered by the data in the table.

173

Note that this need not be synchronous with the Last updated date and time, which simply reflects the date of the most recent modification to the table.

Tags:

Descriptive tags for cross-referencing, such as covidcases, etc.:

187

Frequent Users:

People who have frequently used this table:

288

Click on a user to see assets owned by that user, as well as their bookmarks and other frequently used tables. You can also email the user via this profile page. See also How Does Stemma Calculate “Top 5 Columns”, “Frequent Users”, etc?

Commonly Used With

Links to tables often used with this one (see Using Commonly Used With). See also How Does Stemma Calculate “Top 5 Columns”, “Frequent Users”, etc?

288

Table lineage graph:

See Displaying Data Lineage in the Stemma UI :

Dashboards:

Dashboards that use the table:864

See also Finding Important Information about Dashboards.

Upstream, Downstream:

Upstream and downstream tables:

:576576

See also Displaying Data Lineage in the Stemma UI

Slack Conversations:

Slack threads that mention this table:864

Column attributes

Column description:

Brief description of the column contents:864

See Advanced Description Editing for more information.

Data type:

Text or Number:

750

Column Statistics:

For example, minimum and maximum values over a period:720

Shared descriptions:

See Sharing Column Descroptions.

Column usage and lineage:

The most-used upstream and downstream columns, as measured by the number of queries, dashboard views, catalog views, etc.:

288

See also Tracking Data Lineage and How Does Stemma Calculate “Top 5 Columns”, “Frequent Users”, etc?.

Curated versus automated metadata

Where possible, Stemma captures the information directly from the source. This table shows which attributes are populated from the source and which you can edit:.

Table MetadataAutomatically captured?Users can edit?
Table description.
Asset status.
Owners✅ (dbt)
Tags✅ (dbt)
Jira issues
Last updated time
Data date range
Commonly used with tables
Frequent users
Upstream & downstream table lineage
Dashboards that use the table
Slack threads with table mentions

Notes

Editing curated items

To add or edit a curated item, hover you cursor over over it until you see a pencil; then click on the pencil and add or edit text in the resulting box.389389

Using Commonly Used With

Clicking on a link below this heading opens a dialog window. This window provides details of the usage; for example:1008

Advanced Description Editing

The Stemma UI provides a WYSIWYG editor for descriptions of tables and columns.

Using the editor

  1. To begin editing, navigate to the table or column whose description you want to edit, and click on the pencil icon next to Description:

This opens up the text box for editing:

  1. Add or change the description.
    • Descriptions can include links and images as well as text.
    • You can choose font styles as indicated by the buttons.
    • Use the pull-down menu to toggle the text format from the default (Normal) to common formatting elements such as headings, lists, quotes, and code blocks.
      • When you choose Code Block, you can choose the type of code from the secondary pull-down menu (the default is JavaScript).
    • You can add hyperlinks: enter and highlight the text to display, then click the 🔗button and enter the URL (and edit it if necessary) in the text box next the button.
    • You can also add links to other tables.
    • The editor also supports Markdown syntax, as well as any shortcuts your system supports (such as ⌘-b for bold on a Mac)
  2. When you are finished, click Update to save your changes.

Linking to other tables from the description

This is useful, for example, if you want to give other users more information about the relationship between the tables.

To link to another table from the description, do the following:

  1. In the Stemma UI, navigate to the table whose description you want to edit.
  2. Click on the pencil icon next to Description.
  1. Enter the new desription providing the related table’s<schema_name>.<table_name> anywhere in the text.
  1. Click Update to save your changes.

The new description will contain the link to the related table:

Sharing column descriptions

You can share a column description with all columns of the same name across the catalog. To do this:

  1. In the Stemma UI, navigate to the table whose column description you want to share.
  2. To add or modify a description, click on the down-arrow to the left of the entry for the column:
  1. Hover over the word Description and click on the pen that appears:
  1. Add or update the description and click on Update. Stemma automatically discovers all columns with the same name, and the description propagates to all linked columns. To see which columns are linked, and add and remove links, click on the three dots to the right of the column entry:

Click on the Edit linked columns pop-up. In the dialog window, you can:search for and add columns to link, and remove existing links.

When you have finished, click Done.

Finding Important Information about Dashboards

Stemma provides out-of-the-box integration with dashboarding systems such as Looker to capture rich metadata from these systems automatically.

The Stemma UI makes it easy for you to discover this important information about your dashboards, and provides a convenient way to add and modify curated information.

Status:

This can be Certified, Deprecated, Intermediate, or No Status:

288

Description:

A brief summary of the contents of the dashboard:

360

Owners:

The person or people responsible for maintaining this table:

288

This information is extracted automatically from the source and (unlike table ownership) can’t be edited in Stemma.

Tags:

Descriptive tags for cross-referencing, such as Covid, cases, etc.:

288

Created:

The date and time the dashboard was created:

360

Last successful run:

Date and time the dashboard was last run successfully:

360

Last Updated:

Date and time of the last update to the dashboard:

360

Last run:

Date and time the dashboard was last run (successfully or not):

360

Total View Count:

Total number of views over the life of the dashboard:

360

Tables:

Tables used by the dashboard:720

Charts:

Charts available from the dashboard:

360

Queries:

Queries run from the dashboard. Click on a query name to see the code:432

Open folder:

Open the dashboard’s folder in the underlying application (you may need to log in to the application).

Open dashboard:

Open the dashboard in the underlying application (you may need to log in to the application).

Using Admin Functions and Reports

API Keys

Allows you to add and delete keys for the Graphql API.

Connections

Allows you to add, modify and delete connections to Stemma. For instructions on adding connections, see Integrating Your Applications with Stemma: What We Need from You. See also How Do I Update an Existing Connection?

Reports

These reports provide an account of your organization’s interactions with the Stemma catalog, including curating work that needs to be done, as well as usage.They can be especially useful when you are rolling our Stemma to your users and scoping the work remaining to do.

  • Customer Summary
    Shows the number of data sources, tables, and dashboards that are integrated
    into the Stemma catalog.
  • Monthly product usage
    Shows the number of users and searches by month.
  • Documentation summary
    Shows the percentage of columns, dashboards, and tables missing documentation and/or ownership.
  • Assets without docs
    Shows the specific tables and dashboards that lack documentation. Click on the drop-down to switch between the two types of asset

864

  • Assets without ownership
    Shows the specific tables and dashboards that lack ownership (click on the drop-down to switch between the two types of asset).

288

How Stemma Captures Data: SQL Parsing Overview and Caveats

What Stemma captures

When you execute SQL against your warehouse, the database captures the raw SQL that was executed and saves it in a historical format that includes the time that the SQL command was executed as well as the user who executed it. Stemma reads these SQL logs and parses all of the SQL commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page:

  • Table and column lineage: any change to a table– via an insert, update or delete from another table– will create the appropriate table and column lineage tracking.
  • User table usage: Stemma uses select statements to associate users to the table(s) being queried; this provides the Frequent Users information on the Table Details page, search rankings, and the user’s profile page.
  • Commonly Used With: any joins that occur within the SQL, whether on the top-level query or as part of a sub-query, are captured and presented on the Table Details page to help other users understand how two or more tables are commonly joined.

Caveats:

  • When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.
  • When user IDs in the database do not match the user portion of the user’s email address as recorded in Stemma, Stemma cannot associate the Stemma user to the user in the query logs. This prevents Stemma from creating any user-table usage information.
  • When system accounts are used to query the data, Stemma is often not able to see which user is executing the query. This is a common practice with BI tools when multiple team members share the same connection to a database and that connection uses a system account.
    • However, if you are using Mode, Stemma can parse the comments appended by Mode (which do contain the user ID) to find the user and associate that user to the query.
  • Commonly Used With does not capture joins where the left or right table is a dynamically generated select statement, for example: select * from table_a a join (select * from table_b b1 join table_c c1 on b1.id = c1.id) b on a.id = b.id.
  • Stemma currently supports Snowflake out of the box. Stemma can also parse RedshiftAthena, Postgres and other databases, but does not provide turnkey support for those databases at present.

Planning Changes to Tables or Columns

When you are planning to make a backward-incompatible change, you need to proceed in a way that minimizes the impact and gives downstream stakeholders as much warning as possible. Examples of backward-incompatible changes are:

  • Changing a schema (type changed, name changed)
  • Changing values (data fields will no longer be populated, or will be populated differently)

You need to understand and communicate the impact before you make such a change. This is usually done in the following steps:

  1. Understand the impact: understand who uses or queries this data.
  2. Provide heads-up notification: notify downstream owners and users that this data is going to change, and that they need to update their assets (downstream tables or dashboards) accordingly. Provide a time window after which the change will go into effect.
  3. (ideally): Understand the impact again: see if downstream users have changed their assets; how many still need to be nudged, etc.?
  4. Make the actual change and follow up to make sure that it’s complete.

Stemma can help you make the change as painless as possible. Use the UI to assess the effect on downstream users and on other assets. The following sections provide help:

If you are deprecating a table:

  • Change the description to warn users about the impending change.
  • Change the status toDeprecated.

Now warn all owners and frequent users about the change, and give them a deadline to make their adjustments.

Note

If you decide to change the name of a table to indicate that it’s deprecated (e.g., to something like deprecated_tablename), that will cause it, by default, to be removed from Stemma after ten days.

Using the External GraphQL API

Creating and Installing an API Key for the External Graphql API

1. Create an API key in the Stemma UI:

a. Navigate to the Admin panel and choose Add API Key:

432

b. In the pop-up window, enter a name for your API key and click Submit. You can choose any name, but Stemma recommends using a name that describes the intended use of the key (e.g., github action integration):432

Stemma generates an API key that appears in the window. Save this key somewhere secure. It is not recoverable.

2. Install the Altair Graphql Client as one of these:

3. Add your API key to Altair as an Authorization header.

864

It must be in the form Bearer <api key> with a space between the word Bearerand your API key.

4. Enter your URL and HTTP method:

  • URL = /graphql
    E.g., https://preview.stemma.ai/graphql
  • HTTP method = POST

Now click Docs in the upper right of the Altair UI; you should see the graphql schema on the right hand side:864

Frequently Asked Questions (FAQs)

What Information Does Stemma Extract Automatically?

The table below shows what information you can expect to see in Stemma after you have integrated all of your applications.

SourceMetadataUser UsageLineageQuery UsageColumn Stats
Snowflake
Redshift
BigQuery
Postgres
Athena
Delta Lake
  • Metadata: tables and columns, descriptions. Where available, the time that the table was last updated. See also Sharing Additional Metadata.
  • User usage: which users are using which tables in the database.
  • Lineage: how data moves between tables, views, and dashboards. See also Tracking Data Lineage.
  • Query usage: how tables are joined; common patterns for using tables together.
  • Column stats: statistics (mean, min, max, null count, etc.) for columns in a table.

For more information, see:

How Does Stemma Discover Data Lineage?

When you execute SQL against your warehouse, the database captures the raw SQL and saves it in a historical format that includes the time the SQL command was executed and the user who executed it. Stemma reads these SQL logs and parses all of the commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page.

In the case of table and column lineage, any change to a table– via an insert, update or delete from another table– will create the appropriate table and column lineage tracking.

Note

When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.

See also:

Does Stemma Support User Roles?

Apart from Administrator (see What Can an Administrator Do?), Stemma does not at present support a variety ofuser roles. This means that every user has edit capability; that fosters adoption and comprehensive use, allowing the broad community to get familiar with Stemma and learn its full range of capabilities.

As our user base becomes becomes more expert with Stemma and starts to specialize, we expect to introduce user roles.

What Can an Administrator Do?

To become a Stemma administrator, or nominate one, contact Stemma support ([email protected]).

A Stemma Administrator can:

For more information, see Using Admin Functions and Reports.

How Can I Tell What Stemma Automates and What I Need to Curate?

Automated documentation


Stemma provides out-of-the-box integration with applications such as Snowflakedbt, and dashboarding systems such as Looker to capture rich metadata from each of these systems automatically. For example, in the case of Snowflake, Stemma reads both the information_schema and the query logs, and automatically generates important information such as when a table was last updated, table-to-table lineage, table-to-dashboard lineage, frequent queriers of the table, common join and filter conditions, etc.

The Stemma UI makes it easy for you to discover this important information about your tables and columns, and also provides a convenient way to add and modify curated information.

Stemma also allows linking of existing Slack conversations about data to the data catalog.

Curated documentation

Items you need to curate include editable descriptions, tags, and owners. All of these can also be modified by APIs. APIs can also ingest custom metadata.

In the Stemma UI, items that you can curate will show a pencil when you hover your cursor over them; for example:389

More information here.

How Do I Add a User?

At present you need to contact Stemma support ([email protected]) with the names of the users you want to add. Stemma will add them for you, and you then need to authenticate them, making sure they have the access they need to do work in Stemma (see the sections on integrating SSO apps listed here) and that Stemma user names and authorities map properly to users of integrated applications such as Snowflake.

How Do I Update an Existing Connection?

Updating a Connection

At present you need to add a new connection and delete the old one. You need Stemma’s help to do this.

First add a new connection as follows:

  1. In the Stemma UI, choose Admin from the toolbar at the top.
  2. Click the button for Add a New Connection.
  • Do not try to edit the existing connection.
  1. In the pop-up window, choose the connection type and provide the information as prompted. See What We Need From You.
  2. Click Submit.
  • This creates a new connection that is not yet active.

Now, contact Stemma support ([email protected]) with a request to activate the new connection. Stemma will let you know when the connection is active.

Once you have verified that the new connection is active and working properly, you can delete the old connection:

  1. In the Stemma UI, choose Admin from the toolbar at the top.
  2. Click the Delete button next to the connection you want to delete, and in the pop-up window confirm that you want to delete it.

What Effect Does Deprecating a Table Have?

You can set the status of a table or dashboard as deprecated or intermediate. When you do so, that asset remains in the Stemma catalog, but is removed from default searches (you can change this by means of filters in Advanced Search).

When you deprecate an asset or change its status to intermediate, you should modify its description accordingly, and you should also be careful to message all users and owners of the asset (including users who might be using the asset indirectly, such as dashboard users of a table) to warn them about the change in status. This allows you to make sure you hear from all stakeholders before you remove or move the asset.

How Does Stemma Calculate "Top 5 Columns", "Frequent Users", etc?

The table details page in the Stemma UI provides usage information, derived as follows:

  • To calculate the top five upstream and downstream columns, Stemma multiplies the number of unique users by the number of queries to get the ranking. The time frame is 30 days.
  • To identify tables commonly usesd with this one, Stemma counts the joins that were part of any query in the past 90 days.
  • To identify frequent users of a table, Stemma counts any user who has run a query on this table in the last 30 days.

Getting Started with the Stemma Business Glossary

What Is a Business Glossary?

A business glossary is a repository of business terms and their definitions that an organization uses to ensure the same definitions are used company-wide when analyzing data. It allows data analysts, business users, and data engineers to agree on the definition of real-world terms, and how those are represented in the data; and serves as the source of truth for those definitions

Used this way, the glossary produces a common vocabulary that everyone in the organization uses. This common vocabulary is a key component of data governance, ensuring that everyone means the same thing when they use the same term. For instance, one department may define the term “customer” as a company while another department may use the term to refer to an individual. A business glossary prevents such discrepancies.

In short, a business glossary provides canonical definitions of common terms and fosters shared understanding of those terms throughout the organization.

The Stemma Business Glossary

This first version of the Stemma Business Glossary allows you to define and group terms that are used in your organization and assign owners to those terms so that the definitions remain consistent and up to date. These terms can be specific to the data sources in the Stemma catalog, or they can be terms that are used throughout the organization or any part of it.

You can also link those terms to table columns in the Stemma catalog.

In future, you will also be able to link terms to other resources in the catalog, and to queries and search results.

Using the Glossary

Populating the Glossary

To begin populating the glossary, proceed as follows.

From the navigation bar of the Stemma UI, choose Glossary:

Choose + New Group or + New Term.

In the dialog window, add a name and a definition or description.

You can use Markdown syntax for the description or definition.
– A group is a collection of related terms. For example, you might want to group terms by department or function– finance, marketing, etc.
– You can add a term to an existing group, or create a new term without assigning it to a group. (You can’t create a group on the fly while adding or editing a term).

– You can add related columns and one or more owners: start typing to find candidates.

When you are finished, click Save.

Updating the Glossary

To update the glossary:

From the navigation bar of the Stemma UI, choose Glossary.

To modify a group, click on the group and then click Edit. You can change the group’s attributes and also add new terms to it.

To modify a term, search for it by name, or click the > next to the group and select the term from the list;. then either:-

  • Click Edit and make your changes to any or all attributes of the term. You can change the group the term belongs to, so long as the new group already exists.

OR

  • Click Delete to remove the term from the glossary.

When you are finished, click Save.

Azure Active Directory Integration

Create an Azure Application

To create an application, follow these steps:

  1. Sign in to Azure portal.
  2. Click Azure Active Directory.
  3. On the Azure Active Directory page, click Enterprise applications.
  4. Click New application on the top menu.
  1. Click Create your own application.
  1. Enter a name for your application, we suggest “Stemma Data Catalog“.
  2. Select Register an application to integrate with Azure AD (App you’re developing).
  1. Click Create. This should redirect you to Register an application.
  2. Under Supported account types, select Accounts in this organizational directory only (Single tenant).
  3. Under Redirect URI (optional), select Web. If you are running a hosted version of Stemma, the redirect URI should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name. In case of on-premise deployment, please consult your Stemma representative for this.
  1. Click Register.

Create a client secret

To create a client secret, follow these steps:

  1. Go to Azure Active Directory and click App registrations.You must go to the top level in your Azure portal and then select Azure Active Directory. You can then select App registrations.
  1. Select your newly added application. Make a note of the Application ID.
  1. Click Certificates & secrets on the left-hand side and click New client secret.
  1. Provide a description and an expiration length that follows your security organization’s guidelines.
  2. Click Add.
  1. Copy the Value corresponding to the newly-created client secret and client secret expiration.
    This is the Relying Party OAuth Client Secret. This secret value is never displayed again after you leave this page.

Set up application permissions

You need to set up two application permissions in your Azure portal so that you can use all of the Azure AD synchronization options in Stemma. You need to set up the following permissions:

  • Azure Active Directory Graph Directory.Read.All
  • Microsoft Graph Directory.Read.All

To set up permissions, do as follows:

  1. Go to Azure Active Directory and click App registrations.You must go to the top level in your Azure portal and then select Azure Active Directory. You can then select App registrations.
  1. Select your newly added application.
  2. Click API permissions on the left-hand side and click Add a permission.
  1. Click APIs my organization uses and click Windows Azure Active Directory.
  1. Select the Azure Active Directory Graph > Directory.Read.All permission.
    a. Click Application permissions.
    b. Under Directory, click Directory.Read.All.
    c. Click Add permissions.
  1. Select the Microsoft Graph > Directory.Read.All permission.
    a. Click Application permissions.
    b. Under Request API Permission, click Microsoft Graph.
    c. Under What type of permissions does your application require?, click Application permissions.
    d. Under Directory, click Directory.Read.All.
    e. Click Add permissions.
  1. Click Grant admin consent for and then click Yes. You should see a message saying that you’ve granted consent for each of your permissions.

What We Need From You

Provide Stemma with the following information by email or Slack:

  • Application ID
  • Client Secret
  • Discovery Document Endpoint (Well Known OpenID Configurations)
  • Tenant ID – The ID of the tenant you created the application in.

Guidelines for a Successful Stemma Rollout

Every Stemma rollout is different, depending on your organization’s particular needs, your data storage and management applications, and the people who use them in various ways and with varying priorities. So the following are guidelines, not rules, but they are based on the accumulated experience of Stemma’s founders and engineers, and are a good place for you to start.

Criteria for Success

The two criteria that matter in measuring the success of a data catalog are

  • adoption; and
  • customer satisfaction

If your rollout achieves both of these in a high degree, you can declare victory in the knowledge that you have improved the lives– and in particular the productivity, efficiency, and awareness– of your users and are contributing significantly to your organization’s success.

The following broad steps are designed to help you get there.

Steps for Rollout and Adoption

Step 1. Start with a persona and its use-cases

There are many user personas and use cases for a data catalog. Here’s a simplified view of the most common ones. It’s less important which persona you start with than that you choose a specific group of alpha users.512

Step 2. Launch in phases

Phase 1: Identify a small set of tables to get alpha user feedback on.

This set can be your organization’s most commonly used tables (often referred to as “core” tables) or one domain within the company, such as marketing, growth or finance, etc. Core tables are often the best choice, because they have the most impact and there’s often a central data team which is responsible for maintaining them.

Phase 2: Populate MVP metadata on these tables.

  • Stemma can ingest the majority of this metadata automatically (see What Stemma Enables You To Do) leaving you and your users to curate only certain important items such as ownership, table descriptions and table status
    .* Where you must, to glean tribal knowledge, it may also help to do a “docs jam session” with a group of data producers and consumers. (You might even offer a reward such as a gift card for those who enter the most documentation!)

Phase 3: Alpha launch to 5-20 alpha users.

  1. From the persona you decided on earlier, choose highly vocal users and those with the most tribal knowledge. These users will become Stemma’s avid supporters when you launch to a broader audience. (Note:Make sure these alpha users have the access they need to do work in Stemma (see SSO integration under What We Need From You) and that Stemma user names and authorities map properly to users of integrated applications such as Snowflake
  2. Incorporate feedback and iterate. Look particularly for feedback that indicates opportunities for productive metadata propagation and cross-referencing (for example, someone might say something like, “Oh, we already have this metadata in this spreadsheet — we should pull that in here, too.” )

Tasks your alpha users should become comfortable doing include:

  • Search – Search for tables and dashboards using a Page-Rank style search

720

  • See table details:
    • Description and status:
    • Issues (JIRAs)
    • Last updated
    • Data owner(s):
336

441

  • See column descriptions:

1296

  • Perform table actions:
    • Edit table status, description, issues, owners
    • Add/edit/remove tags
336

144

  • Show items commonly used with tables:

1008

  • See table lineage
  • Search for dashboards and see dashboard details:
    • Owners
    • Created time
    • Last updated time, etc.

308

  • See queries run by the dashboard, etc:

720

Phase 4: Beta launch to all users of the prioritized persona.

  1. Focus your beta launch on your chosen persona (data consumers, for example). Don’t blur the focus of your launch by opening it up to all personas. Those other personas will come on board at GA, after your alpha and beta users have thoroughly learned and exercised the product and can serve as its advocates and in-house experts.
  2. Graduate to GA if you can meet success targets.

Step 3. Achieve widespread adoption

Best practices for achieving the greatest adoption include:

  • Update Slack channel headers where people ask each other questions. Stemma’s Slack integration links these conversations to the catalog automatically.
  • Embed Stemma training in new hire training. Tagging data sets by domain (marketing, growth, etc.) can help new hires quickly onboard to their domains. Showcase Sremma as an entry point into your existing training, for example by having all technical new hires instrument a metric during onboarding.
  • Create linkages with other products. Create links between various data tools. For example, auto-populate a link between an Airflow DAG that populates a table and the table page in Stemma (and vice-versa). (See Sharing additional metadata). Another productive linkage is between the table page in the data catalog and a link to the code that is used to generate the table.
  • Showcase the catalog at a group or company meeting. Deliver a short five-minute demo at a meeting that includes your target users. Educate, answer questions, and thank your alpha users — this creates more awareness and provides an opportunity for everyone to learn.

Step 4. Measure success

1. Adoption

  • Track WAUs (Weekly Active Users).
    • Start with Weekly Active Users (WAUs) rather than Daily Active Users or Monthly Active Users. Common usage frequency is weekly, not daily or monthly.‍
      • Target Penetration rate: 80%. A great penetration rate is 80% of WAUs within your target persona.

2. Customer Satisfaction (CSAT)

  • Measure CSAT periodically.
    • Send out a survey every three to six months asking users to rate their experience with Stemma.

Other metrics that organizations often consider include documentation quality, search quality, etc. But it’s best to stick to the above core metrics at the outset. As your Stemma implementation matures and ingests more metadata over time, you can begin to broaden your criteria.

Can I Choose What Data Stemma Ingests?

If you don’t want Stemma to ingest all the available data in the applications you integrate (for example if you want to ingest just one schenma from a data lake), you can tell Stemma what to ignore: Contact Stemma by email ([email protected]) or talk to your support rep.

Messaging Users

Stemma provides a quick and easy way to send messages to specific groups of users on the basis of their relationship to a table (e.g. are they an owner in Stemma, have they queried the table in the prior three months, etc.). If you need to announce a future change, for example, or notify users when significant events occur, such as when a table has been deprecated, you can use Stemma messaging to ensure that the affected users will receive the notification.

🚧 Before you begin – enable email and Slack notifications

If you want to enable Slack notifications, make sure that you have completed the Slack Integration

Notification sources

There are two forms of notifications within Stemma:

  1. Table status changes: you can send a message to affected users when the status is changed, for example from Certified to Deprecated.
  2. Custom messages: You can create your own message and choose which groups of users and teams to send it to.

Table status changes

You can notify the owners of a table when the status changes.

📘 Note

This feature sends a notification to all owners of the resource. If you want to notify other selected user groups, you can use the Message functionality. Learn more about user groups.

Custom messages

You can compose a custom message to users and teams related to a table by clicking the Message button on the top-right corner of the Table detail page.

Select the user groups you want to notify and type your message in the box.

Notification types

There are three types of notifications you can send in Stemma, depending on the receivers and your personal preferences. If a member of a related group is a user, they can receive either email notifications or Slack private messages. If a member of a related group is a team (Slack channel), they will receive a message in Slack.

Slack notifications

After you have completed our Slack Integration, you can send and receive messages from the Stemma bot.

Messages in channels

If a member of a related group is a team, they will receive a message from the Stemma bot in the corresponding Slack channel.

📘 Note

Make sure you add the bot to all channels that need to receive messages.

Slack private messages

You can choose to receive notifications from Stemma as Slack private messages. To receive messages from the bot, you need to add the Stemma application to your Slack configuration.

Click **Browse apps** in the **Apps** section located on the bottom-left side of the screen
Click Browse apps in the Apps section located on the bottom-left side of the screen
Select the **Stemma** app from the list.
Select the Stemma app from the list.

📘 Note

To modify your preferred notification method, contact your Stemma representative.

User groups

Each group contains a list of users and teams (Slack channels). In Stemma there are four different types of user groups related to a table:

  • Owners of the table
  • Owners of dashboards using the table
  • Owners of derived tables
  • Frequent users of the table

Stemma will only show the options that apply to the given asset. For example, if a user is sending a custom message for a table that does not have any downstream lineage there will not be an option for Owners of derived tables.

Frequently Asked Questions (FAQs)

What Information Does Stemma Extract Automatically?

The table below shows what information you can expect to see in Stemma after you have integrated all of your applications.

SourceMetadataUser UsageLineageQuery UsageColumn Stats
Snowflake
Redshift
BigQuery
Postgres
Athena
Delta Lake
  • Metadata: tables and columns, descriptions. Where available, the time that the table was last updated. See also Sharing Additional Metadata.
  • User usage: which users are using which tables in the database.
  • Lineage: how data moves between tables, views, and dashboards. See also Tracking Data Lineage.
  • Query usage: how tables are joined; common patterns for using tables together.
  • Column stats: statistics (mean, min, max, null count, etc.) for columns in a table.

For more information, see:

Can I Choose What Data Stemma Ingests?

If you don’t want Stemma to ingest all the available data in the applications you integrate (for example if you want to ingest just one schenma from a data lake), you can tell Stemma what to ignore: Contact Stemma by email ([email protected]) or talk to your support rep.

How Does Stemma Discover Data Lineage?

When you execute SQL against your warehouse, the database captures the raw SQL and saves it in a historical format that includes the time the SQL command was executed and the user who executed it. Stemma reads these SQL logs and parses all of the commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page.

In the case of table and column lineage, any change to a table– via an insert, update or delete from another table– will create the appropriate table and column lineage tracking.

Note

When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.

See also:

Does Stemma Support User Roles?

Apart from Administrator (see What Can an Administrator Do?), Stemma does not at present support a variety ofuser roles. This means that every user has edit capability; that fosters adoption and comprehensive use, allowing the broad community to get familiar with Stemma and learn its full range of capabilities.

As our user base becomes becomes more expert with Stemma and starts to specialize, we expect to introduce user roles.

What Can an Administrator Do?

To become a Stemma administrator, or nominate one, contact Stemma support ([email protected]).

A Stemma Administrator can:

For more information, see Using Admin Functions and Reports.

How Can I Tell What Stemma Automates and What I Need to Curate?

Automated documentation


Stemma provides out-of-the-box integration with applications such as Snowflakedbt, and dashboarding systems such as Looker to capture rich metadata from each of these systems automatically. For example, in the case of Snowflake, Stemma reads both the information_schema and the query logs, and automatically generates important information such as when a table was last updated, table-to-table lineage, table-to-dashboard lineage, frequent queriers of the table, common join and filter conditions, etc.

The Stemma UI makes it easy for you to discover this important information about your tables and columns, and also provides a convenient way to add and modify curated information.

Stemma also allows linking of existing Slack conversations about data to the data catalog.

Curated documentation

Items you need to curate include editable descriptions, tags, and owners. All of these can also be modified by APIs. APIs can also ingest custom metadata.

In the Stemma UI, items that you can curate will show a pencil when you hover your cursor over them; for example:389

More information here.

If there's a Curated and an Ingested Description, Which Will I See in Stemma?

If Stemma ingests the description of a table or column (for example, from Snowflake) and you subsequently edit or replace that description in Stemma, Stemma users will see that curated description from then on, even if Stemma subsequently re-ingests the original Snowflake description (or any other version of it).

At present, Stemma does not provide a way for you to see both the Stemma description and the ingested description side by side.

How Do I Add a User?

At present you need to contact Stemma support ([email protected]) with the names of the users you want to add. Stemma will add them for you, and you then need to authenticate them, making sure they have the access they need to do work in Stemma (see the sections on integrating SSO apps listed here) and that Stemma user names and authorities map properly to users of integrated applications such as Snowflake.

How Do I Update an Existing Connection?

Updating a Connection

At present you need to add a new connection and delete the old one. You need Stemma’s help to do this.

First add a new connection as follows:

  1. In the Stemma UI, choose Admin from the toolbar at the top.
  2. Click the button for Add a New Connection.
  • Do not try to edit the existing connection.
  1. In the pop-up window, choose the connection type and provide the information as prompted. See What We Need From You.
  2. Click Submit.
  • This creates a new connection that is not yet active.

Now, contact Stemma support ([email protected]) with a request to activate the new connection. Stemma will let you know when the connection is active.

Once you have verified that the new connection is active and working properly, you can delete the old connection:

  1. In the Stemma UI, choose Admin from the toolbar at the top.
  2. Click the Delete button next to the connection you want to delete, and in the pop-up window confirm that you want to delete it.

What Effect Does Deprecating a Table Have?

You can set the status of a table or dashboard as deprecated or intermediate. When you do so, that asset remains in the Stemma catalog, but is removed from default searches (you can change this by means of filters in Advanced Search).

When you deprecate an asset or change its status to intermediate, you should modify its description accordingly, and you should also be careful to message all users and owners of the asset (including users who might be using the asset indirectly, such as dashboard users of a table) to warn them about the change in status. This allows you to make sure you hear from all stakeholders before you remove or move the asset.

What Happens when a Table or Dashboard Is Removed?

Removing a table or dashboard causes it, by default, to be removed from Stemma after ten days. Similarly, if the table or dashboard is renamed or moved, the original entry is removed from Stemma after ten days.

Before removing an asset, you should deprecate it to give stakeholders sufficient warning.

How Does Stemma Calculate "Top 5 Columns", "Frequent Users", etc?

The table details page in the Stemma UI provides usage information, derived as follows:

  • To calculate the top five upstream and downstream columns, Stemma multiplies the number of unique users by the number of queries to get the ranking. The time frame is 30 days.
  • To identify tables commonly usesd with this one, Stemma counts the joins that were part of any query in the past 90 days.
  • To identify frequent users of a table, Stemma counts any user who has run a query on this table in the last 30 days.

BigQuery Integration

Stemma extracts table and column metadata including database, schema, table name, table description, column name, and column description from a Bigquery database.

What We Need From You

You will need to create a service account for reading metadata and grant it “BigQuery Metadata Viewer” access to all of your datasets. This can all be done via the BigQUery UI.

Next, collect the information Stemma needs, as follows:

  • Authorization/Access Scopes:
    Stemma uses the following scopes to extract metadata and usage information.
    — <https://www.googleapis.com/auth/bigquery.readonly>
    — <https://www.googleapis.com/auth/cloud-platform>
  • JSON Key File: Proceed as follows to create the JSON key file for a service account.

Create a service account (via Cloud Console):

  1. In the Cloud Console, go to the Create service account page.
  2. Select a project.
  3. In the Service account name field, enter a name. The Cloud Console fills in the Service account ID field based on this name.
  4. In the Service account description field, enter a description. For example, `Service account for Stemma
    BigQuery Integration
  5. Click Create and continue.
  6. Click the Select a role field, and make sure to select the following:
    — BigQuery Data Viewer (roles/bigquery.dataViewer) – to extract the metadata, watermarks and usage information. Popular tables will be generated using this information.
    — BigQuery User (roles/bigquery.user) – Needed to map the user’s information with the Tables i.e., Frequent Users, Table Owners, etc.
    — Private Logs Viewer (roles/logging.privateLogViewer) – Needed for enriching table usage.
  7. Click Continue, and click Done to finish creating the service account.

Do not close your browser window. You will use it in the next step.

Create a service account Key:

  1. In the Cloud Console, click the email address for the service account that you created.
  2. Click Keys, click Add key, then click Create new key.
  3. Click Create. A JSON key file is downloaded to your computer.
  4. Click Close.

Provide the information to Stemma

To provide this information to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Google Cloud as the Connection type, provide a Connection Name, and paste in the Service Account Key from the JSON file you have just downloaded.
  2. When you are finished, click Save Connection.

Migrate from existing Amundsen

Migrating from Amundsen involves three main steps:

  1. Migrating existing metadata from Amundsen to Stemma
  2. Ingesting metadata continuously
  3. Cut over

1 – Migrating existing metadata from Amundsen to Stemma

In this step, you take all the existing metadata from Amundsen and load it into Stemma. The metadata imported includes, but is not limited to:

  • Table and column descriptions
  • Tags
  • Ownership information

To import this metadata, upload a dump of your Amundsen neo4j metadata into Stemma:

  1. Bring the Amundsen storage offline.
  2. Use dump -> load, available on all neo4j installations, to dump and upload your data.

After import, your Stemma instance will look very similar to your existing Amundsen instance:512

But your data won’t be periodically updated until you enable periodic updates; see the next step.

2 – Ingesting metadata continuously

You need to configure Stemma to ingest metadata updates on an ongoing basis. Once this configuration is done, you won’t need Python databuilder jobs to ingest data.

Information ingested in this step includes:

  • Table and column names
  • Linked issues (JIRA tickets)
  • Frequent users
  • Lineage, if it exists

IIn Stemma’s Admin interface, provide credentials to your data sources to allow Stemma access to extract metadata on an ongoing basis:

374

Stemma will now do the work to “upsert” metadata updates, as they occur, into the data you uploaded from Amundsen in the previous step.

3 – Cut Over

At this point, Amundsen and Stemma are running side-by-side, and Stemma recommends sharing access to Stemma with your power users and getting their feedback to ensure all of their use-cases are supported. Usually one week of overlap to obtain feedback and an additional week to incorporate feedback, if applicable, is sufficient.

When all looks good, simply redirect the URL of your internal Amundsen to the Stemma URL.

If you have any feedback or questions about moving from Amundsen to Stemma integration, please reach out to [email protected]!

How Does Stemma Discover Data Lineage?

When you execute SQL against your warehouse, the database captures the raw SQL and saves it in a historical format that includes the time the SQL command was executed and the user who executed it. Stemma reads these SQL logs and parses all of the commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page.

In the case of table and column lineage, any change to a table– via an insert, update or delete from another table– will create the appropriate table and column lineage tracking.

Note

When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.

See also:

Does Stemma Support User Roles?

Apart from Administrator (see What Can an Administrator Do?), Stemma does not at present support a variety ofuser roles. This means that every user has edit capability; that fosters adoption and comprehensive use, allowing the broad community to get familiar with Stemma and learn its full range of capabilities.

As our user base becomes becomes more expert with Stemma and starts to specialize, we expect to introduce user roles.

Getting Started with the Stemma Business Glossary

What Is a Business Glossary?

A business glossary is a repository of business terms and their definitions that an organization uses to ensure the same definitions are used company-wide when analyzing data. It allows data analysts, business users, and data engineers to agree on the definition of real-world terms, and how those are represented in the data; and serves as the source of truth for those definitions

Used this way, the glossary produces a common vocabulary that everyone in the organization uses. This common vocabulary is a key component of data governance, ensuring that everyone means the same thing when they use the same term. For instance, one department may define the term “customer” as a company while another department may use the term to refer to an individual. A business glossary prevents such discrepancies.

In short, a business glossary provides canonical definitions of common terms and fosters shared understanding of those terms throughout the organization.

The Stemma Business Glossary

This first version of the Stemma Business Glossary allows you to define and group terms that are used in your organization and assign owners to those terms so that the definitions remain consistent and up to date. These terms can be specific to the data sources in the Stemma catalog, or they can be terms that are used throughout the organization or any part of it.

You can also link those terms to table columns in the Stemma catalog.

In future, you will also be able to link terms to other resources in the catalog, and to queries and search results.

Using the Glossary

Populating the Glossary

To begin populating the glossary, proceed as follows.

From the navigation bar of the Stemma UI, choose Glossary:

Choose + New Group or + New Term.

In the dialog window, add a name and a definition or description.

You can use Markdown syntax for the description or definition.
– A group is a collection of related terms. For example, you might want to group terms by department or function– finance, marketing, etc.
– You can add a term to an existing group, or create a new term without assigning it to a group. (You can’t create a group on the fly while adding or editing a term).

– You can add related columns and one or more owners: start typing to find candidates.

When you are finished, click Save.

Updating the Glossary

To update the glossary:

From the navigation bar of the Stemma UI, choose Glossary.

To modify a group, click on the group and then click Edit. You can change the group’s attributes and also add new terms to it.

To modify a term, search for it by name, or click the > next to the group and select the term from the list;. then either:-

  • Click Edit and make your changes to any or all attributes of the term. You can change the group the term belongs to, so long as the new group already exists.

OR

  • Click Delete to remove the term from the glossary.

When you are finished, click Save.

dbt integration

Stemma integrates with your dbt Cloud or Core projects and is able to ingest and keep up to date all of the metadata in the table below.

 dbtStemma
TablesTables
Table lineageTable lineage
ColumnsColumns
Table definitionsProgrammatic descriptions (dbt descriptions section)
Column definitionsColumn definitions

dbt Cloud

Integrating with dbt Cloud requires minimal information and access to the dbt Cloud account of your organization and an administrative access to Stemma.

The following 3 properties are needed from dbt Cloud:

  • Account ID
    • Usually visible within the browser URL on the account settings page: https://cloud.getdbt.com/next/settings/accounts/<account-id-number> or within any any dbt Cloud project: https://cloud.getdbt.com/next/deploy/<account-id-number>/projects/<project-id>.
  • Service Token
    • Create a Service Token with the appropriate permissions:
      • For organizations on the Team Plan, the account wide permissions required are Metadata and Read-Only.
      • For organizations on the Entreprise Plan, the necessary permission set is Account Viewer and Job Viewer for all projects that should have metadata ingestion.
  • Host
    • Unless organization specific customization has taken place, the host is likely cloud.getdbt.com

With these credentials in hand, all that’s left is adding the connection information to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection.
  2. In the pop-up window, choose dbt as the Connection type, provide a Connection Name, and fill in the API Token, Host and Account ID collected from dbt Cloud.
  3. When you are finished, click Save Connection

dbt Cloud integration may take up to 24 hours for initial ingestion.



dbt Core

To support ingesting metadata from dbt artifacts for dbt Core workflows, we require you to generate them daily and upload them to an AWS S3 cloud storage bucket provisioned by Stemma.

Stemma’s dbt integration relies on the following pieces of information:

  • The database name (e.g., “snowflake”, “postgresql”)
  • manifest.json and catalog.json (produced via dbt docs generate)
  • Optional: The base url for your dbt github repository (e.g., “https://github.com/{customer}/dbt-{customer}/tree/main
    “). This allows us to includes github links to your dbt models on Stemma’s table details page.

We recommend testing out the integration first by simply sending the two JSON files to your Stemma contact via Slack or email.

For an ongoing integration, we’ll provide you with the following information for the daily delivery of the dbt artifacts.

  • AWS Access Key pair
  • S3 Bucket (e.g, “s3://{customer}-stemma-integrations”)

The S3 prefix path will be “dbt/{date}” where “{date}” is the current days date in UTC in the format “YYYY-MM-DD”. For example, for a file delivery on 04/01/2022 we would expect to see the following files:

  • “s3://customerxyz-stemma-integrations/dbt/2022-04-01/manifest.json”
  • “s3://customerxyz-stemma-integrations/dbt/2022-04-01/catalog.json”

Using Tableau with Stemma

Stemma works with Tableau to help answer common questions data analysts have, such as:

Does this dashboard already exist?

Stemma indexes Tableau dashboards along with their underlying tables, providing a one-stop search for all data assets. Stemma also tracks how many times a dashboard has been viewed, instantly highlighting which is most commonly used.1400

Is the right data being used?

Stemma automatically correlates your Tableau dashboards to the corresponding table asset by reconciling the table names extracted from the Tableau Metadata API with the existing table metadata Stemma has captured from your warehouse. This allows you to start your exploration by finding trustworthy reports and, in turn, using the lineage to discover which tables are useful to you.1388

You can also explore the table’s lineage upstream or downstream to get an end-to-end view of where the data came from.1391

How can I use the metrics in this dashboard?

Tableau provides Stemma with the SQL that it creates to generate each sheet, and Stemma indexes this SQL so you can easily find the code. You can copy and paste the code into your IDE to jump-start the SQL exploration.1600

Best Practices for Data Ownership

👍 The three types of ownership that work well

It works well to break down ownership of the most important date into three distinct roles:

  • Delivery owner — Ensures that this particular data gets delivered on time, on a promised delivery SLA. Usually, this is a data engineer or analytics engineer responsible for developing and maintaining the pipeline that produces this data.
  • Domain owner — What does this particular value in a field (or column) mean? When does this particular event get triggered? Usually, this is a product engineer who created the event or the analysts and data scientists who use this data most frequently and understand the physical reality that the data represents.
  • Policy owner — Ensures that this data gets used according to the classification and terms associated with it. Sometimes you acquire data from a source that should not be used for a certain category of use-cases. For example, YouTube is permitted to show ads to kids, but not permitted to personalize them. Therefore, the personalization data can’t be used if the subject is a child. The person making these calls is usually not an engineer or data scientist, but someone on the Policy or Privacy team at the company.

👎 Types of ownership that often don’t work as well

  • Overall Quality owner: In practice it’s often hard to find people able and willing to take ownership of the end-to-end quality of a data set. This is because data engineers don’t consider themselves owners of the data that’s produced from the upstream application and don’t want to be responsible for hunting down a website bug that impacts the data in the warehouse. Product engineers, on the other hand, don’t have enough context about how data gets joined and transformed downstream to own the final derived data artifact. This may change as decentralized data management (or data mesh) is deployed more broadly.
  • Shared ownership: This often does not work in practice, though it can work well if the owner group shares a good understanding of the different types of ownership and the group can efficiently redirect each question to the right person.

🏥 The data engineer’s role

In practice, the data engineer plays the role that a triage nurse would play in an Emergency Room. When an issue arises (like a patient arriving in the ER), the data engineer triages to see what’s going on. Sometimes it’s a problem the engineer can fix), so they fix it and resolve the issue (like the nurse treating an injury on the spot). In other cases, the engineer redirect the issue to the appropriate owner (as in a referral to a doctor or other health practitioner). In short, the issue determines who the owner should be.

Using Stemma

Use Stemma to assign and document ownership. Divide ownership as follows:612

Stemma makes it easy to assign table ownership broadly: see Assigning owners in bulk

Delta Lake Integration

Proceed as follows to create a Delta Lake connection to Stemma.

Step 1: Generate a Personal Access Token

Follow these instructions to generate the Delta Lake personal access token that Stemma will need to extract your data into the catalog.

Step 2: Provide the Personal Access Token and Hostname to Stemma

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Delta Lake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows:
  • Provide the Personal Access Token you created in Step 1.
  • Provide the fully-qualified domain name of the Delta Lake host; for example mydlhost.mydomain.com.
  1. When you are finished, click Save Connection.

How Can I Tell What Stemma Automates and What I Need to Curate?

Automated documentation


Stemma provides out-of-the-box integration with applications such as Snowflakedbt, and dashboarding systems such as Looker to capture rich metadata from each of these systems automatically. For example, in the case of Snowflake, Stemma reads both the information_schema and the query logs, and automatically generates important information such as when a table was last updated, table-to-table lineage, table-to-dashboard lineage, frequent queriers of the table, common join and filter conditions, etc.

The Stemma UI makes it easy for you to discover this important information about your tables and columns, and also provides a convenient way to add and modify curated information.

Stemma also allows linking of existing Slack conversations about data to the data catalog.

Curated documentation

Items you need to curate include editable descriptions, tags, and owners. All of these can also be modified by APIs. APIs can also ingest custom metadata.

In the Stemma UI, items that you can curate will show a pencil when you hover your cursor over them; for example:389

More information here.

If there's a Curated and an Ingested Description, Which Will I See in Stemma?

If Stemma ingests the description of a table or column (for example, from Snowflake) and you subsequently edit or replace that description in Stemma, Stemma users will see that curated description from then on, even if Stemma subsequently re-ingests the original Snowflake description (or any other version of it).

At present, Stemma does not provide a way for you to see both the Stemma description and the ingested description side by side.

Assigning Ownership of Tables

You can assign ownership of a single table to individuals or teams. You can also assign owners to a set of tables in bulk, adding all the owners in a single transaction.

Note: You can assign owners to tables in Stemma, but not to dashboards. Stemma ingests ownership of dashboards from the underlying BI tool(s) and you need to make updates and modifications in those tools.

The importance of ownership

Ownership is one of the fundamental pillars of a data catalog. It defines responsibility by individual users and teams, adding tremendous value to the data in the catalog, including:

  • Delegation of duties: Which users are responsible for keeping this table properly documented?
  • Compliance with processes and procedures: Has the right team updated the documentation for this table before the newest pull request is merged?
  • Reporting and analytics: Which teams have the highest percentage of their assets documented? The lowest?

For advice on assigning ownership, see Best Practices for Table Ownership

Assigning ownership to an individual table

You can assign ownership of a table either to individuals or teams:

  • Individuals are the people in your company who use Stemma. You can find individuals by their email address or a Slack handle.
  • Teams are represented in Stemma by Slack channels.

To assign one or more owners to an individual table, choose the table in the Stemma UI and click on Add Owner at the bottom of the left panel.

But Stemma also allows you to bulk-assign table ownership across the catalog. You can do this in a single transaction, immediately allowing your users to identify the assets that belong to them.

Assigning owners in bulk

You may think about ownership as a hierarchy, or your teams may typically own all of the tables within a single database or schema, or perhaps your analysts use Dbt to create their own workflows and create tables with specific naming conventions (e.g. marketing_salesmarketing_prospects). Regardless of how your company thinks about grouping assets, Stemma can help you to bulk-assign owners, using advanced search to find any of the assets you want to update and assigning the owners with a single command.
To assign ownership of tables in bulk, proceed as follows:

  1. From the home page of the Stemma UI, navigate to Advanced Search.
  2. At the top of the left panel, make sure only the Tables resource is checked (you can’t bulk-assign owners to dashboards).
  3. Optionally use the filters in the left panel to limit the search results to the set of tables to which you want to bulk assign ownership:
    • Stemma suggests you limit your search results using only the filters on the left; while you can also use the search bar at the top, doing so makes it more likely that tables that you don’t want included slip into the results
  4. Click on Bulk Edit at the bottom right of the page and choose Add Owners.
  5. In the dialog box, enter or select an email address or Slack channel and choose Add.
  6. To add more owners, repeat Step 5 for each new owner.
  7. When you are finished, click Add Owners and then Done.

Stemma automatically pulls in the teams you’ve defined as well as the list of users in your organization:

Note: When you assign ownership in bulk, this does not trigger automatic messages to the owners and users; that only happens when you change to assign ownership individually.

Displaying Data Lineage in the Stemma UI

Tables

  1. Navigate to the details page for the table.
  2. Click on Lineage graph in the menu bar at the top of the page.
  3. You will see something like this:

From here you can:

Adjust the view: zoom in, zoom out, center, and fit to the page:

then choose Table Details to Table Lineage

See the table’s details page in the Stemma UI and a new lineage graph for each upstream and downstream table: click on the arrow and then choose Table Details or Table Lineage:

See and highlight the lineage of each table:

Collapse tables into groups: first hover over the logo:

hover over the logo

Then click the stack:

click the stack

To expand the groups again:

First click the down arrow
Then click the stack

Search, for an exact match only, on table properties. (In this example, statewide_testing will match, but not  statewide testing):

Columns

  1. Navigate to the details page for the table.
  2. Click on Columns in the menu bar at the top of the page.
  3. Click on the up-arrow next to the column whose lineage you want to see.
    Stemma shows the top five upstream and downstream columns.

Dashboards

  1. Navigate to the details page for the dashboard.
  2. Click on Lineage graph in the menu bar at the top of the page.
  3. You will see something like this:

From here you can:

Adjust the view: zoom in, zoom out, center, and fit to the page.

See the tables used by the dashboard and the columns used from each table:

See and highlight the lineage of each table:

Show the table’s lineage in an individual graph and the table’s details page in the Stemma UI; click on the arrow and then choose Table Details or Table Lineage:

Search, for an exact match only, on table properties (for example, statewide_cases will match, but not  statewide cases).

Collapse tables into groups and expand the groups again (see Tables).

Using Data Lineage Graphs

Use Data Stemma’s Lineage Graphs to:

Some examples

  • An analytics engineer builds a table using Dbt; that table becomes popular and several dashboards are built on top of it. It becomes well trusted, and the logic becomes a standard for calculating certain metrics. But the table is missing several important entity IDs (foreign key columns) which makes it difficult to join to other tables. The analytics engineer and a data engineer work together to complete a “micro migration” to a new, more performant and reusable table.
  • A machine-learning model is used to predict churn for a customer. Each day after the model runs, users that are predicted to churn are identified in a table in Snowflake and then sent an email to re-engage them. The product team is launching the second major product and wants to measure churn on a per-product basis. The engineering team updates their data pipeline to add a new column to the output table that identifies the product that churn is being predicted for.
  • An analyst created several views to consolidate the common logic that the team were using to query tables in their dashboards. But the performance of the views has deteriorated as the amount of data has grown. The analyst and the engineer work together to create materialized versions and to update all of the dashboards using the prior views.

Use Data Lineage in Stemma to Identify Affected Assets

The challenge with such changes is that it is hard to accurately curate a list of all assets that will be affected. If you cannot be certain of what will need to be changed or what could break, you cannot do risk analysis or even make a time estimate. Stemma enables you to do the analysis and create the appropriate messaging.

The key question for this type of change is “What tables and dashboards will I need to update if I rename this view?”

Stemma provides a lineage graph for each table and dashboard. From here you can see who and what is utilizing your resources downstream, and how popular your assets are. Data owners can use this in conjunction with their own knowledge to build roadmaps and plan how data should evolve.

Use Stemma Messaging to Communicate Expected Changes

Everyone affected needs to know what is going to change and when. After building your list of affected assets and their users, use Stemma messaging to provide advance notice and instructions.

How Stemma Discovers Data Lineage

When you execute SQL against your warehouse, the database captures the raw SQL and saves it in a historical format that includes the time the SQL command was executed and the user who executed it. Stemma reads these SQL logs and parses all of the commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page.

In the case of table and column lineage, any change to a table– via an insert, update or delete from another table– will create the appropriate table and column lineage tracking.

Note

When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.

See also: How Stemma Captures Data: SQL Parsing Overview and Caveats

Google OIDC Integration

This document walks you through the steps to create and configure a Google OIDC application that Stemma can use to authenticate users.

Note: You must be a Google administrator to do this.

Steps

  1. Log in to Google Developer Console
  2. On the left, select Credentials
  3. Select + CREATE CREDENTIALS at the top

1664

  1. Select “OAuth Client ID”

948

  1. Enter the following information:
  • Application type: Web application
  • Name: You can choose the name; we suggest “Stemma OIDC” as a descriptive example
  • Authorized JavaScript origins: This must be “https://your-company.stemma.ai
    • Replace your-company with the correct value from your Stemma domain
    • If you have a custom domain name for Stemma, use that value here
  • Authorized redirect URIs: Provide the following URLs; make sure you replace your-company with the proper value:

1034

  1. Select Create at the bottom.

758950

  1. Download the Client ID and Client Secret JSON file and provide them to Stemma by email or Slack.

How Do I Add a User?

At present you need to contact Stemma support ([email protected]) with the names of the users you want to add. Stemma will add them for you, and you then need to authenticate them, making sure they have the access they need to do work in Stemma (see the sections on integrating SSO apps listed here) and that Stemma user names and authorities map properly to users of integrated applications such as Snowflake.

Hive Integration

What we need from you

Stemma needs certain information and credentials to extract Hive metadata into the catalog. Contact Stemma by email or Slack, and provide the following:

  • Metastore Host: Hostname is the IP address of the Hive server to which you are connecting.
  • Username and Password: You will need to provide a username and password for Stemma to use to access the Hive schema.
  • Metastore Port: Server port used for accessing metadata about hive tables and partitions. The default Hive metastore port is 9083.
  • List of Databases: Stemma whitelists databases, and so you will need to provide a list of the databases we will be importing.

Metadata extracted

The metadata Stemma extracts includes:

  • TBLS – stores basic information about Hive tables, views, and index tables.
  • DBS – stores the basic information of all databases in Hive.
  • PARTITION_KEYS – the field information of the table storage partition.
  • TABLE_PARAMS – stores the attribute information of the table/view.
  • SDS – saves the basic information of file storage, such as INPUT_FORMAT, OUTPUT_FORMAT, whether it is compressed or not.
  • COLUMNS_V2 – stores the field information corresponding to the table.
  • PARTITIONS – stores the basic information of table partitions.

How Do I Update an Existing Connection?

Updating a Connection

At present you need to add a new connection and delete the old one. You need Stemma’s help to do this.

First add a new connection as follows:

  1. In the Stemma UI, choose Admin from the toolbar at the top.
  2. Click the button for Add a New Connection.
  • Do not try to edit the existing connection.
  1. In the pop-up window, choose the connection type and provide the information as prompted. See What We Need From You.
  2. Click Submit.
  • This creates a new connection that is not yet active.

Now, contact Stemma support ([email protected]) with a request to activate the new connection. Stemma will let you know when the connection is active.

Once you have verified that the new connection is active and working properly, you can delete the old connection:

  1. In the Stemma UI, choose Admin from the toolbar at the top.
  2. Click the Delete button next to the connection you want to delete, and in the pop-up window confirm that you want to delete it.

Sharing additional metadata via CSV files

Along with a direct connection to various data sources, Stemma can ingest data using extracts in various file formats. One of the most commonly used file formats is CSV.

Below are some of the CSV Formats Stemma ingests today (along with the sample files), with the column names and values we expect in each file. Note that you need the column header literals (for example,db_name, etc.) as well as the values.

Github Links

Used to provide a link to a Github source from Stemma’s Table detail page.

db_nameclusterschematable_namesourcesource_type
hivegoldtest_schematest_table1https://github.com/amundsen-io/amundsen/github

db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
source: the Github URL for the repository
source_type: static value, “github”

Airflow

Used to generate a link between a table and an Airflow task. Once the data is ingested, you will see an “Airflow” button on the Stemma detail page for the table.

task_iddag_idexec_dateapplication_url_templatedb_nameschematable_namecluster
hive.test_schema.test_table1event_test2018-05-31T00:00:00https://airflow_host.net/admin/airflow/tree?dag_id=SUPER_AWESOME_DAGhivetest_schematest_table1gold

task_id:: the Airflow ID for the task
dag_id: the Airflow ID for the DAG
exec_date: the most recent execution timestamp for the task
application_url_template: the URL for the DAG in your Airflow instance
db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
cluster: the name of the database within the data source.

Dagster

Used to link a Dagstr op and/or job to a table in Stemma. Once the data is ingested, you will see an “Dagstr” button on the Stemma detail page for the table.

application_url_templatedb_nameclusterschematable_name
https://dagster_host.net/dagster/tree?job=dagster_jobsnowflakeprodtest_schematest_table1

application_url_template: the URL for the DAG in your dragster instance
db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.

Dbt Cloud

Used to link a Dbt Cloud job to a table in Stemma. Once the data is ingested, you will see a “Dbt” button on the Stemma detail page for the table.

application_url_templatedb_nameclusterschematable_name
https://dbtcloud_host.net/dbtcloud/tree?job=dbtcloud_jobsnowflakeprodtest_schematest_table1

application_url_template: the URL for the DAG in your Dbt Cloud
db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.

Table Tags

Stemma can create and assign tags via a CSV file, with the following format:

db_nameclusterschematable_nametags
hivegoldtest_schematest_table1“tag1,tag2”
hivegoldtest_schema_2test_table2“tag1,tag3”

db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
tags: the tags to be created. Tags are additive: existing tags on tables will not be modified.

Table Description

db_nameclusterschematable_namedescription
hivegoldtest_schematest_table1Description for table, optionally markdown.

db_name: the data source name, for example, hivesnowflakeathenaredshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
description: the table description. Note that descriptions are destructive: if a description already exists, it will be overwritten.

Table Owners

db_nameclusterschematable_nameowners
hivegoldtest_schematest_table1[email protected],[email protected]
hivegoldtest_schema_2test_table2“#test-channel,#test-channel-2”

db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
owners: email addresses or slack channels. Owners are additive: existing owners of tables will not be modified.

Column Descriptions

db_nameclusterschematable_namecol_namedescription
hivegoldtest_schematest_tabletest_columnThis is an example column description.

db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
col-name: the column described.
description: the column description. Note that column descriptions are destructive: if an existing description is present, it will be overwritten.

JIRA Integration

This document walks you through the steps you need to take to integrate Stemma with your JIRA installation. You must be a JIRA Admin to do this.

Steps

Step 1 – Collect information about your JIRA environment

Collect the following information to provide to Stemma (see Step 3):

  • The URL for your JIRA environment (e.g., <https://jira.net>)
  • The project ID for the JIRA project you would like Stemma to create tickets in
    See this page for guidance from Atlassian on how to obtain the project ID for a JIRA project
  • If using JIRA Server or Cloud, the ID for the JIRA Bug issue type
    • Note: Skip this step if using JIRA on-premise
    • See this page for guidance from Atlassian on how to obtain the ID for the Bug issue type
  • Optional: The name of one or more labels you would like set on tickets created by Stemma

Step 2 – Create a service account

Stemma will need an email and API token to authenticate to your JIRA environment. Our recommendation is to create a JIRA service account so that the authentication is not tied to an actual user’s account.
a. Go to Administration > User management > Users
b. Click Create User and fill out the required information
c. Create an API token from the service account

  • For JIRA cloud see this page
  • For JIRA server see this page
  • For JIRA on-premise this feature is not currently supported so you will provide the password instead

d. Grant permission to the service account

  • The service account will minimally need access to create and view tickets in the JIRA project you intend to use
  • You can either grant permissions directly to the service account or you can grant the service account a role

Stemma needs:

  • The email address of the service account
  • The API token or password

See Step 3.

Step 3 – Provide the information to Stemma

Provide the information from Steps 1 and 2 to Stemma by email or Slack.

Linking JIRA Issues to Stemma

To link a JIRA issue to a table in the Stemma catalog, you need to include a Table Key tag in the issue description. The tag format is Table Key: {SOURCE}://{DATABASE}.{SCHEMA}/{TABLE} [PLEASE DO NOT REMOVE].

Finding Important Information about Tables and Columns

Stemma provides out-of-the-box integration with applications such as Snowflakedbt, and dashboarding systems such as Looker to capture rich metadata from each of these systems automatically. For example, in the case of Snowflake, Stemma reads both the information_schema and the query logs, and automatically generates important information such as when a table was last updated, table-to-table lineage, table-to-dashboard lineage, frequent queriers of the table, common join and filter conditions, etc.

The Stemma UI makes it easy for you to discover this important information about your tables and columns, and also provides a convenient way to add and modify curated information.

The table details page provides following information:

Table attributes

Asset status:

This can be CertifiedDeprecatedIntermediate, or No Status:

336

Description:

A brief summary of the contents of the table:

338

See Advanced Description Editing for more information.

Issues:

JIRA issues logged against this table:

441

This section explains how to link a JIRA issue to the Stemma catalog.

Owners:

The person or people responsible for maintaining this table:

441

Click on an owner to see other assets owned by that user, as well as their bookmarks and frequently used tables. You can also email the user via this profile page.

See also Assigning Bulk Ownership of Tables.

Last Updated:

The date and time of the last update to the table (this can be different from the most recent date covered by the data itself):426

Date Range:

The date range covered by the data in the table.

173

Note that this need not be synchronous with the Last updated date and time, which simply reflects the date of the most recent modification to the table.

Tags:

Descriptive tags for cross-referencing, such as covidcases, etc.:

187

Frequent Users:

People who have frequently used this table:

288

Click on a user to see assets owned by that user, as well as their bookmarks and other frequently used tables. You can also email the user via this profile page. See also How Does Stemma Calculate “Top 5 Columns”, “Frequent Users”, etc?

Commonly Used With

Links to tables often used with this one (see Using Commonly Used With). See also How Does Stemma Calculate “Top 5 Columns”, “Frequent Users”, etc?

288

Table lineage graph:

See Displaying Data Lineage in the Stemma UI :

Dashboards:

Dashboards that use the table:864

See also Finding Important Information about Dashboards.

Upstream, Downstream:

Upstream and downstream tables:

:576576

See also Displaying Data Lineage in the Stemma UI

Slack Conversations:

Slack threads that mention this table:864

Column attributes

Column description:

Brief description of the column contents:864

See Advanced Description Editing for more information.

Data type:

Text or Number:

750

Column Statistics:

For example, minimum and maximum values over a period:720

Shared descriptions:

See Sharing Column Descroptions.

Column usage and lineage:

The most-used upstream and downstream columns, as measured by the number of queries, dashboard views, catalog views, etc.:

288

See also Tracking Data Lineage and How Does Stemma Calculate “Top 5 Columns”, “Frequent Users”, etc?.

Curated versus automated metadata

Where possible, Stemma captures the information directly from the source. This table shows which attributes are populated from the source and which you can edit:.

Table MetadataAutomatically captured?Users can edit?
Table description.
Asset status.
Owners✅ (dbt)
Tags✅ (dbt)
Jira issues
Last updated time
Data date range
Commonly used with tables
Frequent users
Upstream & downstream table lineage
Dashboards that use the table
Slack threads with table mentions

Notes

Editing curated items

To add or edit a curated item, hover you cursor over over it until you see a pencil; then click on the pencil and add or edit text in the resulting box.389389

Using Commonly Used With

Clicking on a link below this heading opens a dialog window. This window provides details of the usage; for example:1008

Looker Integration

This document walks through the steps required to create a service account in your Looker, which will be connected to your Stemma Instance for dashboard metadata extraction.

Stemma uses the Looker APIs (via Looker Python SDK) to extract the metadata of Folders, Dashboard, Dashboard Elements and Explore Queries, and data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Looker. Read Looker’s docs: https://docs.looker.com/reference/api-and-integration

Stemma takes leverage of the API3 keys (Client ID & Client Secret) to make a secure connection between your Stemma and Looker instances. Below you can find out the steps required to create the new API3 keys for this integration.

What is extracted?

Stemma extracts information about Folders, Dashboard, Dashboard Elements, Explore Queries, and data sources associated with them.

Stemma will not extract Dashboards that are deleted, hidden or in a user’s personal folder.

Steps

Step 1: Make sure you have a Looker Admin role

Looker → Stemma integration requires an Admin role to fetch the correct data and build the relationships between the entities like Table Lineage, Dashboard Owners, etc.

So, as a first step, make sure you have an Admin role.

Step 2: Generate API3 keys

The next step is to create an API3 Key. From the Admin menu, navigate to the Users page.

Find the user you want to use to generate the API3 key, and click the “Edit” button under the Actions column.

Once you are on the “Edit User” page, find the API3 Keys settings, and click “Edit Keys

Click “New API3 Key” to generate a new Client ID and Client Secret.

Step 3: Provide the details to Stemma

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Looker as the Connection type, provide a Connection Name, and then fill out the fields in the dialog box as follows:
    • Base URL
      Example: <https://stemma.cloud.looker.com/>
    • Client ID
      The one you generated in Step 2
    • Client Secret
      The one you generated in Step 2
  2. When you are finished, click Save Connection.

What Effect Does Deprecating a Table Have?

You can set the status of a table or dashboard as deprecated or intermediate. When you do so, that asset remains in the Stemma catalog, but is removed from default searches (you can change this by means of filters in Advanced Search).

When you deprecate an asset or change its status to intermediate, you should modify its description accordingly, and you should also be careful to message all users and owners of the asset (including users who might be using the asset indirectly, such as dashboard users of a table) to warn them about the change in status. This allows you to make sure you hear from all stakeholders before you remove or move the asset.

What Happens when a Table or Dashboard Is Removed?

Removing a table or dashboard causes it, by default, to be removed from Stemma after ten days. Similarly, if the table or dashboard is renamed or moved, the original entry is removed from Stemma after ten days.

Before removing an asset, you should deprecate it to give stakeholders sufficient warning.

Advanced Description Editing

The Stemma UI provides a WYSIWYG editor for descriptions of tables and columns.

Using the editor

  1. To begin editing, navigate to the table or column whose description you want to edit, and click on the pencil icon next to Description:

This opens up the text box for editing:

  1. Add or change the description.
    • Descriptions can include links and images as well as text.
    • You can choose font styles as indicated by the buttons.
    • Use the pull-down menu to toggle the text format from the default (Normal) to common formatting elements such as headings, lists, quotes, and code blocks.
      • When you choose Code Block, you can choose the type of code from the secondary pull-down menu (the default is JavaScript).
    • You can add hyperlinks: enter and highlight the text to display, then click the 🔗button and enter the URL (and edit it if necessary) in the text box next the button.
    • You can also add links to other tables.
    • The editor also supports Markdown syntax, as well as any shortcuts your system supports (such as ⌘-b for bold on a Mac)
  2. When you are finished, click Update to save your changes.

Linking to other tables from the description

This is useful, for example, if you want to give other users more information about the relationship between the tables.

To link to another table from the description, do the following:

  1. In the Stemma UI, navigate to the table whose description you want to edit.
  2. Click on the pencil icon next to Description.
  1. Enter the new desription providing the related table’s<schema_name>.<table_name> anywhere in the text.
  1. Click Update to save your changes.

The new description will contain the link to the related table:

Sharing column descriptions

You can share a column description with all columns of the same name across the catalog. To do this:

  1. In the Stemma UI, navigate to the table whose column description you want to share.
  2. To add or modify a description, click on the down-arrow to the left of the entry for the column:
  1. Hover over the word Description and click on the pen that appears:
  1. Add or update the description and click on Update. Stemma automatically discovers all columns with the same name, and the description propagates to all linked columns. To see which columns are linked, and add and remove links, click on the three dots to the right of the column entry:

Click on the Edit linked columns pop-up. In the dialog window, you can:search for and add columns to link, and remove existing links.

When you have finished, click Done.

Mode Integration

The Stemma Mode integration accesses the Mode Discovery API. This API is distinct from regular Mode API: it is much more performant but only contains a subset of the available endpoints. This means Stemma is able to access only metadata about reports, not the report runs/results. Stemma also accesses the regular REST API to augment the information retrieved from the Discovery API. In all, there are four values required by Stemma:

  1. Your organization name in Mode
  2. The access token to access the REST API
  3. The password token to access the REST API
  4. A signature token to access the Discovery API

Optionally, if you would like Stemma to only extract metadata from specific Mode spaces you may also provide Stemma a list of the spaces to include.

Creating the Access Token & Password Token

You can learn about how to create this access here (https://mode.com/developer/api-reference/authentication/).

The access token and password token allow Stemma to augment the information within the Discovery API with the regular REST API to unlock two additional capabilities:

  1. The ability to link your dashboards to the underlying tables in your warehouse which are used to serve the dashboards.
    • Stemma will use the general Mode API for datasources to retrieve your data source information to find out the name of the database (you may use, PUBLIC, mktg, etc.) as well as the type of data source (e.g. Snowflake, Redshift, etc.) in order to apply the proper SQL parsing semantics.
  2. The ability to rank your dashboards by the number of times the dashboard has been viewed in the past 90 days.
    • Stemma will use the general Mode API for reports API to retrieve an accurate view count.

Creating the Signature Token

To authenticate with the Mode Discovery API a signature token must be created. This signature token can be created by running a curl command using the account’s REST API access token and password secret. The following code is a snippet from Mode that has been slightly adjusted to create a signature token that Stemma can use. Make sure to replace {organization} , {your-access-token} and {your-access-secret} with the appropriate values (organization can be found in the URL when you visit Mode, e.g. https://app.mode.com/<organization>/spaces). Visit the Mode documentation for information on creating the api-key and api-secret.

curl --location --request POST 'https://app.mode.com/batch/{organization}/signature_tokens' \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--user '{access-token}':'{your-access-secret}' \
--data-raw '{
    "signature_token": {
        "name": "stemma-api-token",
        "expires_at": "2024-10-02T21:00:00+0000",
        "auth_scope": { "authentication_for" : "batch-api", "authorization_type": "read-only" }
    }
}';

Please note, if this does not work it likely means you do not have permissions. Please contact Mode support with help creating a read-only signature token.

APIs used

APIs are listed in the table below. In the table below, Data Extracted is the data we actually store from the API (and may change as we add functionality). Risks highlights any potentially-sensitive information in the API (whether or not we actually store it).

Please note, that the “Report Stats” endpoint is not enabled by default in mode, so functionality coming out of it may not be available, until the endpoint is enabled.

NameDocsData ExtractedRisks
Chartshttps://mode.com/developer/discovery-api/analytics/charts/Chart name, type, query idLow: contains employee email addresses
Reportshttps://mode.com/developer/discovery-api/analytics/reports/Execution times, dashboard names, dashboard descriptions, creation/last modified times, owner email addresses,Low: contains employee email addresses
Report Statshttps://mode.com/developer/discovery-api/analytics/report-stats/Usage countsLow[1]
Querieshttps://mode.com/developer/discovery-api/analytics/queries/Name, Raw SQLModerate[2,3]
Membershipshttps://mode.com/developer/discovery-api/analytics/members/Mode usernamesLow: contains employee email addresses
Reports: detailedhttps://mode.com/developer/api-reference/analytics/reports/Dashboard view countsLow: contains metadata about the report itself, not the underlying executions of the report
Data Sourceshttps://mode.com/developer/api-reference/management/data-sources/Mode data source name and data source type.Low: contains access to information about the data source. Stemma should separately have access to this information but this API access is required to link a mode dashboard to a specific database.

Notes

Here are some additional notes about possibly sensitive information we may receive from Mode

  1. Dashboard execution times are non sensitive on their own, although it is possible to infer NPMI from who is running which dashboard queries at precise times. Stemma only displays aggregated forms of this information, but stores it disaggregated.
  2. Query names may contain NPMI (e.g. a query titled “EMEA Expansion Potential”, when the company hasn’t announced its intention to expand into EMEA).
  3. Raw SQL queries may contain sensitive information (PII), typically in the form of filter clauses. This is generally not allowed under most security policies especially under Open Collections, however, it is a possibility. By default, these SQL statements will be shown to users in Stemma.

Configuring Your Stemma Connection

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, optionally add a name for the connection, and choose Mode from the Connection Type drop-down menu.
  1. In the dialog window, provide the signature token and Mode organization name, and optionally the access token and password token..
  2. When you are finished, click Save Connection.

How Does Stemma Calculate "Top 5 Columns", "Frequent Users", etc?

The table details page in the Stemma UI provides usage information, derived as follows:

  • To calculate the top five upstream and downstream columns, Stemma multiplies the number of unique users by the number of queries to get the ranking. The time frame is 30 days.
  • To identify tables commonly usesd with this one, Stemma counts the joins that were part of any query in the past 90 days.
  • To identify frequent users of a table, Stemma counts any user who has run a query on this table in the last 30 days.

Finding Important Information about Dashboards

Stemma provides out-of-the-box integration with dashboarding systems such as Looker to capture rich metadata from these systems automatically.

The Stemma UI makes it easy for you to discover this important information about your dashboards, and provides a convenient way to add and modify curated information.

Status:

This can be Certified, Deprecated, Intermediate, or No Status:

288

Description:

A brief summary of the contents of the dashboard:

360

Owners:

The person or people responsible for maintaining this table:

288

This information is extracted automatically from the source and (unlike table ownership) can’t be edited in Stemma.

Tags:

Descriptive tags for cross-referencing, such as Covid, cases, etc.:

288

Created:

The date and time the dashboard was created:

360

Last successful run:

Date and time the dashboard was last run successfully:

360

Last Updated:

Date and time of the last update to the dashboard:

360

Last run:

Date and time the dashboard was last run (successfully or not):

360

Total View Count:

Total number of views over the life of the dashboard:

360

Tables:

Tables used by the dashboard:720

Charts:

Charts available from the dashboard:

360

Queries:

Queries run from the dashboard. Click on a query name to see the code:432

Open folder:

Open the dashboard’s folder in the underlying application (you may need to log in to the application).

Open dashboard:

Open the dashboard in the underlying application (you may need to log in to the application).

Okta Integration

This document walks you through the steps to create and configure an Okta application that Stemma can use to authenticate users.


Note: You must be an Okta administrator to do this.


Steps

  1. Log into Okta admin portal
  2. In the left frame, select Applications :
  1. Select Create App Integration:
  1. Select the options for OIDC – OpenID Connect and Web Application.
  1. In the New Web App Integration, enter the following values:
    • App Integration Name: Your app name, we suggest Stemma Data Catalog
    • Grant Type: select the following –
      • Authorization code
      • Client Credentials (Client acting on behalf of itself)
      • Refresh token
    • Sign-in redirect URI:
      • Use the fully qualified hostname and add the endpoint:
        • /auth
        • If you are running a hosted version of Stemma, this should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name
    • Sign-out redirect URI
      This should be the base Stemma URL. Example:
      • https://<customer>.stemma.ai
    • Trusted Origin:
      • If you are self-hosting Okta, provide the Base URI for your Okta Sign domain, otherwise, leave this blank
    • Assignments:
      • Select the assignment access you would like within your organization
  1. Stemma will require access by certain APIs to retrieve user information from your organization’s directory.
  • Navigate to Okta API Scopes :
  • Make sure to grant permission for the following scopes:
    • okta.users.read
    • okta.users.read.self
  1. Provide the following values to Stemma by email or Slack so that authentication can be built into your Stemma deployment:
    • Okta Discovery endpoint (well-known configs)
    • Client ID
    • Client Secret

752

Using Admin Functions and Reports

API Keys

Allows you to add and delete keys for the Graphql API.

Connections

Allows you to add, modify and delete connections to Stemma. For instructions on adding connections, see Integrating Your Applications with Stemma: What We Need from You. See also How Do I Update an Existing Connection?

Reports

These reports provide an account of your organization’s interactions with the Stemma catalog, including curating work that needs to be done, as well as usage.They can be especially useful when you are rolling our Stemma to your users and scoping the work remaining to do.

  • Customer Summary
    Shows the number of data sources, tables, and dashboards that are integrated
    into the Stemma catalog.
  • Monthly product usage
    Shows the number of users and searches by month.
  • Documentation summary
    Shows the percentage of columns, dashboards, and tables missing documentation and/or ownership.
  • Assets without docs
    Shows the specific tables and dashboards that lack documentation. Click on the drop-down to switch between the two types of asset

864

  • Assets without ownership
    Shows the specific tables and dashboards that lack ownership (click on the drop-down to switch between the two types of asset).

288

Redshift Integration

What we need from you

Stemma needs certain information and credentials to extract Redshift metadata into the catalog. To provide these, proceed as follows.

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Redshift as the Connection type and then provide the following information in the dialog box:
    • A username and password for Stemma to use to access the INFORMATION_SCHEMA
    • The hostname of the Amazon Redshift cluster. You can find it in the Properties tab of your Redshift Cluster Details page. Under the Connection details section, copy the Endpoint up to but not including the colon.
    • A port number. The default port number for Redshift is 5439. From your Redshift Cluster Details page, you can find the port number under the Database configurations section of the Properties tab.
    • A list of databases: Stemma whitelists databases, so you will need to provide a list of the databases we will be importing.

Access

The Stemma Redshift integration requires access to two pieces of information within the Redshift install:

  • information_schema metadata
  • Query logs

information_schema

The specific tables and views we access include:

  • PG_GET_LATE_BINDING_VIEW_COLS
  • svv_external_columns
  • COLUMNS

Because of the way permissions work in Redshift, these tables are only populated with data to which the user has SELECT access. If you don’t want to use column stats, and therefore want to restrict access for this credential, Stemma will give you the SQL code required to create a materialized view. You create that materialized view as a superuser, and grant our credential the ability to read and refresh that materialized view.

Query logs

Stemma Redshift requires query logs. We can extract these in two ways: push or pull.

Push

The recommended integration involves configuring Redshift to dump user activity logs into S3.

There are multiple methods to achieve this, either using CloudTrail or directly from Redshift. the exact right way differs between organizations (many already have audit logging configured for other purposes).

Pull

Stemma can extract logs from STL_QUERY. However, because this option requires the credentials be a superuser, it is not the recommended integration method.

Column stats – Optional

The Stemma integration user must have full SELECT access to any tables or views for which you want to see column stats.

How Stemma Captures Data: SQL Parsing Overview and Caveats

What Stemma captures

When you execute SQL against your warehouse, the database captures the raw SQL that was executed and saves it in a historical format that includes the time that the SQL command was executed as well as the user who executed it. Stemma reads these SQL logs and parses all of the SQL commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page:

  • Table and column lineage: any change to a table– via an insert, update or delete from another table– will create the appropriate table and column lineage tracking.
  • User table usage: Stemma uses select statements to associate users to the table(s) being queried; this provides the Frequent Users information on the Table Details page, search rankings, and the user’s profile page.
  • Commonly Used With: any joins that occur within the SQL, whether on the top-level query or as part of a sub-query, are captured and presented on the Table Details page to help other users understand how two or more tables are commonly joined.

Caveats:

  • When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.
  • When user IDs in the database do not match the user portion of the user’s email address as recorded in Stemma, Stemma cannot associate the Stemma user to the user in the query logs. This prevents Stemma from creating any user-table usage information.
  • When system accounts are used to query the data, Stemma is often not able to see which user is executing the query. This is a common practice with BI tools when multiple team members share the same connection to a database and that connection uses a system account.
    • However, if you are using Mode, Stemma can parse the comments appended by Mode (which do contain the user ID) to find the user and associate that user to the query.
  • Commonly Used With does not capture joins where the left or right table is a dynamically generated select statement, for example: select * from table_a a join (select * from table_b b1 join table_c c1 on b1.id = c1.id) b on a.id = b.id.
  • Stemma currently supports Snowflake out of the box. Stemma can also parse RedshiftAthena, Postgres and other databases, but does not provide turnkey support for those databases at present.

Slack Integration

This document walks through the steps required to install a Stemma Slack application in your workspace.

Prerequisite

Stemma does not publish the Stemma Slack application to the Slack Marketplace. To install this application for your workspace, you need to create a new application using an application manifest file.

If you have an application manifest file, make sure you have it handy before you begin. If you don’t have a manifest file, you can copy and paste the following into a file:

_metadata:
  major_version: 1
  minor_version: 1
display_information:
  name: Stemma
  description: Help you enrich your metadata directly via Slack
  background_color: "#0a1930"
features:
  bot_user:
    display_name: Stemma
    always_online: true
  slash_commands:
    - command: /stemma
      url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/slash/help
      ## Replace {SUBDOMAIN} with your Stemma subdomain.
      description: Learn how StemmaBot works
      usage_hint: help
      should_escape: false
oauth_config:
  scopes:
    bot:
      - app_mentions:read
      - channels:history
      - channels:read
      - chat:write
      - commands
      - groups:history
      - groups:read
      - links:read
      - links:write
      - users:read
      - users:read.email
settings:
  event_subscriptions:
    request_url: https://{SUBDOMAIN}.stemma.ai/stemma/slack/events
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
    bot_events:
      - app_mention
      - link_shared
      - member_joined_channel
      - message.channels
      - message.groups
      - team_join
  interactivity:
    is_enabled: true
    request_url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/interactivity
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
  org_deploy_enabled: false
  socket_mode_enabled: false
  token_rotation_enabled: false

Save the file with any prefix and in any location you like, but make sure:

  • The filename has the suffix .yaml.
  • You replace {SUBDOMAIN} with your Stemma subdomain.

Steps

Step 1: Verify your application manifest file

Make sure that at least the following properties in the application manifest file match your Stemma endpoint:

  • features → slash_commands → url:
  • settings → event_subscriptions → request_url
  • settings → interactivity → request_url

Step 2: Create and install your Stemma Slack application

Create a new application using the application manifest file.

a. Navigate to: https://api.slack.com/apps, and click Create an App.

b. You will be prompted to select a method to create a Slack Application. Select From an app manifest, and choose the workspace in which you want to install the Stemma Slack application.1088

c. Paste in your manifest file. Make sure you select the YAML format, and paste in the contents of the file (see Prerequisite above).1088

d. Click Next and verify the scopes and events the Stemma Bot is requesting. Finally, click Create to install the application to your workspace.

Step 3: Provide Signing Secret and Bot Token to Stemma

To secure the communication between Slack and Stemma, you need to provide Stemma with the Signing Secret and Bot User OAuth Token of your newly installed Stemma Slack application:

  • Signing Secret:
    • You can find the Signing Secret on the Basic Information page, in the App Credentials section.

2144

  • Bot User OAuth Token:
    • You can find this token on the OAuth & Permissions page, in the OAuth Tokens for Your Workspace section.

2364

Now provide the Signing Secret and Bot User OAuth Token to Stemma:

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Slack as the Connection type, provide a Connection Name, and then in the dialog box enter the Bot Token and Signing Secret.
  2. When you are finished, click Save Connection.

Step 4: Receive confirmation from Stemma

Stemma will use the information you provide to integrate your Slack workspace with your Stemma Instance. Once you get the confirmation, you should be able to type /Stemma help in Slack to see usage help for the Stemma Slack application.

Step 5: Make sure to add an app icon 😊

Under the Stemma (Slack Application) basic information Tab, add the Stemma icon in the “Display Information” section. Users will be much more likely to use the bot if it’s easy to identify.1024

(Nice to have) Please invite Stemma team member(s) to a shared test channel

At this point, you have the Stemma Slack application installed within your Slack workspace. In order to verify if things are working just fine with no errors, you can create a temporary shared channel and invite a couple of members from the Stemma team to that channel.

For information about using Stemma with Slack, see Using the Stemma Bot in Slack.

Planning Changes to Tables or Columns

When you are planning to make a backward-incompatible change, you need to proceed in a way that minimizes the impact and gives downstream stakeholders as much warning as possible. Examples of backward-incompatible changes are:

  • Changing a schema (type changed, name changed)
  • Changing values (data fields will no longer be populated, or will be populated differently)

You need to understand and communicate the impact before you make such a change. This is usually done in the following steps:

  1. Understand the impact: understand who uses or queries this data.
  2. Provide heads-up notification: notify downstream owners and users that this data is going to change, and that they need to update their assets (downstream tables or dashboards) accordingly. Provide a time window after which the change will go into effect.
  3. (ideally): Understand the impact again: see if downstream users have changed their assets; how many still need to be nudged, etc.?
  4. Make the actual change and follow up to make sure that it’s complete.

Stemma can help you make the change as painless as possible. Use the UI to assess the effect on downstream users and on other assets. The following sections provide help:

If you are deprecating a table:

  • Change the description to warn users about the impending change.
  • Change the status toDeprecated.

Now warn all owners and frequent users about the change, and give them a deadline to make their adjustments.

Note

If you decide to change the name of a table to indicate that it’s deprecated (e.g., to something like deprecated_tablename), that will cause it, by default, to be removed from Stemma after ten days.

Snowflake Integration

What we need from you

Stemma needs certain information and credentials to extract Snowflake metadata into the catalog. To provide these, proceed as follows.

  1. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
  1. In the pop-up window, choose Snowflake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows:
    • Credentials
      We support authentication with a username and either a password or a private key.
      • Username and Password: You need to provide a username and password for Stemma to use to access the INFORMATION_SCHEMA as described above.
      • Username and Private Key: If you set up authentication via key pairs you will need to provide Stemma with a private key to use when accessing INFORMATION_SCHEMA as described above. If your private key is encrypted, you will also need to provide Stemma with the passphrase.
    • Snowflake Account Name: If you use the snowflake web interface, the URL you use should look something like abc.snowflakecomputing.com. Everything that appears before Snowflake computing is your account name.
    • Warehouse: The name of the warehouse (also called Virtual Warehouse) Stemma will connect to.
    • Role: The role Stemma will use once connected. If Stemma is extracting query logs, this role will need to have MONITOR access on the warehouse.
  2. When you are finished, click Save Connection.

Access

The Stemma Snowflake integration requires read access to all tables that will be included in the catalog, plus query logs.

Metadata

The specific tables and views we access include:

  • TABLES
  • COLUMNS

Due to the permissions model of Snowflake, these tables are only populated with data to which the role has SELECT access. Metadata will be extracted one database at a time since each Snowflake database has its own INFORMATION_SCHEMA.

Query logs

Stemma extract logs from the QUERY_HISTORY table function in the INFORMATION_SCHEMA. Logs are used to inform lineage, usage, common usage patterns and more. Due to the permissions model of Snowflake, this table is only populated with queries executed in virtual warehouses that the role has MONITOR access on.

Access History

Stemma uses ACCESS_HISTORY to enrich lineage derived from query logs. To grant Stemma access, Stemma’s user role will need to have the IMPORTED PRIVILEGES:

grant imported privileges on database snowflake to role stemma_role;

If Stemma cannot read Access History, lineage will still work however accuracy will be reduced.

Note: If you try giving the Stemma role access only to the required view, without giving access to the SNOWFLAKE database and ACCOUNT_USAGE schema before that, Snowflake will probably reject your request with the “Grant not executed: Insufficient privileges.” error.

Using the External GraphQL API

Creating and Installing an API Key for the External Graphql API

1. Create an API key in the Stemma UI:

a. Navigate to the Admin panel and choose Add API Key:

432

b. In the pop-up window, enter a name for your API key and click Submit. You can choose any name, but Stemma recommends using a name that describes the intended use of the key (e.g., github action integration):432

Stemma generates an API key that appears in the window. Save this key somewhere secure. It is not recoverable.

2. Install the Altair Graphql Client as one of these:

3. Add your API key to Altair as an Authorization header.

864

It must be in the form Bearer <api key> with a space between the word Bearerand your API key.

4. Enter your URL and HTTP method:

  • URL = /graphql
    E.g., https://preview.stemma.ai/graphql
  • HTTP method = POST

Now click Docs in the upper right of the Altair UI; you should see the graphql schema on the right hand side:864

Tableau Integration

Usage information

The Stemma Tableau integration will sum view counts for all workbooks in a particular dashboard over all time.

Installation Instructions

This document walks through the steps required to create a service account in your Tableau, which will be connected to your Stemma Instance for metadata extraction.

Stemma uses the Tableau Metadata API (GraphQL) to extract the metadata of projects, workbooks, dashboards, and data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Tableau.
Read Tableau’s docs: https://help.tableau.com/current/api/metadata_api/en-us/index.html

Connection Type:

We support both Online and Extract Connections with no differences.

Steps

Step 1: Decide your authentication strategy

The Tableau Metadata API is authenticated using a Personal Access Token (PAT) attached to a specific user within your organization. You will always share with us the PAT, not the username/password of the user itself. The PAT has the same permissions as the user to which it is attached.

Most users simply will simply create a PAT attached to their admin account. This is the easiest to set up and simplest to manage longer term, and we recommend it. For this option, skip to Step 4.

However, because this means the PAT has write access to the instance and Stemma doesn’t use write access at all, this doesn’t strictly follow principle of least privilege. In cases where restricting such access is critical, we recommend creating a read-only service account, and then you’ll generate a PAT for that limited account. Continue to Step 2 for this option.

Step 2: Create a new Viewer user

If you decide to create a new user for this integration (otherwise skip to Step 4 to use your current user to generate the Personal Access Token), navigate to the Users menu from the side navigation.

Click on “Add Users” and select “Add Users by Email“. A popup will appear with a Form to create new Users.

Select the “Tableau” radio button, and write down the email address you want. You will not share this email address with Stemma, and use this to sign in to the Tableau instance in order to create a PAT.

Email Tip: If you use Google email, [your_current_user][email protected][your-domain] will allow you to receive emails on your current email address and satisfy the uniqueness requirement.

Select “Viewer” as the Site role, which will give this user minimum access to the Tableau, and click “Add Users” to finalize the user creation.

Step 3: Set up permissions for the user

The next step is to make sure the user has permission to read your externally connected data sources. This will be used to link the workbooks with the tables in Stemma. By default, a “Viewer” user will not have access to any data source.

Click on the “External Assets” from the side navigation, click on “Select All“. A new dropdown will appear. Click on the “Actions” dropdown and select “Permissions“.

A Permissions dialog will appear under the “Permission Rules” section. By default, you will be on the “Database” tab.

Click “+ Add Group/User Rule” and select the user you created in Step 2 of this document, set the Template “View” and Save the rule.

Click on the “Tables” tab, and repeat the steps to add the rule for the user with the “View” template.

Stemma uses information about the upstream data sources in order to automatically create data lineage from your dashboard to your upstream table. In order to enable this, Stemma needs access to read the data source metadata. To enable the Stemma user permissions required by the Tableau Metadata API, you will need to provide “View” and “Connect“” access to each data source and workbook that Stemma will read from.

Step 4: Create a Personal Access Token

Log in to your Tableau instance with the user you want to use for Stemma integration. If you created a new user for this integration, make sure you log in as that user.

Select My Account Settings from the dropdown of your profile.

Scroll down to the “Personal Access Tokens” section, enter the token name “stemmaViewer” and click “Create new token”.

A new pop-up will appear with the personal access token and secret. Make sure you copy these somewhere safe as they will not be displayed again.

Step 5: Provide the details to Stemma

a. Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:

b. In the pop-up window, choose Tableau as the Connection type, provide a Connection Name, and then in the dialog box fill out the fields as follows:

  • Base URL
    Example: <https://sample.online.tableau.com>
  • Access Token Name (see Step 4).
  • Access Token Secret (see Step 4).
  • Tableau Site Name

c. When you are finished, click Save Connection.

Optional:

You can opt to exclude or include some items by providing the following to Stemma by email or Slack:

  • Name of Projects to Exclude: If you want to exclude any sample projects or test projects from Stemma, provide the names of those projects.
  • External Assets Types to Include (Connection type):This is uncommon, but if you want to include and link only certain types of external assets with your Tableau workbooks in Stemma, provide the names of those assets you want to include and link. Note that this may limit the user experience of Stemma, and is not recommended.

Connecting to AWS S3

Stemma supports two methods of transferring files:.

  1. Stemma creates an S3 bucket: you provide Stemma the ARN for a role/group that should have access to read to and write from the bucket. You then provide the role/group to the users or resources that need access to the bucket. Stemma recommends this method.
  2. You create the S3 bucket: Stemma provides you the ARN of the role that the Stemma app uses. You will need to add the following policy to the bucket to allow Stemma access. Make sure you replace <BUCKET_NAME> and <ARN_FROM_STEMMA> with the appropriate values.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "StemmaBucketAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<BUCKET_NAME>/*",
                "arn:aws:s3:::<BUCKET_NAME>"
            ],
            "Condition": {
                "ArnLike": {
                    "aws:PrincipalArn": "<ARN_FROM_STEMMA>"
                }
            }
        }
    ]
}