Using the Stemma Bot in Slack
Stemma’s goal is to capture information about data no matter where it happens. Slack is increasingly where conversations about data exist, and Stemma has a super simple integration that allows you to pin Slack conversations to tables.
Here’s how it works. If someone asks a great question, or posts a great answer about a piece of data in Slack, you can open a thread and tag the Stemma bot:
`@Stemma schema_name.table_name`
(Note: if schema_name.table_name is not unique in your warehouse, you can always use db_name.schema_name.table_name)
You can now go to the Table Usage page in the Stemma UI to see the linked conversations:
Where can you use the Slack bot?
You can use it in public or private channels. You just need to invite the bot into the channel. It cannot be used in direct messages.
Some other things to keep in mind about the integration:
You must respond in a thread
You need to give a precise definition. If a table is only present in one schema, “@Stemma table_name” will work just fine. But if it’s present in multiple schemas, you’ll need to use “@Stemma schema.table_name”
If you have any feedback or questions about the Slack integration, please reach out to [email protected]!
Getting Started
What is Stemma?
Stemma is a data catalog built for users of the modern data stack. It uses automation and workflow integration to help data owners get documentation into the catalog, while making it easy for end users to discover and trust data.
Tools for the Data Team to Document Data
Auto-generated metadata handles basic stewardship so no asset starts blank. Data that is automatically generated includes:
Column descriptions “autodescribed” via upstream column-level lineage
Integrating Your Applications with Stemma: What We Need from You
What we need to get you up and running depends on the applications you will be integrating with Stemma, but in general you will need to provide information and complete setup tasks for Slack, SSO, and your dashboard and storage applications.
Once you have integrated your applications into Stemma and everything is tested and working, you need to plan the rollout to your users.
Athena Integration
Stemma’s Athena integration supports extracting metadata (tables, columns, etc.).
Proceed as follows to create an Athena connection to Stemma.
Create a user
To integrate with Stemma we recommend creating a new user in IAM that has access specifically scoped to the actions and resources required. The following script enables the required access:
Terraform
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  # This script only creates IAM resources, which are global; you may
  # change the region, but it should not affect this script.
  region = "us-east-1"
}

resource "aws_iam_user" "stemma_read_user" {
  name = "stemma-read-user"
}

resource "aws_iam_user_policy" "stemma_read_policy" {
  name   = "stemma-read-policy"
  user   = aws_iam_user.stemma_read_user.name
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "athena:GetDatabase",
          "athena:ListDataCatalogs",
          "athena:GetDataCatalog",
          "athena:ListDatabases",
          "athena:ListTableMetadata",
          "glue:GetDatabases",
          "glue:GetTables"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}
Generate credentials
After creating the user, generate an AWS Access Key and Secret for the user.
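You can do this from the IAM console or programmatically. As a minimal sketch, assuming you use boto3 and the stemma-read-user created in the Terraform above, the key can be generated like this (the secret is only returned once, so store it securely):

import boto3

# Create an access key for the IAM user created above.
iam = boto3.client("iam")
response = iam.create_access_key(UserName="stemma-read-user")

access_key = response["AccessKey"]
print("AWS Access Key ID:", access_key["AccessKeyId"])
# The secret access key is shown only once; record it securely now.
print("AWS Secret Access Key:", access_key["SecretAccessKey"])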
Provide credentials to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose AWS as the Connection type, provide a Connection Name, choose Access Key as the Authentication type, provide the AWS Access Key and AWS Access Key Secret you have just created, and specify the AWS Region:
When you are finished, click Save Connection.
Testing the access
Optionally, you can validate the access to the catalogs, databases, and tables that Stemma will have with the following code snippet; make sure to replace ACCESS_KEY and ACCESS_KEY_SECRET with the values for the Stemma Athena user:
import json

import boto3
from botocore.config import Config

client = boto3.client(
    "athena",
    aws_access_key_id="{ACCESS_KEY}",
    aws_secret_access_key="{ACCESS_KEY_SECRET}",
    config=Config(region_name="us-east-1"),
)

all_tables = []
for catalog in [cat['CatalogName'] for cat in client.list_data_catalogs()['DataCatalogsSummary']]:
    for database in [database["Name"] for database in client.list_databases(CatalogName=catalog)['DatabaseList']]:
        for table_name in [tbl['Name'] for tbl in client.list_table_metadata(CatalogName=catalog, DatabaseName=database)['TableMetadataList']]:
            all_tables.append(f'{catalog}.{database}.{table_name}')

print('All tables available:')
print(json.dumps(all_tables, indent=2))
On the Azure Active Directory page, click Enterprise applications.
Click New application on the top menu.
Click Create your own application.
Enter a name for your application; we suggest “Stemma Data Catalog“.
Select Register an application to integrate with Azure AD (App you’re developing).
Click Create. This should redirect you to Register an application.
Under Supported account types, select Accounts in this organizational directory only (Single tenant).
Under Redirect URI (optional), select Web. If you are running a hosted version of Stemma, the redirect URI should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name. In case of on-premise deployment, please consult your Stemma representative for this.
Click Register.
Create a client secret
To create a client secret, follow these steps:
Go to Azure Active Directory and click App registrations. (You must go to the top level in your Azure portal and then select Azure Active Directory; you can then select App registrations.)
Select your newly added application. Make a note of the Application ID.
Click Certificates & secrets on the left-hand side and click New client secret.
Provide a description and an expiration length that follows your security organization’s guidelines.
Click Add.
Copy the Value corresponding to the newly-created client secret and client secret expiration. This is the Relying Party OAuth Client Secret. This secret value is never displayed again after you leave this page.
Set up application permissions
You need to set up two application permissions in your Azure portal so that you can use all of the Azure AD synchronization options in Stemma:
Azure Active Directory Graph Directory.Read.All
Microsoft Graph Directory.Read.All
To set up permissions, do as follows:
Go to Azure Active Directory and click App registrations. (You must go to the top level in your Azure portal and then select Azure Active Directory; you can then select App registrations.)
Select your newly added application.
Click API permissions on the left-hand side and click Add a permission.
Click APIs my organization uses and click Windows Azure Active Directory.
Select the Azure Active Directory Graph > Directory.Read.All permission:
a. Click Application permissions.
b. Under Directory, click Directory.Read.All.
c. Click Add permissions.
Select the Microsoft Graph > Directory.Read.All permission:
a. Click Application permissions.
b. Under Request API Permission, click Microsoft Graph.
c. Under What type of permissions does your application require?, click Application permissions.
d. Under Directory, click Directory.Read.All.
e. Click Add permissions.
Click Grant admin consent for [your organization] and then click Yes. You should see a message saying that you’ve granted consent for each of your permissions.
What We Need From You
Provide Stemma with the following information by email or Slack:
Application ID
Client Secret
Discovery Document Endpoint (Well Known OpenID Configurations)
Tenant ID – The ID of the tenant you created the application in.
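For reference, the discovery document endpoint for a single-tenant Azure AD application typically takes the following form, with {tenant-id} replaced by your Tenant ID:

https://login.microsoftonline.com/{tenant-id}/v2.0/.well-known/openid-configuration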
BigQuery Integration
Stemma extracts table and column metadata including database, schema, table name, table description, column name, and column description from a BigQuery database.
What We Need From You
You will need to create a service account for reading metadata and grant it “BigQuery Metadata Viewer” access to all of your datasets. This can all be done via the BigQuery UI.
Next, collect the information Stemma needs, as follows:
Authorization/Access Scopes: Stemma uses the following scopes to extract metadata and usage information. — <https://www.googleapis.com/auth/bigquery.readonly> — <https://www.googleapis.com/auth/cloud-platform>
A JSON Key File: Proceed as follows to create the JSON key file for a service account.
In the Service account name field, enter a name. The Cloud Console fills in the Service account ID field based on this name.
In the Service account description field, enter a description; for example, `Service account for Stemma BigQuery Integration`.
Click Create and continue.
Click the Select a role field, and make sure to select the following:
— BigQuery Data Viewer (roles/bigquery.dataViewer): to extract the metadata, watermarks, and usage information. Popular tables will be generated using this information.
— BigQuery User (roles/bigquery.user): needed to map user information to tables, i.e., Frequent Users, Table Owners, etc.
— Private Logs Viewer (roles/logging.privateLogViewer): needed for enriching table usage.
Click Continue, and click Done to finish creating the service account.
Do not close your browser window. You will use it in the next step.
Create a service account Key:
In the Cloud Console, click the email address for the service account that you created.
Click Keys, click Add key, then click Create new key.
Click Create. A JSON key file is downloaded to your computer.
Click Close.
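Optionally, before providing the key to Stemma, you can sanity-check that the service account can list dataset and table metadata. This is a minimal sketch using the google-cloud-bigquery Python client; the key file path is a placeholder for wherever you saved the JSON key:

from google.cloud import bigquery

# Load credentials from the downloaded service account key file.
client = bigquery.Client.from_service_account_json("stemma-bigquery-key.json")

# List every table the service account can see, dataset by dataset.
for dataset in client.list_datasets():
    for table in client.list_tables(dataset):
        print(f"{table.project}.{table.dataset_id}.{table.table_id}")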
Provide the information to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose Google Cloud as the Connection type, provide a Connection Name, and paste in the Service Account Key from the JSON file you have just downloaded.
When you are finished, click Save Connection.
dbt integration
Stemma integrates with your dbt Cloud or Core projects and is able to ingest and keep up to date all of the metadata in the table below.
Integrating with dbt Cloud requires minimal information and access to your organization’s dbt Cloud account, as well as administrative access to Stemma.
The following 3 properties are needed from dbt Cloud:
Account ID
Usually visible within the browser URL on the account settings page: https://cloud.getdbt.com/next/settings/accounts/<account-id-number>, or within any dbt Cloud project: https://cloud.getdbt.com/next/deploy/<account-id-number>/projects/<project-id>.
API Token
For organizations on the Team Plan, the account-wide permissions required are Metadata and Read-Only.
For organizations on the Enterprise Plan, the necessary permission set is Account Viewer and Job Viewer for all projects that should have metadata ingestion.
Host
Unless organization-specific customization has taken place, the host is likely cloud.getdbt.com.
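Optionally, you can sanity-check the token and Account ID before adding them to Stemma. The following sketch assumes the dbt Cloud Administrative API v2 and uses placeholder values; replace them with the host, Account ID, and token you collected:

import requests

DBT_CLOUD_HOST = "cloud.getdbt.com"   # your dbt Cloud host
ACCOUNT_ID = "12345"                  # your Account ID (placeholder)
API_TOKEN = "<your-service-token>"    # the token you generated

# Fetch account details to confirm the token and Account ID are valid.
response = requests.get(
    f"https://{DBT_CLOUD_HOST}/api/v2/accounts/{ACCOUNT_ID}/",
    headers={"Authorization": f"Token {API_TOKEN}"},
)
response.raise_for_status()
print(response.json()["data"]["name"])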
With these credentials in hand, all that’s left is adding the connection information to Stemma:
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection.
In the pop-up window, choose dbt as the Connection type, provide a Connection Name, and fill in the API Token, Host and Account ID collected from dbt Cloud.
When you are finished, click Save Connection
dbt Cloud integration may take up to 24 hours for initial ingestion.
The S3 prefix path will be “dbt/{date}”, where “{date}” is the current day’s date in UTC in the format “YYYY-MM-DD”. For example, files delivered on April 1, 2022 would appear under the “dbt/2022-04-01” prefix.
Proceed as follows to create a Delta Lake connection to Stemma.
Step 1: Generate a Personal Access Token
Follow these instructions to generate the Delta Lake personal access token that Stemma will need to extract your data into the catalog.
Step 2: Provide the Personal Access Token and Hostname to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose Delta Lake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows
Provide the Personal Access Token you created in Step 1
Provide the fully-qualified domain name of the Delta Lake host; for example mydlhost.mydomain.com.
When you are finished, click Save Connection
Google OIDC Integration
This document walks you through the steps to create and configure a Google OIDC application that Stemma can use to authenticate users.
Note: You must be a Google administrator to do this.
Steps
Log in to Google Developer Console
On the left, select Credentials
Select + CREATE CREDENTIALS at the top
Select “OAuth Client ID”
Enter the following information:
Application type: Web application
Name: You can choose the name; we suggest “Stemma OIDC” as a descriptive example
Download the Client ID and Client Secret JSON file and provide them to Stemma by email or Slack.
Hive Integration
What we need from you
Stemma needs certain information and credentials to extract Hive metadata into the catalog. Contact Stemma by email or Slack, and provide the following:
Metastore Host: The hostname or IP address of the Hive metastore server to which you are connecting.
Username and Password: You will need to provide a username and password for Stemma to use to access the Hive schema.
Metastore Port: Server port used for accessing metadata about Hive tables and partitions. The default Hive metastore port is 9083.
List of Databases: Stemma whitelists databases, and so you will need to provide a list of the databases we will be importing.
Metadata extracted
The metadata Stemma extracts includes:
TBLS – stores basic information about Hive tables, views, and index tables.
DBS – stores the basic information of all databases in Hive.
PARTITION_KEYS – the field information of the table storage partition.
TABLE_PARAMS – stores the attribute information of the table/view.
SDS – saves the basic information of file storage, such as INPUT_FORMAT, OUTPUT_FORMAT, whether it is compressed or not.
COLUMNS_V2 – stores the field information corresponding to the table.
PARTITIONS – stores the basic information of table partitions.
JIRA Integration
This document walks you through the steps you need to take to integrate Stemma with your JIRA installation. You must be a JIRA Admin to do this.
Steps
Step 1 – Collect information about your JIRA environment
Collect the following information to provide to Stemma (see Step 3):
The URL for your JIRA environment (e.g., <https://jira.net>)
The project ID for the JIRA project you would like Stemma to create tickets in. See this page for guidance from Atlassian on how to obtain the project ID for a JIRA project.
If using JIRA Server or Cloud, the ID for the JIRA Bug issue type
Note: Skip this step if using JIRA on-premise
See this page for guidance from Atlassian on how to obtain the ID for the Bug issue type
Optional: The name of one or more labels you would like set on tickets created by Stemma
Step 2 – Create a service account
Stemma will need an email and API token to authenticate to your JIRA environment. Our recommendation is to create a JIRA service account so that the authentication is not tied to an actual user’s account.
a. Go to Administration > User management > Users
b. Click Create User and fill out the required information
c. Create an API token from the service account
For JIRA on-premise, API tokens are not currently supported, so you will provide the service account’s password instead
d. Grant permission to the service account
The service account will minimally need access to create and view tickets in the JIRA project you intend to use
You can either grant permissions directly to the service account or you can grant the service account a role
Stemma needs:
The email address of the service account
The API token or password
See Step 3.
Step 3 – Provide the information to Stemma
Provide the information from Steps 1 and 2 to Stemma by email or Slack.
Linking JIRA Issues to Stemma
To link a JIRA issue to a table in the Stemma catalog, you need to include a Table Key tag in the issue description. The tag format is Table Key: {SOURCE}://{DATABASE}.{SCHEMA}/{TABLE} [PLEASE DO NOT REMOVE].
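For example, a ticket about a hypothetical orders table in a Snowflake warehouse might include the following line in its description (the source, database, schema, and table names here are placeholders):

Table Key: snowflake://analytics.marketing/orders [PLEASE DO NOT REMOVE]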
Looker Integration
This document walks through the steps required to create a service account in your Looker instance, which will be connected to your Stemma Instance for dashboard metadata extraction.
Stemma uses the Looker APIs (via the Looker Python SDK) to extract the metadata of Folders, Dashboards, Dashboard Elements, and Explore Queries, and the data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Looker. Read Looker’s docs: https://docs.looker.com/reference/api-and-integration
Stemma leverages API3 keys (Client ID & Client Secret) to make a secure connection between your Stemma and Looker instances. The steps below walk you through creating new API3 keys for this integration.
What is extracted?
Stemma extracts information about Folders, Dashboards, Dashboard Elements, Explore Queries, and the data sources associated with them.
Stemma will not extract Dashboards that are deleted, hidden or in a user’s personal folder.
Steps
Step 1: Make sure you have a Looker Admin role
Looker → Stemma integration requires an Admin role to fetch the correct data and build the relationships between the entities like Table Lineage, Dashboard Owners, etc.
So, as a first step, make sure you have an Admin role.
Step 2: Generate API3 keys
The next step is to create an API3 Key. From the Admin menu, navigate to the Users page.
Find the user you want to use to generate the API3 key, and click the “Edit” button under the Actions column.
Once you are on the “Edit User” page, find the API3 Keys settings, and click “Edit Keys“
Click “New API3 Key” to generate a new Client ID and Client Secret.
Step 3: Provide the details to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Looker as the Connection type, provide a Connection Name, and then fill out the fields in the dialog box as follows:
Base URL Example: <https://stemma.cloud.looker.com/>
The Stemma Mode integration accesses the Mode Discovery API. This API is distinct from the regular Mode API: it is much more performant but only contains a subset of the available endpoints. This means Stemma is able to access only metadata about reports, not the report runs/results. Stemma also accesses the regular REST API to augment the information retrieved from the Discovery API. In all, there are four values required by Stemma:
Your organization name in Mode
The access token to access the REST API
The password token to access the REST API
A signature token to access the Discovery API
Optionally, if you would like Stemma to only extract metadata from specific Mode spaces you may also provide Stemma a list of the spaces to include.
The access token and password token allow Stemma to augment the information within the Discovery API with the regular REST API to unlock two additional capabilities:
The ability to link your dashboards to the underlying tables in your warehouse which are used to serve the dashboards.
Stemma will use the general Mode API for data sources to retrieve your data source information, to find the name of the database (e.g., PUBLIC, mktg) as well as the type of data source (e.g., Snowflake, Redshift) in order to apply the proper SQL parsing semantics.
The ability to rank your dashboards by the number of times the dashboard has been viewed in the past 90 days.
Stemma will use the general Mode reports API to retrieve an accurate view count.
Creating the Signature Token
To authenticate with the Mode Discovery API, a signature token must be created. The signature token can be created by making an HTTP request (for example with curl) using the account’s REST API access token and password secret; a sketch is shown below. Make sure to replace {organization}, {your-access-token} and {your-access-secret} with the appropriate values (organization can be found in the URL when you visit Mode, e.g. https://app.mode.com/<organization>/spaces). Visit the Mode documentation for information on creating the api-key and api-secret.
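The following is a minimal sketch using Python’s requests library rather than curl. It assumes Mode’s signature-token endpoint (POST /api/{organization}/signature_tokens) and request body as publicly documented by Mode; verify both against Mode’s current documentation before relying on it:

import requests

ORGANIZATION = "{organization}"          # found in your Mode URL
ACCESS_TOKEN = "{your-access-token}"     # REST API access token
ACCESS_SECRET = "{your-access-secret}"   # REST API password secret

# Request a signature token scoped to the Discovery API (endpoint and body
# are assumptions based on Mode's public documentation; verify before use).
response = requests.post(
    f"https://app.mode.com/api/{ORGANIZATION}/signature_tokens",
    auth=(ACCESS_TOKEN, ACCESS_SECRET),
    json={"signature_token": {"name": "stemma", "auth_scope": "discovery_api"}},
)
response.raise_for_status()
print(response.json())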
Please note, if this does not work it likely means you do not have permissions. Please contact Mode support for help creating a read-only signature token.
APIs used
The APIs Stemma uses are listed in the table below. Data Extracted is the data we actually store from the API (and may change as we add functionality). Risks highlights any potentially sensitive information in the API (whether or not we actually store it).
Please note that the “Report Stats” endpoint is not enabled by default in Mode, so functionality that depends on it may not be available until the endpoint is enabled.
Low: contains access to information about the data source. Stemma should separately have access to this information, but this API access is required to link a Mode dashboard to a specific database.
Notes
Here are some additional notes about possibly sensitive information we may receive from Mode:
Dashboard execution times are non-sensitive on their own, although it is possible to infer NPMI from who is running which dashboard queries at precise times. Stemma only displays aggregated forms of this information, but stores it disaggregated.
Query names may contain NPMI (e.g. a query titled “EMEA Expansion Potential”, when the company hasn’t announced its intention to expand into EMEA).
Raw SQL queries may contain sensitive information (PII), typically in the form of filter clauses. Most security policies disallow this, especially in Open Collections; however, it is a possibility. By default, these SQL statements will be shown to users in Stemma.
Configuring Your Stemma Connection
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, optionally add a name for the connection, and choose Mode from the Connection Type drop-down menu.
This document walks you through the steps to create and configure an Okta application that Stemma can use to authenticate users.
Note: You must be an Okta administrator to do this.
Steps
Log into Okta admin portal
In the left frame, select Applications :
Select Create App Integration:
Select the options for OIDC – OpenID Connect and Web Application.
In the New Web App Integration, enter the following values:
App Integration Name: Your app name; we suggest Stemma Data Catalog
Grant Type: select the following:
Authorization code
Client Credentials (Client acting on behalf of itself)
Refresh token
Sign-in redirect URI:
Use the fully qualified hostname and add the endpoint:
/auth
If you are running a hosted version of Stemma, this should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name
Sign-out redirect URI This should be the base Stemma URL. Example:
https://<customer>.stemma.ai
Trusted Origin:
If you are self-hosting Okta, provide the Base URI for your Okta Sign domain, otherwise, leave this blank
Assignments:
Select the assignment access you would like within your organization
Stemma will require access to certain APIs to retrieve user information from your organization’s directory.
Navigate to Okta API Scopes:
Make sure to grant permission for the following scopes:
okta.users.read
okta.users.read.self
Provide the following values to Stemma by email or Slack so that authentication can be built into your Stemma deployment:
Okta Discovery endpoint (well-known configs)
Client ID
Client Secret
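For reference, the Okta discovery endpoint for the org authorization server typically takes the following form, with <your-okta-domain> replaced by your Okta domain:

https://<your-okta-domain>/.well-known/openid-configuration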
Redshift Integration
What we need from you
Stemma needs certain information and credentials to extract Redshift metadata into the catalog. To provide these, proceed as follows.
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Redshift as the Connection type and then provide the following information in the dialog box:
A username and password for Stemma to use to access the INFORMATION_SCHEMA
The hostname of the Amazon Redshift cluster. You can find it in the Properties tab of your Redshift Cluster Details page. Under the Connection details section, copy the Endpoint up to but not including the colon.
A port number. The default port number for Redshift is 5439. From your Redshift Cluster Details page, you can find the port number under the Database configurations section of the Properties tab.
A list of databases: Stemma whitelists databases, so you will need to provide a list of the databases we will be importing.
Access
The Stemma Redshift integration requires access to two pieces of information within the Redshift install:
information_schema metadata
Query logs
information_schema
The specific tables and views we access include:
PG_GET_LATE_BINDING_VIEW_COLS
svv_external_columns
COLUMNS
Because of the way permissions work in Redshift, these tables are only populated with data to which the user has SELECT access. If you don’t want to use column stats, and therefore want to restrict access for this credential, Stemma will give you the SQL code required to create a materialized view. You create that materialized view as a superuser, and grant our credential the ability to read and refresh that materialized view.
Query logs
Stemma Redshift requires query logs. We can extract these in two ways: push or pull.
Push
The recommended integration involves configuring Redshift to dump user activity logs into S3.
There are multiple methods to achieve this, either using CloudTrail or directly from Redshift. The right approach differs between organizations (many already have audit logging configured for other purposes).
Pull
Stemma can extract logs from STL_QUERY. However, because this option requires the credentials to belong to a superuser, it is not the recommended integration method.
Column stats – Optional
The Stemma integration user must have full SELECT access to any tables or views for which you want to see column stats.
Slack Integration
This document walks through the steps required to install a Stemma Slack application in your workspace.
Prerequisite
Stemma does not publish the Stemma Slack application to the Slack Marketplace. To install this application for your workspace, you need to create a new application using an application manifest file.
If you have an application manifest file, make sure you have it handy before you begin. If you don’t have a manifest file, you can copy and paste the following into a file:
_metadata:
  major_version: 1
  minor_version: 1
display_information:
  name: Stemma
  description: Help you enrich your metadata directly via Slack
  background_color: "#0a1930"
features:
  bot_user:
    display_name: Stemma
    always_online: true
  slash_commands:
    - command: /stemma
      url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/slash/help
      ## Replace {SUBDOMAIN} with your Stemma subdomain.
      description: Learn how StemmaBot works
      usage_hint: help
      should_escape: false
oauth_config:
  scopes:
    bot:
      - app_mentions:read
      - channels:history
      - channels:read
      - chat:write
      - commands
      - groups:history
      - groups:read
      - links:read
      - links:write
      - users:read
      - users:read.email
settings:
  event_subscriptions:
    request_url: https://{SUBDOMAIN}.stemma.ai/stemma/slack/events
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
    bot_events:
      - app_mention
      - link_shared
      - member_joined_channel
      - message.channels
      - message.groups
      - team_join
  interactivity:
    is_enabled: true
    request_url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/interactivity
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
  org_deploy_enabled: false
  socket_mode_enabled: false
  token_rotation_enabled: false
Save the file with any prefix and in any location you like, but make sure:
The filename has the suffix .yaml.
You replace {SUBDOMAIN} with your Stemma subdomain.
Steps
Step 1: Verify your application manifest file
Make sure that at least the following properties in the application manifest file match your Stemma endpoint:
features → slash_commands → url:
settings → event_subscriptions → request_url
settings → interactivity → request_url
Step 2: Create and install your Stemma Slack application
Create a new application using the application manifest file.
a. Go to the Slack API Apps page (https://api.slack.com/apps) and click Create New App.
b. You will be prompted to select a method to create a Slack Application. Select From an app manifest, and choose the workspace in which you want to install the Stemma Slack application.
c. Paste in your manifest file. Make sure you select the YAML format, and paste in the contents of the file (see Prerequisite above).
d. Click Next and verify the scopes and events the Stemma Bot is requesting. Finally, click Create to install the application to your workspace.
Step 3: Provide Signing Secret and Bot Token to Stemma
To secure the communication between Slack and Stemma, you need to provide Stemma with the Signing Secret and Bot User OAuth Token of your newly installed Stemma Slack application:
Signing Secret:
You can find the Signing Secret on the Basic Information page, in the App Credentials section.
Bot User OAuth Token:
You can find this token on the OAuth & Permissions page, in the OAuth Tokens for Your Workspace section.
Now provide the Signing Secret and Bot User OAuth Token to Stemma:
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Slack as the Connection type, provide a Connection Name, and then in the dialog box enter the Bot Token and Signing Secret.
When you are finished, click Save Connection.
Step 4: Receive confirmation from Stemma
Stemma will use the information you provide to integrate your Slack workspace with your Stemma Instance. Once you get the confirmation, you should be able to type /stemma help in Slack to see usage help for the Stemma Slack application.
Step 5: Make sure to add an app icon 😊
Under the Stemma (Slack application) Basic Information tab, add the Stemma icon in the “Display Information” section. Users will be much more likely to use the bot if it’s easy to identify.
Right-click this icon, choose “Save as…”, and upload it to use as the Slack app icon.
(Nice to have) Please invite Stemma team member(s) to a shared test channel
At this point, you have the Stemma Slack application installed within your Slack workspace. To verify that everything is working correctly, you can create a temporary shared channel and invite a couple of members from the Stemma team to that channel.
Stemma needs certain information and credentials to extract Snowflake metadata into the catalog. To provide these, proceed as follows.
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Snowflake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows:
Credentials We support authentication with a username and either a password or a private key.
Username and Password: You need to provide a username and password for Stemma to use to access the INFORMATION_SCHEMA as described above.
Username and Private Key: If you set up authentication via key pairs you will need to provide Stemma with a private key to use when accessing INFORMATION_SCHEMA as described above. If your private key is encrypted, you will also need to provide Stemma with the passphrase.
Snowflake Account Name: If you use the Snowflake web interface, the URL you use should look something like abc.snowflakecomputing.com. Everything that appears before .snowflakecomputing.com is your account name (abc in this example).
Warehouse: The name of the warehouse (also called Virtual Warehouse) Stemma will connect to.
Role: The role Stemma will use once connected. If Stemma is extracting query logs, this role will need to have MONITOR access on the warehouse.
When you are finished, click Save Connection.
Access
The Stemma Snowflake integration requires read access to all tables that will be included in the catalog, plus query logs.
Metadata
The specific tables and views we access include:
TABLES
COLUMNS
Due to the permissions model of Snowflake, these tables are only populated with data to which the role has SELECT access. Metadata will be extracted one database at a time since each Snowflake database has its own INFORMATION_SCHEMA.
Query logs
Stemma extracts logs from the QUERY_HISTORY table function in the INFORMATION_SCHEMA. Logs are used to inform lineage, usage, common usage patterns and more. Due to the permissions model of Snowflake, this table is only populated with queries executed in virtual warehouses that the role has MONITOR access on.
Access History
Stemma uses ACCESS_HISTORY to enrich lineage derived from query logs. To grant Stemma access, Stemma’s role will need the IMPORTED PRIVILEGES grant:
grant imported privileges on database snowflake to role stemma_role;
If Stemma cannot read Access History, lineage will still work, but its accuracy will be reduced.
Note: If you try giving the Stemma role access only to the required view, without giving access to the SNOWFLAKE database and ACCOUNT_USAGE schema before that, Snowflake will probably reject your request with the “Grant not executed: Insufficient privileges.” error.
Tableau Integration
Usage information
The Stemma Tableau integration will sum view counts for all workbooks in a particular dashboard over all time.
Installation Instructions
This document walks through the steps required to create a service account in your Tableau, which will be connected to your Stemma Instance for metadata extraction.
Stemma uses the Tableau Metadata API (GraphQL) to extract the metadata of projects, workbooks, dashboards, and data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Tableau. Read Tableau’s docs: https://help.tableau.com/current/api/metadata_api/en-us/index.html
Connection Type:
We support both Online and Extract Connections with no differences.
Steps
Step 1: Decide your authentication strategy
The Tableau Metadata API is authenticated using a Personal Access Token (PAT) attached to a specific user within your organization. You will always share with us the PAT, not the username/password of the user itself. The PAT has the same permissions as the user to which it is attached.
Most users will simply create a PAT attached to their admin account. This is the easiest option to set up and the simplest to manage long term, and we recommend it. For this option, skip to Step 4.
However, because this means the PAT has write access to the instance and Stemma doesn’t use write access at all, this doesn’t strictly follow the principle of least privilege. In cases where restricting such access is critical, we recommend creating a read-only service account and then generating a PAT for that limited account. Continue to Step 2 for this option.
Step 2: Create a new Viewer user
If you decide to create a new user for this integration (otherwise skip to Step 4 to use your current user to generate the Personal Access Token), navigate to the Users menu from the side navigation.
Click “Add Users” and select “Add Users by Email“. A popup will appear with a form to create new users.
Select the “Tableau” radio button, and enter the email address you want to use. You will not share this email address with Stemma; you will use it to sign in to the Tableau instance in order to create a PAT.
Email tip: If you use Google email, plus-addressing (for example, your_current_user+stemma@your-domain) will let you receive emails at your current address while satisfying the uniqueness requirement.
Select “Viewer” as the Site role, which will give this user minimum access to Tableau, and click “Add Users” to finalize the user creation.
Step 3: Set up permissions for the user
The next step is to make sure the user has permission to read your externally connected data sources. This will be used to link the workbooks with the tables in Stemma. By default, a “Viewer” user will not have access to any data source.
Click “External Assets” in the side navigation, then click “Select All“. A new dropdown will appear. Click the “Actions” dropdown and select “Permissions“.
A Permissions dialog will appear under the “Permission Rules” section. By default, you will be on the “Database” tab.
Click “+ Add Group/User Rule” and select the user you created in Step 2 of this document, set the Template “View” and Save the rule.
Click on the “Tables” tab, and repeat the steps to add the rule for the user with the “View” template.
Stemma uses information about the upstream data sources in order to automatically create data lineage from your dashboard to your upstream table. In order to enable this, Stemma needs access to read the data source metadata. To enable the Stemma user permissions required by the Tableau Metadata API, you will need to provide “View” and “Connect“ access to each data source and workbook that Stemma will read from.
Step 4: Create a Personal Access Token
Log in to your Tableau instance with the user you want to use for Stemma integration. If you created a new user for this integration, make sure you log in as that user.
Select My Account Settings from the dropdown of your profile.
Scroll down to the “Personal Access Tokens” section, enter the token name “stemmaViewer” and click “Create new token”.
A new pop-up will appear with the personal access token and secret. Make sure you copy these somewhere safe as they will not be displayed again.
Step 5: Provide the details to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose Tableau as the Connection type, provide a Connection Name, and then in the dialog box fill out the fields as follows:
Base URL Example: <https://sample.online.tableau.com>
Optional: You can opt to exclude or include some items by providing the following to Stemma by email or Slack:
Name of Projects to Exclude: If you want to exclude any sample projects or test projects from Stemma, provide the names of those projects.
External Assets Types to Include (Connection type):This is uncommon, but if you want to include and link only certain types of external assets with your Tableau workbooks in Stemma, provide the names of those assets you want to include and link. Note that this may limit the user experience of Stemma, and is not recommended.
Connecting to AWS S3
Stemma supports two methods of transferring files:
Stemma creates an S3 bucket: you provide Stemma the ARN for a role/group that should have access to read from and write to the bucket. You then provide the role/group to the users or resources that need access to the bucket. Stemma recommends this method.
You create the S3 bucket: Stemma provides you the ARN of the role that the Stemma app uses. You will need to add the following policy to the bucket to allow Stemma access. Make sure you replace <BUCKET_NAME> and <ARN_FROM_STEMMA> with the appropriate values.
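As a minimal sketch of what such a bucket policy might look like, applied here with boto3: the exact set of actions Stemma needs should be confirmed with your Stemma representative, and the list/read/write object actions below are assumptions.

import json
import boto3

BUCKET_NAME = "<BUCKET_NAME>"           # the bucket you created for Stemma
STEMMA_ROLE_ARN = "<ARN_FROM_STEMMA>"   # the role ARN provided by Stemma

# Assumed policy: let the Stemma role list the bucket and read/write objects.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": STEMMA_ROLE_ARN},
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET_NAME}",
        },
        {
            "Effect": "Allow",
            "Principal": {"AWS": STEMMA_ROLE_ARN},
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET_NAME}/*",
        },
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET_NAME, Policy=json.dumps(policy))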
Every Stemma rollout is different, depending on your organization’s particular needs, your data storage and management applications, and the people who use them in various ways and with varying priorities. So the following are guidelines, not rules, but they are based on the accumulated experience of Stemma’s founders and engineers, and are a good place for you to start.
Criteria for Success
The two criteria that matter in measuring the success of a data catalog are
adoption; and
customer satisfaction
If your rollout achieves both of these to a high degree, you can declare victory in the knowledge that you have improved the lives of your users (in particular their productivity, efficiency, and awareness) and are contributing significantly to your organization’s success.
The following broad steps are designed to help you get there.
Steps for Rollout and Adoption
Step 1. Start with a persona and its use-cases
There are many user personas and use cases for a data catalog. Here’s a simplified view of the most common ones. It’s less important which persona you start with than that you choose a specific group of alpha users.
Step 2. Launch in phases
Phase 1: Identify a small set of tables to get alpha user feedback on.
This set can be your organization’s most commonly used tables (often referred to as “core” tables) or one domain within the company, such as marketing, growth, or finance. Core tables are often the best choice, because they have the most impact and there’s often a central data team responsible for maintaining them.
Phase 2: Populate MVP metadata on these tables.
Stemma can ingest the majority of this metadata automatically (see What Stemma Enables You To Do), leaving you and your users to curate only certain important items such as ownership, table descriptions, and table status. Where you must glean tribal knowledge, it may also help to hold a “docs jam session” with a group of data producers and consumers. (You might even offer a reward, such as a gift card, for those who enter the most documentation!)
Phase 3: Alpha launch to 5-20 alpha users.
From the persona you decided on earlier, choose highly vocal users and those with the most tribal knowledge. These users will become Stemma’s avid supporters when you launch to a broader audience. (Note: Make sure these alpha users have the access they need to do work in Stemma (see SSO integration under What We Need From You) and that Stemma user names and authorities map properly to users of integrated applications such as Snowflake.)
Incorporate feedback and iterate. Look particularly for feedback that indicates opportunities for productive metadata propagation and cross-referencing (for example, someone might say something like, “Oh, we already have this metadata in this spreadsheet — we should pull that in here, too.” )
Tasks your alpha users should become comfortable doing include:
Search – Search for tables and dashboards using a Page-Rank style search
Phase 4: Beta launch to all users of the prioritized persona.
Focus your beta launch on your chosen persona (data consumers, for example). Don’t blur the focus of your launch by opening it up to all personas. Those other personas will come on board at GA, after your alpha and beta users have thoroughly learned and exercised the product and can serve as its advocates and in-house experts.
Best practices for achieving the greatest adoption include:
Update Slack channel headers where people ask each other questions. Stemma’s Slack integration links these conversations to the catalog automatically.
Embed Stemma training in new hire training. Tagging data sets by domain (marketing, growth, etc.) can help new hires quickly onboard to their domains. Showcase Stemma as an entry point into your existing training, for example by having all technical new hires instrument a metric during onboarding.
Create linkages with other products. Create links between various data tools. For example, auto-populate a link between an Airflow DAG that populates a table and the table page in Stemma (and vice-versa). (See Sharing additional metadata). Another productive linkage is between the table page in the data catalog and a link to the code that is used to generate the table.
Showcase the catalog at a group or company meeting. Deliver a short five-minute demo at a meeting that includes your target users. Educate, answer questions, and thank your alpha users — this creates more awareness and provides an opportunity for everyone to learn.
Step 4. Measure success
1. Adoption
Track WAUs (Weekly Active Users).
Start with Weekly Active Users (WAUs) rather than Daily Active Users or Monthly Active Users. Common usage frequency is weekly, not daily or monthly.
Target an 80% penetration rate: a great penetration rate is 80% of WAUs within your target persona.
2. Customer Satisfaction (CSAT)
Measure CSAT periodically.
Send out a survey every three to six months asking users to rate their experience with Stemma.
Other metrics that organizations often consider include documentation quality, search quality, etc. But it’s best to stick to the above core metrics at the outset. As your Stemma implementation matures and ingests more metadata over time, you can begin to broaden your criteria.
Migrate from existing Amundsen
Migrating from Amundsen involves three main steps:
1 – Migrating existing metadata from Amundsen to Stemma
In this step, you take all the existing metadata from Amundsen and load it into Stemma. The metadata imported includes, but is not limited to:
Table and column descriptions
Tags
Ownership information
To import this metadata, upload a dump of your Amundsen neo4j metadata into Stemma:
Bring the Amundsen storage offline.
Use the dump and load utilities, available on all neo4j installations, to dump your data and upload it.
After import, your Stemma instance will be up to date with your Amundsen instance. However, Stemma has a significant number of additional features so the layout will appear different. Check out the quick start page here:
Multi-step column-level lineage on the Lineage tab of table details in Stemma
But your data won’t be periodically updated until you enable periodic updates; see the next step.
2 – Ingesting metadata continuously
You need to configure Stemma to ingest metadata updates on an ongoing basis. Once this configuration is done, you won’t need Python databuilder jobs to ingest data.
Information ingested in this step includes:
Table and column names
Linked issues (JIRA tickets)
Frequent users
Lineage, if it exists
In Stemma’s Admin interface, provide credentials to your data sources to allow Stemma access to extract metadata on an ongoing basis:
Stemma will now do the work to “upsert” metadata updates, as they occur, into the data you uploaded from Amundsen in the previous step.
3 – Cut Over
At this point, Amundsen and Stemma are running side-by-side, and Stemma recommends sharing access to Stemma with your power users and getting their feedback to ensure all of their use-cases are supported. Usually one week of overlap to obtain feedback and an additional week to incorporate feedback, if applicable, is sufficient.
When all looks good, simply redirect the URL of your internal Amundsen to the Stemma URL.
If you have any feedback or questions about migrating from Amundsen to Stemma, please reach out to [email protected]!
Messaging Users
Stemma provides a quick and easy way to send messages to specific groups of users on the basis of their relationship to a table (e.g. are they an owner in Stemma, have they queried the table in the prior three months, etc.). If you need to announce a future change, for example, or notify users when significant events occur, such as when a table has been deprecated, you can use Stemma messaging to ensure that the affected users will receive the notification.
You can compose a custom message to users and teams related to a table by clicking the Message button in the top-right corner of the Table detail page.
There are three types of notifications you can send in Stemma, depending on the receivers and your personal preferences. If a member of a related group is a user, they can receive either email notifications or Slack private messages. If a member of a related group is a team (Slack channel), they will receive a message in Slack.
Slack notifications
After you have completed our Slack Integration, you can send and receive messages from the Stemma bot.
Messages in channels
If a member of a related group is a team, they will receive a message from the Stemma bot in the corresponding Slack channel.
📘 Note
Make sure you add the bot to all channels that need to receive messages.
Slack private messages
You can choose to receive notifications from Stemma as Slack private messages. To receive messages from the bot, you need to add the Stemma application to your Slack configuration.
To modify your preferred notification method, contact your Stemma representative.
User groups
Each group contains a list of users and teams (Slack channels). In Stemma there are four different types of user groups related to a table:
Stemma will only show the options that apply to the given asset. For example, if a user is sending a custom message for a table that does not have any downstream lineage there will not be an option for Owners of derived tables.
Using Tableau with Stemma
Stemma works with Tableau to help answer common questions data analysts have, such as:
Stemma indexes Tableau dashboards along with their underlying tables, providing a one-stop search for all data assets. Stemma also tracks how many times a dashboard has been viewed, instantly highlighting which dashboards are most commonly used.
Is the right data being used?
Stemma automatically correlates your Tableau dashboards to the corresponding table asset by reconciling the table names extracted from the Tableau Metadata API with the existing table metadata Stemma has captured from your warehouse. This allows you to start your exploration by finding trustworthy reports and, in turn, using the lineage to discover which tables are useful to you.
You can also explore the dashboard’s lineage upstream to get an end-to-end view of where the data came from.
How can I use the metrics in this dashboard?
Tableau provides Stemma with the SQL that it creates to generate each sheet, and Stemma indexes this SQL so you can easily find the code. You can copy and paste the code into your IDE to jump-start the SQL exploration.
Best Practices for Data Ownership
👍 The three types of ownership that work well
It works well to break down ownership of the most important data into three distinct roles:
Delivery owner — Ensures that this particular data gets delivered on time, on a promised delivery SLA. Usually, this is a data engineer or analytics engineer responsible for developing and maintaining the pipeline that produces this data.
Domain owner — What does this particular value in a field (or column) mean? When does this particular event get triggered? Usually, this is a product engineer who created the event or the analysts and data scientists who use this data most frequently and understand the physical reality that the data represents.
Policy owner — Ensures that this data gets used according to the classification and terms associated with it. Sometimes you acquire data from a source that should not be used for a certain category of use-cases. For example, YouTube is permitted to show ads to kids, but not permitted to personalize them. Therefore, the personalization data can’t be used if the subject is a child. The person making these calls is usually not an engineer or data scientist, but someone on the Policy or Privacy team at the company.
👎 Types of ownership that often don’t work as well
Overall Quality owner: In practice it’s often hard to find people able and willing to take ownership of the end-to-end quality of a data set. This is because data engineers don’t consider themselves owners of the data that’s produced from the upstream application and don’t want to be responsible for hunting down a website bug that impacts the data in the warehouse. Product engineers, on the other hand, don’t have enough context about how data gets joined and transformed downstream to own the final derived data artifact. This may change as decentralized data management (or data mesh) is deployed more broadly.
Shared ownership: This often does not work in practice, though it can work well if the owner group shares a good understanding of the different types of ownership and the group can efficiently redirect each question to the right person.
🏥 The data engineer’s role
In practice, the data engineer plays the role that a triage nurse plays in an emergency room. When an issue arises (like a patient arriving in the ER), the data engineer triages to see what’s going on. Sometimes it’s a problem the engineer can fix, so they fix it and resolve the issue (like the nurse treating an injury on the spot). In other cases, the engineer redirects the issue to the appropriate owner (as in a referral to a doctor or other health practitioner). In short, the issue determines who the owner should be.
Using Stemma
Use Stemma to assign and document ownership. Divide ownership as follows:
Stemma’s advanced search helps you quickly narrow your criteria and find the data assets you want to see:
Search bar: The search bar allows you to enter free-form text to search across nearly every piece of metadata in your entire data ecosystem. This includes table names, descriptions, column names, column descriptions and more.
Filtering: You can use the following powerful filters, separately or together, to narrow your search to find the assets you need:
Resource filtering: allows you to choose what type of resource to return in the search results; options are Tables, Dashboards, and Users. By default, Tables and Dashboards are selected.
Asset status filtering: filters results to only include the relevant statuses. The options are Certified, No Status, Deprecated, and Intermediate. The default options are Certified and No Status.
Table filters: these apply only to the Tables results. If you have another resource type besides Tables selected in your resource filter, the values used in the table filters will not affect those other results. Table filters allow you to filter results by the data source, database, schema and table name. You can also filter by column; this will return any table that contains that column name.
Dashboard filters: these apply only to the Dashboard resource results. If you have another resource type besides Dashboards selected in your resource filter, the values used in the dashboard filters will not affect those other results. Dashboard filters can be used on the product, folder or dashboard name.
Table & Dashboard filters: These can be applied to both Tables and Dashboards resource types. You can filter both Tables and Dashboards by the tag that is associated with them.
Advanced filtering
By default, all filters look for an exact match for the value you provide. But the Table, Dashboard and Table & Dashboard filters allow for advanced wildcard searches.
To use a wildcard search, you can place one or more asterisks * in the filter box. Here are a few examples:
Filter to include only schemas that start with marketing_: marketing_*
Filter to include only schemas that end in _prod: *_prod
Filter to include only schemas that have _user_ in them: *_user_*
You can also combine multiple filters together!
For example, you might want to find all tables in the marketing schema that begin with marketing and end with raw, but only tables that include the column order_id.
Understanding the Results
Stemma’s advanced search naturally ranks your search results based on their actual usage. This means the more users run queries on a table the higher the rank that table is likely to have in your search results. In addition, the result list shows you additional reasons why your assets were returned.
This example shows the search results for the search term “orders”:
Points to note:
Any words in the table or dashboard description that match the search term are highlighted
Any column names that have an exact match or a similar match are listed to the right
Any column description that matches the search term is displayed on the right
Assigning Ownership of Tables
You can assign ownership of a single table to individuals or teams. You can also assign owners to a set of tables in bulk, adding all the owners in a single transaction.
Note: You can assign owners to tables in Stemma, but not to dashboards. Stemma ingests ownership of dashboards from the underlying BI tool(s) and you need to make updates and modifications in those tools.
The importance of ownership
Ownership is one of the fundamental pillars of a data catalog. It assigns responsibility to individual users and teams, adding tremendous value to the data in the catalog, including:
Delegation of duties: Which users are responsible for keeping this table properly documented?
Compliance with processes and procedures: Has the right team updated the documentation for this table before the newest pull request is merged?
Reporting and analytics: Which teams have the highest percentage of their assets documented? The lowest?
You can assign ownership of a table either to individuals or teams:
Individuals are the people in your company who use Stemma. You can find individuals by their email address or a Slack handle.
Teams are represented in Stemma by Slack channels.
To assign one or more owners to an individual table, choose the table in the Stemma UI and click on Add Owner at the bottom of the left panel.
But Stemma also allows you to bulk-assign table ownership across the catalog. You can do this in a single transaction, immediately allowing your users to identify the assets that belong to them.
Assigning owners in bulk
You may think about ownership as a hierarchy, your teams may typically own all of the tables within a single database or schema, or perhaps your analysts use dbt to create their own workflows and create tables with specific naming conventions (e.g., marketing_sales, marketing_prospects). Regardless of how your company groups its assets, Stemma can help you bulk-assign owners: use advanced search to find the assets you want to update, then assign the owners with a single command. To assign ownership of tables in bulk, proceed as follows:
From the home page of the Stemma UI, navigate to Advanced Search.
At the top of the left panel, make sure only the Tables resource is checked (you can’t bulk-assign owners to dashboards).
Optionally use the filters in the left panel to limit the search results to the set of tables to which you want to bulk assign ownership:
Stemma suggests you limit your search results using only the filters on the left; while you can also use the search bar at the top, doing so makes it more likely that tables you don’t want included will slip into the results.
Click on Bulk Edit at the bottom right of the page and choose Add Owners.
In the dialog box, enter or select an email address or Slack channel and choose Add.
To add more owners, repeat the previous step for each new owner.
When you are finished, click Add Owners and then Done.
Stemma automatically pulls in the teams you’ve defined as well as the list of users in your organization:
Note: Assigning ownership in bulk does not trigger automatic messages to the owners and users; that only happens when you assign ownership individually.
Displaying Data Lineage in the Stemma UI
Navigating the Lineage Graph
Column filtering
From the root node, search for columns of interest using the search bar in the right hand node panel
Click on columns that have a filter icon to prune the graph to only column-level relationships for the selected column
To examine columns in a related node, select that node
Below the search bar in the node panel you can select between “Filtered” and “All columns” to view the appropriate list of columns for the selected node
To filter by a different column, simply select the root node and then select any other column that has a filter icon
To remove the filter, simply select the root node and then re-select the filtered column
Focus mode
Select any node in the graph to initiate focus mode and make it easier to see tables and dashboards related to the selected node and the root node
When a collapsed node is part of the graph, it will automatically show the relevant table or dashboard with an expand icon
Click the expand icon to break the table or dashboard out of its collapsed group
Dashboards
When the selected node is a dashboard, the right-hand node panel will show charts and queries instead of columns
If the dashboard is the root of the graph, any chart with lineage will display the filter icon, which you can click to show only upstream data sources for that chart
Finding Tabular Lineage for Columns
End-to-end lineage
Navigate to the details page for the table
Click on the Lineage tab near the top of the page
In the Filter by column dropdown menu near the top of the Upstream or Downstream section, select the column with lineage that you are interested in
The table will now show related tables and columns alongside their distance from the current table
Simple
Navigate to the details page for the table.
Click on the About tab near the top of the page.
Click on the up-arrow next to the column whose lineage you want to see
The row will expand to show column details, including the number of immediate upstream and downstream relatives of the column
Click the upstream or downstream links to see a pop-up table of the related columns
Evaluate data – For example, if you are not seeing the results you expected from a given table or dashboard, the problem may be one of timing: use the UI to check all upstream tables and make sure they ran when they should. To make this even easier, you can filter the graph to trace a specific column’s lineage.
Plan changes: The interconnected nature of data products, with multiple products often chained together to build upon each other, means that even the smallest change in a pipeline can ripple downstream and disrupt your business. These small but potentially disruptive changes are frequent and necessary:
An analytics engineer builds a table using dbt; that table becomes popular and several dashboards are built on top of it. It becomes well trusted, and its logic becomes a standard for calculating certain metrics. But the table is missing several important entity IDs (foreign key columns), which makes it difficult to join to other tables. The analytics engineer and a data engineer work together to complete a “micro migration” to a new, more performant and reusable table.
A machine-learning model is used to predict churn for a customer. Each day after the model runs, users that are predicted to churn are identified in a table in Snowflake and then sent an email to re-engage them. The product team is launching the second major product and wants to measure churn on a per-product basis. The engineering team updates their data pipeline to add a new column to the output table that identifies the product that churn is being predicted for.
An analyst created several views to consolidate the common logic that the team was using to query tables in their dashboards. But the performance of the views has deteriorated as the amount of data has grown. The analyst and the engineer work together to create materialized versions and to update all of the dashboards using the prior views.
Use Data Lineage in Stemma to Identify Affected Assets
The challenge with such changes is that it is hard to accurately curate a list of all assets that will be affected. If you cannot be certain of what will need to be changed or what could break, you cannot do risk analysis or even make a time estimate. Stemma enables you to do the analysis and create the appropriate messaging.
The key question for this type of change is “What tables and dashboards will I need to update if I rename this view?”
Stemma provides a lineage graph for each table and dashboard. From here you can see who and what is utilizing your resources downstream, and how popular your assets are. Data owners can use this in conjunction with their own knowledge to build roadmaps and plan how data should evolve.
Use Stemma Messaging to Communicate Expected Changes
Everyone affected needs to know what is going to change and when. After building your list of affected assets and their users, use Stemma messaging to provide advance notice and instructions.
How Stemma Discovers Data Lineage
When you execute SQL against your warehouse, the database captures the raw SQL and saves it in a historical format that includes the time the SQL command was executed and the user who executed it. Stemma reads these SQL logs and parses all of the commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page.
In the case of table and column lineage, any change to a table (via an insert, update, or delete from another table) will create the appropriate table and column lineage tracking.
Note
When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.
Along with a direct connection to various data sources, Stemma can ingest data using extracts in various file formats. One of the most commonly used file formats is CSV.
Below are some of the CSV formats Stemma ingests today (along with sample files), with the column names and values we expect in each file. Note that you need the column header literals (for example, db_name) as well as the values.
Github Links
Used to provide a link to a Github source from Stemma’s Table detail page.
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
source: the Github URL for the repository.
source_type: static value, “github”.
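For illustration, here is a minimal sketch that writes a GitHub-links CSV in this shape using Python’s csv module; the table coordinates and repository URL below are placeholders, not values from your warehouse.
Python
import csv

# Hypothetical example row in the expected GitHub-links shape.
row = {
    "db_name": "snowflake",
    "cluster": "gold",
    "schema": "test_schema",
    "table_name": "test_table1",
    "source": "https://github.com/example-org/example-repo",
    "source_type": "github",
}

with open("github_links.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()  # Stemma expects the column header literals
    writer.writerow(row)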
Airflow
Used to generate a link between a table and an Airflow task. Once the data is ingested, you will see an “Airflow” button on the Stemma detail page for the table.
task_id: the Airflow ID for the task.
dag_id: the Airflow ID for the DAG.
exec_date: the most recent execution timestamp for the task.
application_url_template: the URL for the DAG in your Airflow instance.
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
cluster: the name of the database within the data source.
Dagster
Used to link a Dagster op and/or job to a table in Stemma. Once the data is ingested, you will see a “Dagster” button on the Stemma detail page for the table.
application_url_template: the URL for the DAG in your Dagster instance.
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
Table Tags
Stemma can create and assign tags via a CSV file, with the following format:
db_name,cluster,schema,table_name,tags
hive,gold,test_schema,test_table1,"tag1,tag2"
hive,gold,test_schema_2,test_table2,"tag1,tag3"
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
tags: the tags to be created. Tags are additive: existing tags on tables will not be modified.
Table Description
db_name,cluster,schema,table_name,description
hive,gold,test_schema,test_table1,"Description for table, optionally markdown."
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
description: the table description. Note that descriptions are destructive: if a description already exists, it will be overwritten.
Table Owners
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
owners: email addresses or Slack channels. Owners are additive: existing owners of tables will not be modified.
Column Descriptions
db_name,cluster,schema,table_name,col_name,description
hive,gold,test_schema,test_table,test_column,This is an example column description.
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
col_name: the column described.
description: the column description. Note that column descriptions are destructive: if an existing description is present, it will be overwritten.
Finding Important Information about Tables and Columns
Stemma provides out-of-the-box integration with applications such as Snowflake, dbt, and BI tools such as Looker, Mode and Tableau to capture rich metadata from each of these systems automatically. For example, in the case of Snowflake, Stemma reads both the information_schema and the query logs, and automatically generates important information such as when a table was last updated, column-to-column lineage, table-to-table lineage, table-to-dashboard lineage, frequent queriers of the table, common join and filter conditions, etc.
The Stemma UI makes it easy for you to discover this important information about your tables and columns, and also provides a convenient way to add and modify curated information.
The table details page provides the following information:
About
Tags
Descriptive tags for cross-referencing, such as covid, cases, etc.
Asset status
This can be Certified, Deprecated, Intermediate, or No Status
Owners
The person or people responsible for maintaining this table
Click on an owner to see other assets owned by that user, as well as their bookmarks and frequently used tables. You can also email the user via this profile page.
Last Updated
The date and time of the last update to the table (this can be different from the most recent date covered by the data itself)
Date Range
The date range covered by the data in the table.
Columns
Descriptions are automatically ingested from a source (like dbt or Snowflake). If there is no description to ingest, Stemma will automatically use column-level lineage to search upstream for a description in an upstream column with the exact same name. This documentation is shown as “Autodescribed.”
Column descriptions can be created, and autodescribed documentation overwritten, within Stemma. Also available in this table are the data type, related glossary terms, and badges.
Lineage
dbt Compiled Query
For dbt users, see the query that formed the table, as well as the dbt model type.
Upstream, Downstream
Upstream and downstream tables and dashboards, sorted by popularity.
Where possible, Stemma captures the information directly from the source. This table shows which attributes are populated from the source and which you can edit:
Table Metadata | Automatically captured? | Users can edit?
Table description | ✅ | ✅
Asset status | ❌ | ✅
Owners | ✅ (dbt) | ✅
Tags | ✅ (dbt) | ✅
Jira issues | ❌ | ✅
Last updated time | ✅ | ❌
Data date range | ✅ | ❌
Commonly used with tables | ✅ | ❌
Frequent users | ✅ | ❌
Upstream & downstream table lineage | ✅ | ❌
Dashboards that use the table | ✅ | ❌
Slack threads with table mentions | ✅ | ❌
Advanced Description Editing
The Stemma UI provides a WYSIWYG editor for descriptions of tables and columns.
Using the editor
To begin editing, navigate to the table or column whose description you want to edit, and click on the pencil icon next to Description:
This opens up the text box for editing:
Add or change the description.
Descriptions can include links and images as well as text.
You can choose font styles as indicated by the buttons.
Use the pull-down menu to toggle the text format from the default (Normal) to common formatting elements such as headings, lists, quotes, and code blocks.
When you choose Code Block, you can choose the type of code from the secondary pull-down menu (the default is JavaScript).
You can add hyperlinks: enter and highlight the text to display, then click the 🔗 button and enter the URL (and edit it if necessary) in the text box next to the button.
The editor also supports Markdown syntax, as well as any shortcuts your system supports (such as ⌘-b for bold on a Mac)
When you are finished, click Update to save your changes.
Linking to other tables from the description
This is useful, for example, if you want to give other users more information about the relationship between the tables.
To link to another table from the description, do the following:
In the Stemma UI, navigate to the table whose description you want to edit.
Click on the pencil icon next to Description.
Enter the new description, providing the related table’s <schema_name>.<table_name> anywhere in the text.
Click Update to save your changes.
The new description will contain the link to the related table:
Sharing column descriptions
You can share a column description with all columns of the same name across the catalog. To do this:
In the Stemma UI, navigate to the table whose column description you want to share.
To add or modify a description, click on the down-arrow to the left of the entry for the column:
Hover over the word Description and click on the pen that appears:
Add or update the description and click on Update. Stemma automatically discovers all columns with the same name, and the description propagates to all linked columns. To see which columns are linked, and add and remove links, click on the three dots to the right of the column entry:
Click on the Edit linked columns pop-up. In the dialog window, you can search for and add columns to link, and remove existing links.
Finding Important Information about Dashboards
The Stemma UI makes it easy for you to discover this important information about your dashboards, and provides a convenient way to add and modify curated information.
Status
This can be Certified, Deprecated, Intermediate, or No Status
Description
A brief summary of the contents of the dashboard:
Owners
The person or people responsible for maintaining this dashboard
This information is extracted automatically from the source and (unlike table ownership) can’t be edited in Stemma.
Tags
Descriptive tags for cross-referencing
Contents
Total Views
Total number of views over the life of the dashboard
Created
The date and time the dashboard was created
Last Updated
Date and time of the last update to the dashboard
Last successful run
Date and time the dashboard was last run successfully
Last run
Date and time the dashboard was last run (successfully or not)
Tables
Tables used by the dashboard
Charts
Charts available from the dashboard
Queries
Queries run from the dashboard. Click on “View Query” to see the code and copy it to your clipboard.
Open folder
Open the dashboard’s folder in the underlying application (you may need to log in to the application)
Open dashboard
Open the dashboard in the underlying application (you may need to log in to the application)
Using Admin Functions and Reports
API Keys
Allows you to add and delete keys for the GraphQL API.
Reports
These reports provide an account of your organization’s interactions with the Stemma catalog, including curation work that needs to be done, as well as usage. They can be especially useful when you are rolling out Stemma to your users and scoping the work that remains.
Customer Summary Shows the number of data sources, tables, and dashboards that are integrated into the Stemma catalog.
Monthly product usage Shows the number of users and searches by month.
Documentation summary Shows the percentage of columns, dashboards, and tables missing documentation and/or ownership.
Assets without docs Shows the specific tables and dashboards that lack documentation (click on the drop-down to switch between the two types of asset).
Assets without ownership Shows the specific tables and dashboards that lack ownership (click on the drop-down to switch between the two types of asset).
How Stemma Captures Data: SQL Parsing Overview and Caveats
What Stemma captures
When you execute SQL against your warehouse, the database captures the raw SQL that was executed and saves it in a historical format that includes the time that the SQL command was executed as well as the user who executed it. Stemma reads these SQL logs, along with access history, and parses all of the SQL commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page:
Table and column lineage: any change to a table (via an insert, update, or delete from another table) will create the appropriate table and column lineage tracking.
User table usage: Stemma uses select statements to associate users to the table(s) being queried; this provides the Frequent Users information on the Table Details page, search rankings, and the user’s profile page.
Commonly Used With: any joins that occur within the SQL, whether on the top-level query or as part of a sub-query, are captured and presented on the Table Details page to help other users understand how two or more tables are commonly joined.
Caveats:
When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.
When user IDs in the database do not match the user portion of the user’s email address as recorded in Stemma, Stemma cannot associate the Stemma user to the user in the query logs. This prevents Stemma from creating any user-table usage information.
When system accounts are used to query the data, Stemma is often not able to see which user is executing the query. This is a common practice with BI tools when multiple team members share the same connection to a database and that connection uses a system account.
However, if you are using Mode, Stemma can parse the comments appended by Mode (which do contain the user ID) to find the user and associate that user to the query.
Commonly Used With does not capture joins where the left or right table is a dynamically generated select statement, for example: select * from table_a a join (select * from table_b b1 join table_c c1 on b1.id = c1.id) b on a.id = b.id.
Planning Changes to Tables or Columns
When you are planning to make a backward-incompatible change, you need to proceed in a way that minimizes the impact and gives downstream stakeholders as much warning as possible. Examples of backward-incompatible changes are:
Changing a schema (type changed, name changed)
Changing values (data fields will no longer be populated, or will be populated differently)
You need to understand and communicate the impact before you make such a change. This is usually done in the following steps:
Understand the impact: understand who uses or queries this data.
Provide heads-up notification: notify downstream owners and users that this data is going to change, and that they need to update their assets (downstream tables or dashboards) accordingly. Provide a time window after which the change will go into effect.
Ideally, understand the impact again: see whether downstream users have updated their assets, how many still need to be nudged, and so on.
Make the actual change and follow up to make sure that it’s complete.
Stemma can help you make the change as painless as possible. Use the UI to assess the effect on downstream users and on other assets. The following sections provide help:
Now warn all owners and frequent users about the change, and give them a deadline to make their adjustments.
Note
If you decide to change the name of a table to indicate that it’s deprecated (e.g., to something like deprecated_tablename), that will cause it, by default, to be removed from Stemma after ten days.
Using the External GraphQL API
Creating and Installing an API Key for the External GraphQL API
1. Create an API key in the Stemma UI:
a. Navigate to the Admin panel and choose Add API Key:
b. In the pop-up window, enter a name for your API key and click Submit. You can choose any name, but Stemma recommends using a name that describes the intended use of the key (e.g., github action integration):
Stemma generates an API key that appears in the window. Save this key somewhere secure. It is not recoverable.
Now click Docs in the upper right of the Altair UI; you should see the GraphQL schema on the right-hand side:
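Once the key is installed in your GraphQL client, you can also call the API programmatically. The following is a minimal sketch using Python’s requests library; the endpoint path and Authorization header shown here are assumptions for illustration, so use the endpoint and authentication scheme shown in your Stemma deployment and the Altair UI.
Python
import requests

# Hypothetical endpoint and key placeholder; replace with your deployment's values.
STEMMA_GRAPHQL_URL = "https://<customer>.stemma.ai/api/graphql"
API_KEY = "paste-your-saved-api-key-here"

# A standard GraphQL introspection query, useful as a connectivity check.
query = "{ __schema { queryType { name } } }"

resp = requests.post(
    STEMMA_GRAPHQL_URL,
    json={"query": query},
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
print(resp.json())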
Getting Started with the Stemma Business Glossary
What Is a Business Glossary?
A business glossary is a repository of business terms and their definitions that an organization uses to ensure the same definitions are used company-wide when analyzing data. It allows data analysts, business users, and data engineers to agree on the definition of real-world terms and how those are represented in the data, and it serves as the source of truth for those definitions.
Used this way, the glossary produces a common vocabulary that everyone in the organization uses. This common vocabulary is a key component of data governance, ensuring that everyone means the same thing when they use the same term. For instance, one department may define the term “customer” as a company while another department may use the term to refer to an individual. A business glossary prevents such discrepancies.
In short, a business glossary provides canonical definitions of common terms and fosters shared understanding of those terms throughout the organization.
The Stemma Business Glossary
This first version of the Stemma Business Glossary allows you to define and group terms that are used in your organization and assign owners to those terms so that the definitions remain consistent and up to date. These terms can be specific to the data sources in the Stemma catalog, or they can be terms that are used throughout the organization or any part of it.
You can also link those terms to table columns in the Stemma catalog.
In the future, you will also be able to link terms to other resources in the catalog, and to queries and search results.
Using the Glossary
Populating the Glossary
To begin populating the glossary, proceed as follows.
From the navigation bar of the Stemma UI, choose Glossary:
Choose + New Group or + New Term.
In the dialog window, add a name and a definition or description.
– You can use Markdown syntax for the description or definition.
– A group is a collection of related terms. For example, you might want to group terms by department or function: finance, marketing, etc.
– You can add a term to an existing group, or create a new term without assigning it to a group. (You can’t create a group on the fly while adding or editing a term.)
– You can add related columns and one or more owners: start typing to find candidates.
When you are finished, click Save.
Updating the Glossary
To update the glossary:
From the navigation bar of the Stemma UI, choose Glossary.
To modify a group, click on the group and then click Edit. You can change the group’s attributes and also add new terms to it.
To modify a term, search for it by name, or click the > next to the group and select the term from the list; then either:
Click Edit and make your changes to any or all attributes of the term. You can change the group the term belongs to, so long as the new group already exists.
OR
Click Delete to remove the term from the glossary.
When you are finished, click Save.
Integrating Your Applications with Stemma: What We Need from You
What we need to get you up and running depends on the applications you will be integrating with Stemma, but in general you will need to provide the information and do the tasks for Slack, SSO, and your dashboard and storage applications.
Athena Integration
Stemma’s Athena integration supports extracting metadata information (tables, columns, etc.).
Proceed as follows to create an Athena connection to Stemma.
Create a user
To integrate with Stemma we recommend creating a new user in IAM that has access specifically scoped to the actions and resources required. The following script enables the required access:
Terraform
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  # You are only creating IAM which are global, you may
  # change the region but it should not impact this script
  region = "us-east-1"
}

resource "aws_iam_user" "stemma_read_user" {
  name = "stemma-read-user"
}

resource "aws_iam_user_policy" "stemma_read_policy" {
  name = "stemma-read-policy"
  user = aws_iam_user.stemma_read_user.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "athena:GetDatabase",
          "athena:ListDataCatalogs",
          "athena:GetDataCatalog",
          "athena:ListDatabases",
          "athena:ListTableMetadata",
          "glue:GetDatabases",
          "glue:GetTables"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}
Generate credentials
After creating the user, generate an AWS Access Key and Secret for the user.
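If you prefer to script this step, the following is a small sketch using boto3 (assuming your local AWS credentials are allowed to manage IAM and that the Terraform above created the stemma-read-user user):
Python
import boto3

iam = boto3.client("iam")

# Create an access key for the Stemma read-only user.
resp = iam.create_access_key(UserName="stemma-read-user")
access_key = resp["AccessKey"]

# Store these securely; the secret is only returned once.
print("AccessKeyId:", access_key["AccessKeyId"])
print("SecretAccessKey:", access_key["SecretAccessKey"])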
Provide credentials to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose AWS as the Connection type, provide a Connection Name, choose Access Key as the Authentication type, provide the AWS Access Key and AWS Access Key Secret you have just created, and specify the AWS Region:
When you are finished, click Save Connection.
Testing the access
Optionally, you can validate the access to the catalogs, databases, and tables that Stemma will have with the following code snippet; make sure to replace ACCESS_KEY and ACCESS_KEY_SECRET with the values for the Stemma Athena user:
import json
import boto3
from botocore.config import Config
client = boto3.client(
"athena",
aws_access_key_id="{ACCESS_KEY}",
aws_secret_access_key="{ACCESS_KEY_SECRET}",
config=Config(region_name="us-east-1"),
)
all_tables = []
for catalog in [cat['CatalogName'] for cat in client.list_data_catalogs()['DataCatalogsSummary']]:
    for database in [database["Name"] for database in client.list_databases(CatalogName=catalog)['DatabaseList']]:
        for table_name in [tbl['Name'] for tbl in client.list_table_metadata(CatalogName=catalog, DatabaseName=database)['TableMetadataList']]:
            all_tables.append(f'{catalog}.{database}.{table_name}')
print('All tables available:')
print(json.dumps(all_tables, indent=2))
Azure Active Directory Integration
On the Azure Active Directory page, click Enterprise applications.
Click New application on the top menu.
Click Create your own application.
Enter a name for your application; we suggest “Stemma Data Catalog”.
Select Register an application to integrate with Azure AD (App you’re developing).
Click Create. This should redirect you to Register an application.
Under Supported account types, select Accounts in this organizational directory only (Single tenant).
Under Redirect URI (optional), select Web. If you are running a hosted version of Stemma, the redirect URI should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name. In case of on-premise deployment, please consult your Stemma representative for this.
Click Register.
Create a client secret
To create a client secret, follow these steps:
Go to Azure Active Directory and click App registrations. You must go to the top level in your Azure portal and then select Azure Active Directory. You can then select App registrations.
Select your newly added application. Make a note of the Application ID.
Click Certificates & secrets on the left-hand side and click New client secret.
Provide a description and an expiration length that follows your security organization’s guidelines.
Click Add.
Copy the Value corresponding to the newly-created client secret and client secret expiration. This is the Relying Party OAuth Client Secret. This secret value is never displayed again after you leave this page.
Set up application permissions
You need to set up two application permissions in your Azure portal so that you can use all of the Azure AD synchronization options in Stemma. You need to set up the following permissions:
Azure Active Directory Graph Directory.Read.All
Microsoft Graph Directory.Read.All
To set up permissions, do as follows:
Go to Azure Active Directory and click App registrations. You must go to the top level in your Azure portal and then select Azure Active Directory. You can then select App registrations.
Select your newly added application.
Click API permissions on the left-hand side and click Add a permission.
Click APIs my organization uses and click Windows Azure Active Directory.
Select the Azure Active Directory Graph > Directory.Read.All permission. a. Click Application permissions. b. Under Directory, click Directory.Read.All. c. Click Add permissions.
Select the Microsoft Graph > Directory.Read.All permission. a. Click Application permissions. b. Under Request API Permission, click Microsoft Graph. c. Under What type of permissions does your application require?, click Application permissions. d. Under Directory, click Directory.Read.All. e. Click Add permissions.
Click Grant admin consent for and then click Yes. You should see a message saying that you’ve granted consent for each of your permissions.
What We Need From You
Provide Stemma with the following information by email or Slack:
Application ID
Client Secret
Discovery Document Endpoint (Well-Known OpenID Configurations; see the example pattern below)
Tenant ID – The ID of the tenant you created the application in.
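If you are unsure where to find the discovery document endpoint, it typically follows the Microsoft identity platform pattern below; treat this as a reference and confirm it against the Endpoints panel of your app registration:
https://login.microsoftonline.com/<Tenant ID>/v2.0/.well-known/openid-configuration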
BigQuery Integration
Stemma extracts table and column metadata, including database, schema, table name, table description, column name, and column description, from a BigQuery database.
What We Need From You
You will need to create a service account for reading metadata and grant it “BigQuery Metadata Viewer” access to all of your datasets. This can all be done via the BigQuery UI.
Next, collect the information Stemma needs, as follows:
Authorization/Access Scopes: Stemma uses the following scopes to extract metadata and usage information.
— https://www.googleapis.com/auth/bigquery.readonly
— https://www.googleapis.com/auth/cloud-platform
A JSON Key File: Proceed as follows to create the JSON key file for a service account.
In the Service account name field, enter a name. The Cloud Console fills in the Service account ID field based on this name.
In the Service account description field, enter a description; for example, Service account for Stemma BigQuery Integration.
Click Create and continue.
Click the Select a role field, and make sure to select the following:
— BigQuery Data Viewer (roles/bigquery.dataViewer) – to extract the metadata, watermarks, and usage information. Popular tables will be generated using this information.
— BigQuery User (roles/bigquery.user) – needed to map the user’s information with the tables, i.e., Frequent Users, Table Owners, etc.
— Private Logs Viewer (roles/logging.privateLogViewer) – needed for enriching table usage.
Click Continue, and click Done to finish creating the service account.
Do not close your browser window. You will use it in the next step.
Create a service account Key:
In the Cloud Console, click the email address for the service account that you created.
Click Keys, click Add key, then click Create new key.
Click Create. A JSON key file is downloaded to your computer.
Click Close.
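Optionally, before providing the key file to Stemma, you can sanity-check it with a short snippet; this is a sketch assuming the google-cloud-bigquery client library and a key file saved as stemma-key.json:
Python
from google.cloud import bigquery

# Load the service account key you just downloaded (hypothetical filename).
client = bigquery.Client.from_service_account_json("stemma-key.json")

# Listing datasets confirms the service account can read BigQuery metadata.
for dataset in client.list_datasets():
    print(dataset.dataset_id)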
Provide the information to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose Google Cloud as the Connection type, provide a Connection Name, and paste in the Service Account Key from the JSON file you have just downloaded.
When you are finished, click Save Connection.
dbt integration
Stemma integrates with your dbt Cloud or Core projects and is able to ingest and keep up to date all of the metadata in the table below.
Integrating with dbt Cloud requires minimal information: access to the dbt Cloud account of your organization and administrative access to Stemma.
The following 3 properties are needed from dbt Cloud:
Account ID
Usually visible within the browser URL on the account settings page: https://cloud.getdbt.com/next/settings/accounts/<account-id-number> or within any dbt Cloud project: https://cloud.getdbt.com/next/deploy/<account-id-number>/projects/<project-id>.
API Token
For organizations on the Team Plan, the account-wide permissions required are Metadata and Read-Only.
For organizations on the Enterprise Plan, the necessary permission set is Account Viewer and Job Viewer for all projects that should have metadata ingestion.
Host
Unless organization-specific customization has taken place, the host is likely cloud.getdbt.com
With these credentials in hand, all that’s left is adding the connection information to Stemma:
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection.
In the pop-up window, choose dbt as the Connection type, provide a Connection Name, and fill in the API Token, Host and Account ID collected from dbt Cloud.
When you are finished, click Save Connection
dbt Cloud integration may take up to 24 hours for initial ingestion.
The S3 prefix path will be “dbt/{date}”, where “{date}” is the current day’s date in UTC in the format “YYYY-MM-DD”. For example, for a file delivery on 04/01/2022 we would expect to see the files under the prefix dbt/2022-04-01.
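As a reference, here is a one-line sketch of how that date prefix is derived for the current UTC date:
Python
from datetime import datetime, timezone

# Produces a prefix such as "dbt/2022-04-01".
prefix = f"dbt/{datetime.now(timezone.utc).strftime('%Y-%m-%d')}"
print(prefix)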
Delta Lake Integration
Proceed as follows to create a Delta Lake connection to Stemma.
Step 1: Generate a Personal Access Token
Follow these instructions to generate the Delta Lake personal access token that Stemma will need to extract your data into the catalog.
Step 2: Provide the Personal Access Token and Hostname to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose Delta Lake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows
Provide the Personal Access Token you created in Step 1
Provide the fully-qualified domain name of the Delta Lake host; for example, mydlhost.mydomain.com.
When you are finished, click Save Connection
Google OIDC Integration
This document walks you through the steps to create and configure a Google OIDC application that Stemma can use to authenticate users.
Note: You must be a Google administrator to do this.
Steps
Log in to Google Developer Console
On the left, select Credentials
Select + CREATE CREDENTIALS at the top
Select “OAuth Client ID”
Enter the following information:
Application type: Web application
Name: You can choose the name; we suggest “Stemma OIDC” as a descriptive example
Download the Client ID and Client Secret JSON file and provide them to Stemma by email or Slack.
Hive Integration
What we need from you
Stemma needs certain information and credentials to extract Hive metadata into the catalog. Contact Stemma by email or Slack, and provide the following:
Metastore Host: the hostname or IP address of the Hive metastore server to which you are connecting.
Username and Password: You will need to provide a username and password for Stemma to use to access the Hive schema.
Metastore Port: Server port used for accessing metadata about hive tables and partitions. The default Hive metastore port is 9083.
List of Databases: Stemma whitelists databases, so you will need to provide a list of the databases we will be importing.
Metadata extracted
The metadata Stemma extracts includes:
TBLS – stores basic information about Hive tables, views, and index tables.
DBS – stores the basic information of all databases in Hive.
PARTITION_KEYS – the field information of the table storage partition.
TABLE_PARAMS – stores the attribute information of the table/view.
SDS – saves the basic information of file storage, such as INPUT_FORMAT, OUTPUT_FORMAT, whether it is compressed or not.
COLUMNS_V2 – stores the field information corresponding to the table.
PARTITIONS – stores the basic information of table partitions.
JIRA Integration
This document walks you through the steps you need to take to integrate Stemma with your JIRA installation. You must be a JIRA Admin to do this.
Steps
Step 1 – Collect information about your JIRA environment
Collect the following information to provide to Stemma (see Step 3):
The URL for your JIRA environment (e.g., <https://jira.net>)
The project ID for the JIRA project you would like Stemma to create tickets in. See this page for guidance from Atlassian on how to obtain the project ID for a JIRA project.
If using JIRA Server or Cloud, the ID for the JIRA Bug issue type
Note: Skip this step if using JIRA on-premise
See this page for guidance from Atlassian on how to obtain the ID for the Bug issue type
Optional: The name of one or more labels you would like set on tickets created by Stemma
Step 2 – Create a service account
Stemma will need an email and API token to authenticate to your JIRA environment. Our recommendation is to create a JIRA service account so that the authentication is not tied to an actual user’s account.
a. Go to Administration > User management > Users
b. Click Create User and fill out the required information
c. Create an API token from the service account
For JIRA on-premise this feature is not currently supported, so you will provide the password instead.
d. Grant permission to the service account
The service account will minimally need access to create and view tickets in the JIRA project you intend to use
You can either grant permissions directly to the service account or you can grant the service account a role
Stemma needs:
The email address of the service account
The API token or password
See Step 3.
Step 3 – Provide the information to Stemma
Provide the information from Steps 1 and 2 to Stemma by email or Slack.
Linking JIRA Issues to Stemma
To link a JIRA issue to a table in the Stemma catalog, you need to include a Table Key tag in the issue description. The tag format is Table Key: {SOURCE}://{DATABASE}.{SCHEMA}/{TABLE} [PLEASE DO NOT REMOVE].
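For example, a hypothetical issue description for the test_table1 sample table used elsewhere in these docs could build the tag like this (all coordinates are placeholders):
Python
# Hypothetical coordinates; substitute your own source, database, schema, and table.
source, database, schema, table = "snowflake", "gold", "test_schema", "test_table1"

table_key_tag = f"Table Key: {source}://{database}.{schema}/{table} [PLEASE DO NOT REMOVE]"
print(table_key_tag)
# Table Key: snowflake://gold.test_schema/test_table1 [PLEASE DO NOT REMOVE]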
Looker Integration
This document walks through the steps required to create a service account in your Looker, which will be connected to your Stemma Instance for dashboard metadata extraction.
Stemma uses the Looker APIs (via the Looker Python SDK) to extract the metadata of Folders, Dashboards, Dashboard Elements, and Explore Queries, and the data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Looker. Read Looker’s docs: https://docs.looker.com/reference/api-and-integration
Stemma leverages API3 keys (Client ID & Client Secret) to make a secure connection between your Stemma and Looker instances. Below are the steps required to create new API3 keys for this integration.
What is extracted?
Stemma extracts information about Folders, Dashboard, Dashboard Elements, Explore Queries, and data sources associated with them.
Stemma will not extract Dashboards that are deleted, hidden or in a user’s personal folder.
Steps
Step 1: Make sure you have a Looker Admin role
Looker → Stemma integration requires an Admin role to fetch the correct data and build the relationships between the entities like Table Lineage, Dashboard Owners, etc.
So, as a first step, make sure you have an Admin role.
Step 2: Generate API3 keys
The next step is to create an API3 Key. From the Admin menu, navigate to the Users page.
Find the user you want to use to generate the API3 key, and click the “Edit” button under the Actions column.
Once you are on the “Edit User” page, find the API3 Keys settings, and click “Edit Keys“
Click “New API3 Key” to generate a new Client ID and Client Secret.
Step 3: Provide the details to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Looker as the Connection type, provide a Connection Name, and then fill out the fields in the dialog box as follows:
Base URL Example: <https://stemma.cloud.looker.com/>
Mode Integration
The Stemma Mode integration accesses the Mode Discovery API. This API is distinct from the regular Mode API: it is much more performant but only contains a subset of the available endpoints. This means Stemma is able to access only metadata about reports, not the report runs/results. Stemma also accesses the regular REST API to augment the information retrieved from the Discovery API. In all, there are four values required by Stemma:
Your organization name in Mode
The access token to access the REST API
The password token to access the REST API
A signature token to access the Discovery API
Optionally, if you would like Stemma to only extract metadata from specific Mode spaces you may also provide Stemma a list of the spaces to include.
The access token and password token allow Stemma to augment the information within the Discovery API with the regular REST API to unlock two additional capabilities:
The ability to link your dashboards to the underlying tables in your warehouse which are used to serve the dashboards.
Stemma will use the general Mode API for data sources to retrieve your data source information: the name of the database (e.g., PUBLIC, mktg, etc.) as well as the type of data source (e.g., Snowflake, Redshift, etc.), in order to apply the proper SQL parsing semantics.
The ability to rank your dashboards by the number of times the dashboard has been viewed in the past 90 days.
Stemma will use the general Mode reports API to retrieve an accurate view count.
Creating the Signature Token
To authenticate with the Mode Discovery API, a signature token must be created. This signature token can be created by running a curl command using the account’s REST API access token and password secret. The following code is a snippet from Mode that has been slightly adjusted to create a signature token that Stemma can use. Make sure to replace {organization}, {your-access-token}, and {your-access-secret} with the appropriate values (the organization can be found in the URL when you visit Mode, e.g., https://app.mode.com/<organization>/spaces). Visit the Mode documentation for information on creating the api-key and api-secret.
Please note: if this does not work, it likely means you do not have permissions. Please contact Mode support for help creating a read-only signature token.
APIs used
APIs are listed in the table below. Data Extracted is the data we actually store from the API (and may change as we add functionality). Risks highlights any potentially sensitive information in the API (whether or not we actually store it).
Please note that the “Report Stats” endpoint is not enabled by default in Mode, so functionality coming out of it may not be available until the endpoint is enabled.
Low: contains access to information about the data source. Stemma should separately have access to this information but this API access is required to link a mode dashboard to a specific database.
Notes
Here are some additional notes about possibly sensitive information we may receive from Mode:
Dashboard execution times are non-sensitive on their own, although it is possible to infer NPMI from who is running which dashboard queries at precise times. Stemma only displays aggregated forms of this information, but stores it disaggregated.
Query names may contain NPMI (e.g. a query titled “EMEA Expansion Potential”, when the company hasn’t announced its intention to expand into EMEA).
Raw SQL queries may contain sensitive information (PII), typically in the form of filter clauses. This is generally not allowed under most security policies, especially under Open Collections; however, it is a possibility. By default, these SQL statements will be shown to users in Stemma.
Configuring Your Stemma Connection
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, optionally add a name for the connection, and choose Mode from the Connection Type drop-down menu.
Okta OIDC Integration
This document walks you through the steps to create and configure an Okta application that Stemma can use to authenticate users.
Note: You must be an Okta administrator to do this.
Steps
Log into Okta admin portal
In the left frame, select Applications:
Select Create App Integration:
Select the options for OIDC – OpenID Connect and Web Application.
In the New Web App Integration, enter the following values:
App Integration Name: your app name; we suggest Stemma Data Catalog
Grant Type: select the following –
Authorization code
Client Credentials (Client acting on behalf of itself)
Refresh token
Sign-in redirect URI:
Use the fully qualified hostname and add the endpoint:
/auth
If you are running a hosted version of Stemma, this should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name
Sign-out redirect URI This should be the base Stemma URL. Example:
https://<customer>.stemma.ai
Trusted Origin:
If you are self-hosting Okta, provide the Base URI for your Okta Sign domain, otherwise, leave this blank
Assignments:
Select the assignment access you would like within your organization
Stemma will require access by certain APIs to retrieve user information from your organization’s directory.
Navigate to Okta API Scopes:
Make sure to grant permission for the following scopes:
okta.users.read
okta.users.read.self
Provide the following values to Stemma by email or Slack so that authentication can be built into your Stemma deployment:
Okta Discovery endpoint (well-known configs)
Client ID
Client Secret
Redshift Integration
What we need from you
Stemma needs certain information and credentials to extract Redshift metadata into the catalog. To provide these, proceed as follows.
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Redshift as the Connection type and then provide the following information in the dialog box:
A username and password for Stemma to use to access the INFORMATION_SCHEMA
The hostname of the Amazon Redshift cluster. You can find it in the Properties tab of your Redshift Cluster Details page. Under the Connection details section, copy the Endpoint up to but not including the colon.
A port number. The default port number for Redshift is 5439. From your Redshift Cluster Details page, you can find the port number under the Database configurations section of the Properties tab.
A list of databases: Stemma whitelists databases, so you will need to provide a list of the databases we will be importing.
Access
The Stemma Redshift integration requires access to two pieces of information within the Redshift install:
information_schema metadata
Query logs
information_schema
The specific tables and views we access include:
PG_GET_LATE_BINDING_VIEW_COLS
svv_external_columns
COLUMNS
Because of the way permissions work in Redshift, these tables are only populated with data to which the user has SELECT access. If you don’t want to use column stats, and therefore want to restrict access for this credential, Stemma will give you the SQL code required to create a materialized view. You create that materialized view as a superuser, and grant our credential the ability to read and refresh that materialized view.
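Optionally, you can sanity-check the credential’s access to the metadata before sharing it. The following is a sketch using the psycopg2 driver; all connection values are placeholders:
Python
import psycopg2

# Placeholder connection details for the Stemma credential.
conn = psycopg2.connect(
    host="examplecluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="stemma_user",
    password="********",
)

with conn.cursor() as cur:
    # Only columns of tables the user can SELECT from will be returned.
    cur.execute(
        "select table_schema, table_name, column_name "
        "from information_schema.columns limit 10"
    )
    for row in cur.fetchall():
        print(row)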
Query logs
Stemma Redshift requires query logs. We can extract these in two ways: push or pull.
Push
The recommended integration involves configuring Redshift to dump user activity logs into S3.
There are multiple methods to achieve this, either using CloudTrail or directly from Redshift. The right approach differs between organizations (many already have audit logging configured for other purposes).
Pull
Stemma can extract logs from STL_QUERY. However, because this option requires the credentials be a superuser, it is not the recommended integration method.
Column stats – Optional
The Stemma integration user must have full SELECT access to any tables or views for which you want to see column stats.
Slack Integration
This document walks through the steps required to install a Stemma Slack application in your workspace.
Prerequisite
Stemma does not publish the Stemma Slack application to the Slack Marketplace. To install this application for your workspace, you need to create a new application using an application manifest file.
If you have an application manifest file, make sure you have it handy before you begin. If you don’t have a manifest file, you can copy and paste the following into a file:
_metadata:
  major_version: 1
  minor_version: 1
display_information:
  name: Stemma
  description: Help you enrich your metadata directly via Slack
  background_color: "#0a1930"
features:
  bot_user:
    display_name: Stemma
    always_online: true
  slash_commands:
    - command: /stemma
      url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/slash/help
      ## Replace {SUBDOMAIN} with your Stemma subdomain.
      description: Learn how StemmaBot works
      usage_hint: help
      should_escape: false
oauth_config:
  scopes:
    bot:
      - app_mentions:read
      - channels:history
      - channels:read
      - chat:write
      - commands
      - groups:history
      - groups:read
      - links:read
      - links:write
      - users:read
      - users:read.email
settings:
  event_subscriptions:
    request_url: https://{SUBDOMAIN}.stemma.ai/stemma/slack/events
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
    bot_events:
      - app_mention
      - link_shared
      - member_joined_channel
      - message.channels
      - message.groups
      - team_join
  interactivity:
    is_enabled: true
    request_url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/interactivity
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
  org_deploy_enabled: false
  socket_mode_enabled: false
  token_rotation_enabled: false
Save the file with any prefix and in any location you like, but make sure:
The filename has the suffix .yaml.
You replace {SUBDOMAIN} with your Stemma subdomain.
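If you want to fill in the placeholder and validate the file’s structure in one go, here is a small sketch assuming the manifest is saved as stemma_slack_manifest.yaml and the PyYAML package is installed; the subdomain value is a placeholder:
Python
import yaml

SUBDOMAIN = "acme"  # hypothetical Stemma subdomain

with open("stemma_slack_manifest.yaml") as f:
    manifest_text = f.read().replace("{SUBDOMAIN}", SUBDOMAIN)

# Parsing confirms the indentation and structure are valid YAML.
manifest = yaml.safe_load(manifest_text)
print(manifest["features"]["slash_commands"][0]["url"])

with open("stemma_slack_manifest.yaml", "w") as f:
    f.write(manifest_text)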
Steps
Step 1: Verify your application manifest file
Make sure that at least the following properties in the application manifest file match your Stemma endpoint:
features → slash_commands → url:
settings → event_subscriptions → request_url
settings → interactivity → request_url
Step 2: Create and install your Stemma Slack application
a. Create a new application using the application manifest file.
b. You will be prompted to select a method to create a Slack Application. Select From an app manifest, and choose the workspace in which you want to install the Stemma Slack application.
c. Paste in your manifest file. Make sure you select the YAML format, and paste in the contents of the file (see Prerequisite above).
d. Click Next and verify the scopes and events the Stemma Bot is requesting. Finally, click Create to install the application to your workspace.
Step 3: Provide Signing Secret and Bot Token to Stemma
To secure the communication between Slack and Stemma, you need to provide Stemma with the Signing Secret and Bot User OAuth Token of your newly installed Stemma Slack application:
Signing Secret:
You can find the Signing Secret on the Basic Information page, in the App Credentials section.
Bot User OAuth Token:
You can find this token on the OAuth & Permissions page, in the OAuth Tokens for Your Workspace section.
Now provide the Signing Secret and Bot User OAuth Token to Stemma:
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Slack as the Connection type, provide a Connection Name, and then in the dialog box enter the Bot Token and Signing Secret.
When you are finished, click Save Connection.
Step 4: Receive confirmation from Stemma
Stemma will use the information you provide to integrate your Slack workspace with your Stemma Instance. Once you get the confirmation, you should be able to type /Stemma help in Slack to see usage help for the Stemma Slack application.
Step 5: Make sure to add an app icon 😊
Under the Stemma (Slack application) Basic Information tab, add the Stemma icon in the “Display Information” section. Users will be much more likely to use the bot if it’s easy to identify.
Right-click and “Save as…” and upload this icon to use as the Slack App icon
(Nice to have) Please invite Stemma team member(s) to a shared test channel
At this point, you have the Stemma Slack application installed in your Slack workspace. To verify that things are working with no errors, you can create a temporary shared channel and invite a couple of members from the Stemma team to that channel.
Snowflake Integration
Stemma needs certain information and credentials to extract Snowflake metadata into the catalog. To provide these, proceed as follows.
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Snowflake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows:
Credentials: We support authentication with a username and either a password or a private key.
Username and Password: You need to provide a username and password for Stemma to use to access the INFORMATION_SCHEMA as described above.
Username and Private Key: If you set up authentication via key pairs you will need to provide Stemma with a private key to use when accessing INFORMATION_SCHEMA as described above. If your private key is encrypted, you will also need to provide Stemma with the passphrase.
Snowflake Account Name: If you use the Snowflake web interface, the URL you use should look something like abc.snowflakecomputing.com. Everything that appears before .snowflakecomputing.com is your account name.
Warehouse: The name of the warehouse (also called Virtual Warehouse) Stemma will connect to.
Role: The role Stemma will use once connected. If Stemma is extracting query logs, this role will need to have MONITOR access on the warehouse.
When you are finished, click Save Connection.
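Optionally, before or after saving the connection, you can verify the account name, warehouse, and role yourself. This is a minimal sketch assuming the snowflake-connector-python package and username/password authentication; all values are placeholders:
import snowflake.connector

# All values below are placeholders; use the ones you plan to give Stemma.
conn = snowflake.connector.connect(
    account="abc",              # the part of the URL before .snowflakecomputing.com
    user="STEMMA_USER",
    password="********",        # key-pair auth is also possible via the private_key argument
    warehouse="STEMMA_WH",
    role="STEMMA_ROLE",
)
cur = conn.cursor()

# Confirm the session is using the expected warehouse and role.
cur.execute("select current_account(), current_warehouse(), current_role()")
print(cur.fetchone())

# List the databases visible to this role; each has its own INFORMATION_SCHEMA.
cur.execute("show databases")
for row in cur.fetchall():
    print(row[1])  # the database name is the second column of SHOW DATABASES output
conn.close()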
Access
The Stemma Snowflake integration requires read access to all tables that will be included in the catalog, plus query logs.
Metadata
The specific tables and views we access include:
TABLES
COLUMNS
Due to the permissions model of Snowflake, these tables are only populated with data to which the role has SELECT access. Metadata will be extracted one database at a time since each Snowflake database has its own INFORMATION_SCHEMA.
Query logs
Stemma extracts logs from the QUERY_HISTORY table function in the INFORMATION_SCHEMA. Logs are used to inform lineage, usage, common usage patterns, and more. Due to the permissions model of Snowflake, this table function is only populated with queries executed in virtual warehouses on which the role has MONITOR access.
Access History
Stemma uses ACCESS_HISTORY to enrich lineage derived from query logs. To grant Stemma access, Stemma's user role will need the IMPORTED PRIVILEGES grant on the SNOWFLAKE database:
grant imported privileges on database snowflake to role stemma_role;
If Stemma cannot read Access History, lineage will still work; however, accuracy will be reduced.
Note: If you try giving the Stemma role access only to the required view, without giving access to the SNOWFLAKE database and ACCOUNT_USAGE schema before that, Snowflake will probably reject your request with the “Grant not executed: Insufficient privileges.” error.
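If you want to confirm that the role Stemma will use can actually see query logs and Access History, a quick check along these lines can help (again a sketch assuming snowflake-connector-python; the connection values and database name are placeholders):
import snowflake.connector

# Placeholders: reuse the values from the connection you are giving Stemma.
conn = snowflake.connector.connect(
    account="abc",
    user="STEMMA_USER",
    password="********",
    warehouse="STEMMA_WH",
    role="STEMMA_ROLE",
    database="MY_DB",  # any database the role can use; needed so INFORMATION_SCHEMA resolves
)
cur = conn.cursor()

# Query logs: only queries run on warehouses the role can MONITOR are returned.
cur.execute("select query_text from table(information_schema.query_history(result_limit => 5))")
print("query_history rows visible:", len(cur.fetchall()))

# Access History: requires IMPORTED PRIVILEGES on the SNOWFLAKE database.
cur.execute("select query_id from snowflake.account_usage.access_history limit 5")
print("access_history rows visible:", len(cur.fetchall()))
conn.close()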
Tableau Integration
Usage information
The Stemma Tableau integration will sum view counts for all workbooks in a particular dashboard over all time.
Installation Instructions
This document walks through the steps required to create a service account in your Tableau instance, which will be connected to your Stemma instance for metadata extraction.
Stemma uses the Tableau Metadata API (GraphQL) to extract the metadata of projects, workbooks, dashboards, and data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Tableau. Read Tableau’s docs: https://help.tableau.com/current/api/metadata_api/en-us/index.html
Connection Type:
We support both Online and Extract Connections with no differences.
Steps
Step 1: Decide your authentication strategy
The Tableau Metadata API is authenticated using a Personal Access Token (PAT) attached to a specific user within your organization. You will always share with us the PAT, not the username/password of the user itself. The PAT has the same permissions as the user to which it is attached.
Most users will simply create a PAT attached to their admin account. This is the easiest to set up and the simplest to manage longer term, and we recommend it. For this option, skip to Step 4.
However, because this means the PAT has write access to the instance and Stemma doesn't use write access at all, this doesn't strictly follow the principle of least privilege. In cases where restricting such access is critical, we recommend creating a read-only service account and then generating a PAT for that limited account. Continue to Step 2 for this option.
Step 2: Create a new Viewer user
If you decide to create a new user for this integration (otherwise skip to Step 4 to use your current user to generate the Personal Access Token), navigate to the Users menu from the side navigation.
Click on “Add Users” and select “Add Users by Email“. A popup will appear with a form to create new users.
Select the “Tableau” radio button and enter the email address you want to use. You will not share this email address with Stemma; you will use it to sign in to the Tableau instance in order to create a PAT.
Email Tip: If you use Google email, a plus-addressed alias (for example, [your_current_user]+[alias]@[your-domain]) will allow you to receive emails at your current email address and satisfy the uniqueness requirement.
Select “Viewer” as the Site role, which will give this user minimum access to Tableau, and click “Add Users” to finalize the user creation.
Step 3: Set up permissions for the user
The next step is to make sure the user has permission to read your externally connected data sources. This will be used to link the workbooks with the tables in Stemma. By default, a “Viewer” user will not have access to any data source.
Click on “External Assets” in the side navigation, then click “Select All“. A new dropdown will appear. Click on the “Actions” dropdown and select “Permissions“.
A Permissions dialog will appear, with a “Permission Rules” section. By default, you will be on the “Database” tab.
Click “+ Add Group/User Rule” and select the user you created in Step 2 of this document, set the Template “View” and Save the rule.
Click on the “Tables” tab, and repeat the steps to add the rule for the user with the “View” template.
Stemma uses information about the upstream data sources in order to automatically create data lineage from your dashboard to your upstream table. In order to enable this, Stemma needs access to read the data source metadata. To enable the Stemma user permissions required by the Tableau Metadata API, you will need to provide “View” and “Connect“ access to each data source and workbook that Stemma will read from.
Step 4: Create a Personal Access Token
Log in to your Tableau instance with the user you want to use for Stemma integration. If you created a new user for this integration, make sure you log in as that user.
Select My Account Settings from the dropdown of your profile.
Scroll down to the “Personal Access Tokens” section, enter the token name “stemmaViewer” and click “Create new token”.
A new pop-up will appear with the personal access token and secret. Make sure you copy these somewhere safe as they will not be displayed again.
Step 5: Provide the details to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose Tableau as the Connection type, provide a Connection Name, and then in the dialog box fill out the fields as follows:
Base URL. Example: https://sample.online.tableau.com
Optional: You can opt to exclude or include some items by providing the following to Stemma by email or Slack:
Name of Projects to Exclude: If you want to exclude any sample projects or test projects from Stemma, provide the names of those projects.
External Assets Types to Include (Connection type): This is uncommon, but if you want to include and link only certain types of external assets with your Tableau workbooks in Stemma, provide the names of the asset types you want to include and link. Note that this may limit the user experience of Stemma, and is not recommended.
Connecting to AWS S3
Stemma supports two methods of transferring files:
Stemma creates an S3 bucket: you provide Stemma the ARN for a role/group that should have access to read from and write to the bucket. You then provide the role/group to the users or resources that need access to the bucket. Stemma recommends this method.
You create the S3 bucket: Stemma provides you the ARN of the role that the Stemma app uses. You will need to add the following policy to the bucket to allow Stemma access. Make sure you replace <BUCKET_NAME> and <ARN_FROM_STEMMA> with the appropriate values.
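The exact policy text to attach is the one Stemma provides. As a rough illustration only, the sketch below shows how a bucket policy of this general shape could be attached with boto3; the actions listed here are assumptions, not Stemma's published policy, and <BUCKET_NAME> and <ARN_FROM_STEMMA> are the placeholders described above:
import json
import boto3

bucket_name = "<BUCKET_NAME>"          # placeholder
stemma_role_arn = "<ARN_FROM_STEMMA>"  # placeholder

# Illustrative policy only: grant the Stemma role read/write access to the
# bucket and its objects. Use the policy document Stemma actually provides.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "StemmaBucketAccess",
            "Effect": "Allow",
            "Principal": {"AWS": stemma_role_arn},
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))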
Stemma's Athena integration supports extracting metadata information (tables, columns, etc.).
Proceed as follows to create an Athena connection to Stemma.
Create a user
To integrate with Stemma we recommend creating a new user in IAM that has access specifically scoped to the actions and resources required. The following script enables the required access:
Terraform
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  # You are only creating IAM resources, which are global; you may change
  # the region, but it should not impact this script.
  region = "us-east-1"
}

resource "aws_iam_user" "stemma_read_user" {
  name = "stemma-read-user"
}

resource "aws_iam_user_policy" "stemma_read_policy" {
  name = "stemma-read-policy"
  user = aws_iam_user.stemma_read_user.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "athena:GetDatabase",
          "athena:ListDataCatalogs",
          "athena:GetDataCatalog",
          "athena:ListDatabases",
          "athena:ListTableMetadata",
          "glue:GetDatabases",
          "glue:GetTables"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}
Generate credentials
After creating the user, generate an AWS Access Key and Secret for the user.
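You can do this from the IAM console, or programmatically. As a small sketch, the same step with boto3 (run with credentials that are allowed to manage IAM) looks like this:
import boto3

iam = boto3.client("iam")

# Create an access key for the user defined in the Terraform above.
response = iam.create_access_key(UserName="stemma-read-user")
access_key = response["AccessKey"]

print("Access key ID:", access_key["AccessKeyId"])
# Store the secret securely; it cannot be retrieved again later.
print("Secret access key:", access_key["SecretAccessKey"])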
Provide credentials to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose AWS as the Connection type, provide a Connection Name, choose Access Key as the Authentication type, provide the AWS Access Key and AWS Access Key Secret you have just created, and specify the AWS Region:
When you are finished, click Save Connection.
Testing the access
Optionally, you can validate the access to the catalogs, databases, and tables that Stemma will have with the following code snippet; make sure to replace ACCESS_KEY and ACCESS_KEY_SECRET with the values for the Stemma Athena user:
import json
import boto3
from botocore.config import Config
client = boto3.client(
    "athena",
    aws_access_key_id="{ACCESS_KEY}",
    aws_secret_access_key="{ACCESS_KEY_SECRET}",
    config=Config(region_name="us-east-1"),
)

all_tables = []
for catalog in [cat['CatalogName'] for cat in client.list_data_catalogs()['DataCatalogsSummary']]:
    for database in [database["Name"] for database in client.list_databases(CatalogName=catalog)['DatabaseList']]:
        for table_name in [tbl['Name'] for tbl in client.list_table_metadata(CatalogName=catalog, DatabaseName=database)['TableMetadataList']]:
            all_tables.append(f'{catalog}.{database}.{table_name}')

print('All tables available:')
print(json.dumps(all_tables, indent=2))
What Information Does Stemma Extract Automatically?
The table below shows what information you can expect to see in Stemma after you have integrated all of your applications.
| Source | Metadata | User Usage | Lineage | Query Usage | Column Stats |
| --- | --- | --- | --- | --- | --- |
| Snowflake | ✅ | ✅ | ✅ | ✅ | ✅ |
| Redshift | ✅ | ❌ | ❌ | ❌ | ❌ |
| BigQuery | ✅ | ✅ | ✅ | ❌ | ❌ |
| Postgres | ✅ | ❌ | ❌ | ❌ | ✅ |
| Athena | ✅ | ❌ | ✅ | ❌ | ❌ |
| Delta Lake | ✅ | ❌ | ❌ | ❌ | ❌ |
Metadata: tables and columns, descriptions. Where available, the time that the table was last updated. See also Sharing Additional Metadata.
Usage: which users are frequently running queries on a given table.
Lineage: how data moves between tables, views, and dashboards. See also Tracking Data Lineage.
Column stats: statistics (mean, min, max, null count, etc.) for columns in a table.
If you don't want Stemma to ingest all the available data in the applications you integrate (for example, if you want to ingest just one schema from a data lake), you can tell Stemma what to ignore: contact Stemma by email ([email protected]) or talk to your support rep.
If there's a Curated and an Ingested Description, Which Will I See in Stemma?
If Stemma ingests the description of a table or column (for example, from Snowflake) and you subsequently edit or replace that description in Stemma, Stemma users will see that curated description from then on, even if Stemma subsequently re-ingests the original Snowflake description (or any other version of it).
At present, Stemma does not provide a way for you to see both the Stemma description and the ingested description side by side.
What Happens when a Table or Dashboard Is Removed?
Removing a table or dashboard causes it, by default, to be removed from Stemma after ten days. Similarly, if the table or dashboard is renamed or moved, the original entry is removed from Stemma after ten days.
Before removing an asset, you should deprecate it to give stakeholders sufficient warning.
Guidelines for a Successful Stemma Rollout
Every Stemma rollout is different, depending on your organization’s particular needs, your data storage and management applications, and the people who use them in various ways and with varying priorities. So the following are guidelines, not rules, but they are based on the accumulated experience of Stemma’s founders and engineers, and are a good place for you to start.
Criteria for Success
The two criteria that matter in measuring the success of a data catalog are
adoption, and
customer satisfaction.
If your rollout achieves both of these to a high degree, you can declare victory in the knowledge that you have improved the lives of your users (in particular their productivity, efficiency, and awareness) and are contributing significantly to your organization's success.
The following broad steps are designed to help you get there.
Steps for Rollout and Adoption
Step 1. Start with a persona and its use-cases
There are many user personas and use cases for a data catalog. Here’s a simplified view of the most common ones. It’s less important which persona you start with than that you choose a specific group of alpha users.
Step 2. Launch in phases
Phase 1: Identify a small set of tables to get alpha user feedback on.
This set can be your organization's most commonly used tables (often referred to as “core” tables) or one domain within the company, such as marketing, growth, or finance. Core tables are often the best choice, because they have the most impact and there's often a central data team which is responsible for maintaining them.
Phase 2: Populate MVP metadata on these tables.
Stemma can ingest the majority of this metadata automatically (see What Stemma Enables You To Do), leaving you and your users to curate only certain important items such as ownership, table descriptions, and table status. To glean tribal knowledge, it may also help to do a “docs jam session” with a group of data producers and consumers. (You might even offer a reward such as a gift card for those who enter the most documentation!)
Phase 3: Alpha launch to 5-20 alpha users.
From the persona you decided on earlier, choose highly vocal users and those with the most tribal knowledge. These users will become Stemma's avid supporters when you launch to a broader audience. (Note: Make sure these alpha users have the access they need to do work in Stemma (see SSO integration under What We Need From You) and that Stemma user names and authorities map properly to users of integrated applications such as Snowflake.)
Incorporate feedback and iterate. Look particularly for feedback that indicates opportunities for productive metadata propagation and cross-referencing (for example, someone might say, “Oh, we already have this metadata in this spreadsheet; we should pull that in here, too.”).
Tasks your alpha users should become comfortable doing include:
Search – Search for tables and dashboards using a Page-Rank style search
Phase 4: Beta launch to all users of the prioritized persona.
Focus your beta launch on your chosen persona (data consumers, for example). Don’t blur the focus of your launch by opening it up to all personas. Those other personas will come on board at GA, after your alpha and beta users have thoroughly learned and exercised the product and can serve as its advocates and in-house experts.
Best practices for achieving the greatest adoption include:
Update Slack channel headers where people ask each other questions. Stemma’s Slack integration links these conversations to the catalog automatically.
Embed Stemma training in new hire training. Tagging data sets by domain (marketing, growth, etc.) can help new hires quickly onboard to their domains. Showcase Stemma as an entry point into your existing training, for example by having all technical new hires instrument a metric during onboarding.
Create linkages with other products. Create links between various data tools. For example, auto-populate a link between an Airflow DAG that populates a table and the table page in Stemma, and vice versa (see Sharing additional metadata). Another productive linkage is between the table page in the data catalog and the code that is used to generate the table.
Showcase the catalog at a group or company meeting. Deliver a short five-minute demo at a meeting that includes your target users. Educate, answer questions, and thank your alpha users; this creates more awareness and provides an opportunity for everyone to learn.
Step 4. Measure success
1. Adoption
Track WAUs (Weekly Active Users).
Start with Weekly Active Users (WAUs) rather than Daily Active Users or Monthly Active Users. Common usage frequency is weekly, not daily or monthly.
Target Penetration rate: 80%. A great penetration rate is 80% of WAUs within your target persona.
2. Customer Satisfaction (CSAT)
Measure CSAT periodically.
Send out a survey every three to six months asking users to rate their experience with Stemma.
Other metrics that organizations often consider include documentation quality, search quality, etc. But it’s best to stick to the above core metrics at the outset. As your Stemma implementation matures and ingests more metadata over time, you can begin to broaden your criteria.
Migrate from existing Amundsen
Migrating from Amundsen involves three main steps:
1 – Migrating existing metadata from Amundsen to Stemma
In this step, you take all the existing metadata from Amundsen and load it into Stemma. The metadata imported includes, but is not limited to:
Table and column descriptions
Tags
Ownership information
To import this metadata, upload a dump of your Amundsen neo4j metadata into Stemma:
Bring the Amundsen storage offline.
Use dump -> load, available on all neo4j installations, to dump and upload your data.
After import, your Stemma instance will be up to date with your Amundsen instance. However, Stemma has a significant number of additional features so the layout will appear different. Check out the quick start page here:
Multi-step column-level lineage on the Lineage tab of table details in Stemma
But your data won’t be periodically updated until you enable periodic updates; see the next step.
2 – Ingesting metadata continuously
You need to configure Stemma to ingest metadata updates on an ongoing basis. Once this configuration is done, you won’t need Python databuilder jobs to ingest data.
Information ingested in this step includes:
Table and column names
Linked issues (JIRA tickets)
Frequent users
Lineage, if it exists
In Stemma's Admin interface, provide credentials to your data sources to allow Stemma access to extract metadata on an ongoing basis:
Stemma will now do the work to “upsert” metadata updates, as they occur, into the data you uploaded from Amundsen in the previous step.
3 – Cut Over
At this point, Amundsen and Stemma are running side-by-side, and Stemma recommends sharing access to Stemma with your power users and getting their feedback to ensure all of their use-cases are supported. Usually one week of overlap to obtain feedback and an additional week to incorporate feedback, if applicable, is sufficient.
When all looks good, simply redirect the URL of your internal Amundsen to the Stemma URL.
If you have any feedback or questions about migrating from Amundsen to Stemma, please reach out to [email protected]!
Using Stemma
Here you’ll find instructions on how to use all the features in Stemma.
Messaging Users
Stemma provides a quick and easy way to send messages to specific groups of users on the basis of their relationship to a table (e.g. are they an owner in Stemma, have they queried the table in the prior three months, etc.). If you need to announce a future change, for example, or notify users when significant events occur, such as when a table has been deprecated, you can use Stemma messaging to ensure that the affected users will receive the notification.
You can compose a custom message to users and teams related to a table by clicking the Message button in the top-right corner of the Table detail page.
There are three types of notifications you can send in Stemma, depending on the receivers and your personal preferences. If a member of a related group is a user, they can receive either email notifications or Slack private messages. If a member of a related group is a team (Slack channel), they will receive a message in Slack.
Slack notifications
After you have completed our Slack Integration, you can send and receive messages from the Stemma bot.
Messages in channels
If a member of a related group is a team, they will receive a message from the Stemma bot in the corresponding Slack channel.
📘 Note
Make sure you add the bot to all channels that need to receive messages.
Slack private messages
You can choose to receive notifications from Stemma as Slack private messages. To receive messages from the bot, you need to add the Stemma application to your Slack configuration.
To modify your preferred notification method, contact your Stemma representative.
User groups
Each group contains a list of users and teams (Slack channels). In Stemma there are four different types of user groups related to a table:
Stemma will only show the options that apply to the given asset. For example, if a user is sending a custom message for a table that does not have any downstream lineage there will not be an option for Owners of derived tables.
Using Tableau with Stemma
Stemma works with Tableau to help answer common questions data analysts have, such as:
Stemma indexes Tableau dashboards along with their underlying tables, providing a one-stop search for all data assets. Stemma also tracks how many times a dashboard has been viewed, instantly highlighting which is most commonly used.
Is the right data being used?
Stemma automatically correlates your Tableau dashboards to the corresponding table asset by reconciling the table names extracted from the Tableau Metadata API with the existing table metadata Stemma has captured from your warehouse. This allows you to start your exploration by finding trustworthy reports and, in turn, using the lineage to discover which tables are useful to you.
You can also explore the dashboard’s lineage upstream to get an end-to-end view of where the data came from.
How can I use the metrics in this dashboard?
Tableau provides Stemma with the SQL that it creates to generate each sheet, and Stemma indexes this SQL so you can easily find the code. You can copy and paste the code into your IDE to jump-start the SQL exploration.
Best Practices for Data Ownership
👍 The three types of ownership that work well
It works well to break down ownership of the most important data into three distinct roles:
Delivery owner — Ensures that this particular data gets delivered on time, on a promised delivery SLA. Usually, this is a data engineer or analytics engineer responsible for developing and maintaining the pipeline that produces this data.
Domain owner — What does this particular value in a field (or column) mean? When does this particular event get triggered? Usually, this is a product engineer who created the event or the analysts and data scientists who use this data most frequently and understand the physical reality that the data represents.
Policy owner — Ensures that this data gets used according to the classification and terms associated with it. Sometimes you acquire data from a source that should not be used for a certain category of use-cases. For example, YouTube is permitted to show ads to kids, but not permitted to personalize them. Therefore, the personalization data can’t be used if the subject is a child. The person making these calls is usually not an engineer or data scientist, but someone on the Policy or Privacy team at the company.
👎 Types of ownership that often don’t work as well
Overall Quality owner: In practice it’s often hard to find people able and willing to take ownership of the end-to-end quality of a data set. This is because data engineers don’t consider themselves owners of the data that’s produced from the upstream application and don’t want to be responsible for hunting down a website bug that impacts the data in the warehouse. Product engineers, on the other hand, don’t have enough context about how data gets joined and transformed downstream to own the final derived data artifact. This may change as decentralized data management (or data mesh) is deployed more broadly.
Shared ownership: This often does not work in practice, though it can work well if the owner group shares a good understanding of the different types of ownership and the group can efficiently redirect each question to the right person.
🏥 The data engineer’s role
In practice, the data engineer plays the role that a triage nurse would play in an Emergency Room. When an issue arises (like a patient arriving in the ER), the data engineer triages to see what's going on. Sometimes it's a problem the engineer can fix, so they fix it and resolve the issue (like the nurse treating an injury on the spot). In other cases, the engineer redirects the issue to the appropriate owner (as in a referral to a doctor or other health practitioner). In short, the issue determines who the owner should be.
Using Stemma
Use Stemma to assign and document ownership. Divide ownership as follows:
Stemma’s advanced search helps you quickly narrow your criteria and find the data assets you want to see:
Search bar: The search bar allows you to enter free-form text to search across nearly every piece of metadata in your entire data ecosystem. This includes table names, descriptions, column names, column descriptions and more.
Filtering: You can use the following powerful filters, separately or together, to narrow your search to find the assets you need:
Resource filtering: allows you to choose what type of resource to return in the search results; options are Tables, Dashboards, and Users. By default, Tables and Dashboards are selected.
Asset status filtering: filters results to only include the relevant statuses. The options are Certified, No Status, Deprecated, and Intermediate. The default options are Certified and No Status.
Table filters: these apply only to the Tables results. If you have another resource type besides Tables selected in your resource filter, the values used in the table filters will not affect those other results. Table filters allow you to filter results by the data source, database, schema and table name. You can also filter by column; this will return any table that contains that column name.
Dashboard filters: these apply only to the Dashboard resource results. If you have another resource type besides Dashboards selected in your resource filter, the values used in the dashboard filters will not affect those other results. Dashboard filters can be used on the product, folder, or dashboard name.
Table & Dashboard filters: These can be applied to both Tables and Dashboards resource types. You can filter both Tables and Dashboards by the tag that is associated with them.
Advanced filtering
By default, all filters look for an exact match for the value you provide. But the Table, Dashboard and Table & Dashboard filters allow for advanced wildcard searches.
To use a wildcard search, you can place one or more asterisks * in the filter box. Here are a few examples:
Filter to include only schemas that start with marketing_
Filter to only include schemas that end in _prod
Filter to only include schemas that have _user_ in them
You can also combine multiple filters together!
For example, you might want to find all tables in the marketing schema that begin with marketing and end with raw, but only tables that include the column order_id.
Understanding the Results
Stemma's advanced search ranks your search results based on their actual usage. This means the more users run queries on a table, the higher that table is likely to rank in your search results. In addition, the result list shows you additional reasons why your assets were returned.
This example shows the search results for the search term “orders”:
Points to note:
Any words in the table or dashboard description that match the search term are highlighted
Any column names that have an exact match or a similar match are listed to the right
Any column description that matches the search term is displayed on the right
Assigning Ownership of Tables
You can assign ownership of a single table to individuals or teams. You can also assign owners to a set of tables in bulk, adding all the owners in a single transaction.
Note: You can assign owners to tables in Stemma, but not to dashboards. Stemma ingests ownership of dashboards from the underlying BI tool(s) and you need to make updates and modifications in those tools.
The importance of ownership
Ownership is one of the fundamental pillars of a data catalog. It defines responsibility by individual users and teams, adding tremendous value to the data in the catalog, including:
Delegation of duties: Which users are responsible for keeping this table properly documented?
Compliance with processes and procedures: Has the right team updated the documentation for this table before the newest pull request is merged?
Reporting and analytics: Which teams have the highest percentage of their assets documented? The lowest?
You can assign ownership of a table either to individuals or teams:
Individuals are the people in your company who use Stemma. You can find individuals by their email address or a Slack handle.
Teams are represented in Stemma by Slack channels.
To assign one or more owners to an individual table, choose the table in the Stemma UI and click on Add Owner at the bottom of the left panel.
But Stemma also allows you to bulk-assign table ownership across the catalog. You can do this in a single transaction, immediately allowing your users to identify the assets that belong to them.
Assigning owners in bulk
You may think about ownership as a hierarchy, or your teams may typically own all of the tables within a single database or schema, or perhaps your analysts use Dbt to create their own workflows and create tables with specific naming conventions (e.g. marketing_sales, marketing_prospects). Regardless of how your company thinks about grouping assets, Stemma can help you to bulk-assign owners, using advanced search to find any of the assets you want to update and assigning the owners with a single command. To assign ownership of tables in bulk, proceed as follows:
From the home page of the Stemma UI, navigate to Advanced Search.
At the top of the left panel, make sure only the Tables resource is checked (you can’t bulk-assign owners to dashboards).
Optionally use the filters in the left panel to limit the search results to the set of tables to which you want to bulk assign ownership:
Stemma suggests you limit your search results using only the filters on the left; while you can also use the search bar at the top, doing so makes it more likely that tables you don't want included will slip into the results.
Click on Bulk Edit at the bottom right of the page and choose Add Owners.
In the dialog box, enter or select an email address or Slack channel and choose Add.
To add more owners, repeat the previous step for each new owner.
When you are finished, click Add Owners and then Done.
Stemma automatically pulls in the teams you’ve defined as well as the list of users in your organization:
Note: When you assign ownership in bulk, this does not trigger automatic messages to the owners and users; that only happens when you assign ownership individually.
Displaying Data Lineage in the Stemma UI
Navigating the Lineage Graph
Column filtering
From the root node, search for columns of interest using the search bar in the right hand node panel
Click on columns that have a filter icon to prune the graph to only column-level relationships for the selected column
To examine columns in a related node, select that node
Below the search bar in the node panel you can select between “Filtered” and “All columns” to view the appropriate list of columns for the selected node
To filter by a different column, simply select the root node and then select any other column that has a filter icon
To remove the filter, simply select the root node and then re-select the filtered column
Focus mode
Select any node in the graph to initiate focus mode and make it easier to see tables and dashboards related to the selected node and the root node
When a collapsed node is part of the graph, it will automatically show the relevant table or dashboard with an expand icon
Click the expand icon to break the table or dashboard out of its collapsed group
Dashboards
When the selected node is a dashboard, the right-hand node panel will show charts and queries instead of columns
If the dashboard is the root of the graph, any chart with lineage will display the filter icon, which can be clicked to show only upstream data sources for that chart
Finding Tabular Lineage for Columns
End-to-end lineage
Navigate to the details page for the table
Click on the Lineage tab near the top of the page
In the Filter by column dropdown menu near the top of the Upstream or Downstream section, select the column with lineage that you are interested in
The table will now show related tables and columns alongside their distance from the current table
Simple
Navigate to the details page for the table.
Click on the About tab near the top of the page.
Click on the up-arrow next to the column whose lineage you want to see
The row will expand to show column details, including the number of immediate upstream and downstream relatives of the column
Click the upstream or downstream links to see a pop-up table of the related columns
Evaluate data – For example, if you are not seeing the results you expected from a given table or dashboard, the problem may be one of timing: use the UI to check all upstream tables and make sure they ran when they should. To make this even easier, you can filter the graph to trace a specific column’s lineage.
Plan changes: The interconnected nature of data products, with multiple products often chained together to build upon each other, means that even the smallest change in a pipeline can ripple downstream and disrupt your business. These small but potentially disruptive changes are frequent and necessary:
An analytics engineer builds a table using Dbt; that table becomes popular and several dashboards are built on top of it. It becomes well trusted, and the logic becomes a standard for calculating certain metrics. But the table is missing several important entity IDs (foreign key columns) which makes it difficult to join to other tables. The analytics engineer and a data engineer work together to complete a “micro migration” to a new, more performant and reusable table.
A machine-learning model is used to predict churn for a customer. Each day after the model runs, users that are predicted to churn are identified in a table in Snowflake and then sent an email to re-engage them. The product team is launching the second major product and wants to measure churn on a per-product basis. The engineering team updates their data pipeline to add a new column to the output table that identifies the product that churn is being predicted for.
An analyst created several views to consolidate the common logic that the team were using to query tables in their dashboards. But the performance of the views has deteriorated as the amount of data has grown. The analyst and the engineer work together to create materialized versions and to update all of the dashboards using the prior views.
Use Data Lineage in Stemma to Identify Affected Assets
The challenge with such changes is that it is hard to accurately curate a list of all assets that will be affected. If you cannot be certain of what will need to be changed or what could break, you cannot do risk analysis or even make a time estimate. Stemma enables you to do the analysis and create the appropriate messaging.
The key question for this type of change is “What tables and dashboards will I need to update if I rename this view?”
Stemma provides a lineage graph for each table and dashboard. From here you can see who and what is utilizing your resources downstream, and how popular your assets are. Data owners can use this in conjunction with their own knowledge to build roadmaps and plan how data should evolve.
Use Stemma Messaging to Communicate Expected Changes
Everyone affected needs to know what is going to change and when. After building your list of affected assets and their users, use Stemma messaging to provide advance notice and instructions.
How Stemma Discovers Data Lineage
When you execute SQL against your warehouse, the database captures the raw SQL and saves it in a historical format that includes the time the SQL command was executed and the user who executed it. Stemma reads these SQL logs and parses all of the commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page.
In the case of table and column lineage, any change to a table (via an insert, update, or delete from another table) will create the appropriate table and column lineage tracking.
Note
When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.
Along with a direct connection to various data sources, Stemma can ingest data using extracts in various file formats. One of the most commonly used file formats is CSV.
Below are some of the CSV formats Stemma ingests today (along with sample files), with the column names and values we expect in each file. Note that you need the column header literals (for example, db_name) as well as the values.
Github Links
Used to provide a link to a Github source from Stemma’s Table detail page.
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
source: the Github URL for the repository.
source_type: static value, “github”.
Airflow
Used to generate a link between a table and an Airflow task. Once the data is ingested, you will see an “Airflow” button on the Stemma detail page for the table.
task_id: the Airflow ID for the task.
dag_id: the Airflow ID for the DAG.
exec_date: the most recent execution timestamp for the task.
application_url_template: the URL for the DAG in your Airflow instance.
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
cluster: the name of the database within the data source.
Dagster
Used to link a Dagster op and/or job to a table in Stemma. Once the data is ingested, you will see a “Dagster” button on the Stemma detail page for the table.
application_url_template: the URL for the DAG in your Dagster instance.
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
Table Tags
Stemma can create and assign tags via a CSV file, with the following format:
| db_name | cluster | schema | table_name | tags |
| --- | --- | --- | --- | --- |
| hive | gold | test_schema | test_table1 | "tag1,tag2" |
| hive | gold | test_schema_2 | test_table2 | "tag1,tag3" |
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
tags: the tags to be created. Tags are additive: existing tags on tables will not be modified.
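As a concrete example, the short sketch below uses Python's built-in csv module to produce a tags file in this format, header literals included (the rows reuse the example values from the table above). The same approach works for the other CSV formats below; only the field names change.
import csv

# Example rows only; replace with your own tables and tags.
rows = [
    {"db_name": "hive", "cluster": "gold", "schema": "test_schema",
     "table_name": "test_table1", "tags": "tag1,tag2"},
    {"db_name": "hive", "cluster": "gold", "schema": "test_schema_2",
     "table_name": "test_table2", "tags": "tag1,tag3"},
]

with open("table_tags.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["db_name", "cluster", "schema", "table_name", "tags"])
    writer.writeheader()  # the column header literals are required
    writer.writerows(rows)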
Table Description
| db_name | cluster | schema | table_name | description |
| --- | --- | --- | --- | --- |
| hive | gold | test_schema | test_table1 | Description for table, optionally markdown. |
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
description: the table description. Note that descriptions are destructive: if a description already exists, it will be overwritten.
Table Owners
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
owners: email addresses or Slack channels. Owners are additive: existing owners of tables will not be modified.
Column Descriptions
| db_name | cluster | schema | table_name | col_name | description |
| --- | --- | --- | --- | --- | --- |
| hive | gold | test_schema | test_table | test_column | This is an example column description. |
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
col_name: the column described.
description: the column description. Note that column descriptions are destructive: if an existing description is present, it will be overwritten.
Finding Important Information about Tables and Columns
Stemma provides out-of-the-box integration with applications such as Snowflake, dbt, and BI tools such as Looker, Mode and Tableau to capture rich metadata from each of these systems automatically. For example, in the case of Snowflake, Stemma reads both the information_schema and the query logs, and automatically generates important information such as when a table was last updated, column-to-column lineage, table-to-table lineage, table-to-dashboard lineage, frequent queriers of the table, common join and filter conditions, etc.
The Stemma UI makes it easy for you to discover this important information about your tables and columns, and also provides a convenient way to add and modify curated information.
The table details page provides the following information:
About
Tags
Descriptive tags for cross-referencing, such as covid, cases, etc.
Asset status
This can be Certified, Deprecated, Intermediate, or No Status
Owners
The person or people responsible for maintaining this table
Click on an owner to see other assets owned by that user, as well as their bookmarks and frequently used tables. You can also email the user via this profile page.
Last Updated
The date and time of the last update to the table (this can be different from the most recent date covered by the data itself)
Date Range
The date range covered by the data in the table.
Columns
Descriptions are automatically ingested from a source (like dbt or Snowflake). If there is no description to ingest, Stemma will automatically use column-level lineage to search upstream for a description in an upstream column with the exact same name. This documentation is shown as “Autodescribed.”
Column descriptions can be created, and autodescribed documentation overwritten, within Stemma. Also available in this table are the data type, related glossary terms, and badges.
Lineage
dbt Compiled Query
For dbt users, see the query that formed the table, as well as the dbt model type.
Upstream, Downstream
Upstream and downstream tables and dashboards, sorted by popularity.
Where possible, Stemma captures the information directly from the source. This table shows which attributes are populated from the source and which you can edit:
| Table Metadata | Automatically captured? | Users can edit? |
| --- | --- | --- |
| Table description | ✅ | ✅ |
| Asset status | ❌ | ✅ |
| Owners | ✅ (dbt) | ✅ |
| Tags | ✅ (dbt) | ✅ |
| Jira issues | ❌ | ✅ |
| Last updated time | ✅ | ❌ |
| Data date range | ✅ | ❌ |
| Commonly used with tables | ✅ | ❌ |
| Frequent users | ✅ | ❌ |
| Upstream & downstream table lineage | ✅ | ❌ |
| Dashboards that use the table | ✅ | ❌ |
| Slack threads with table mentions | ✅ | ❌ |
Advanced Description Editing
The Stemma UI provides a WYSIWYG editor for descriptions of tables and columns.
Using the editor
To begin editing, navigate to the table or column whose description you want to edit, and click on the pencil icon next to Description:
This opens up the text box for editing:
Add or change the description.
Descriptions can include links and images as well as text.
You can choose font styles as indicated by the buttons.
Use the pull-down menu to toggle the text format from the default (Normal) to common formatting elements such as headings, lists, quotes, and code blocks.
When you choose Code Block, you can choose the type of code from the secondary pull-down menu (the default is JavaScript).
You can add hyperlinks: enter and highlight the text to display, then click the 🔗 button and enter the URL (and edit it if necessary) in the text box next to the button.
The editor also supports Markdown syntax, as well as any shortcuts your system supports (such as ⌘-b for bold on a Mac)
When you are finished, click Update to save your changes.
Linking to other tables from the description
This is useful, for example, if you want to give other users more information about the relationship between the tables.
To link to another table from the description, do the following:
In the Stemma UI, navigate to the table whose description you want to edit.
Click on the pencil icon next to Description.
Enter the new description, providing the related table's <schema_name>.<table_name> anywhere in the text.
Click Update to save your changes.
The new description will contain the link to the related table:
Sharing column descriptions
You can share a column description with all columns of the same name across the catalog. To do this:
In the Stemma UI, navigate to the table whose column description you want to share.
To add or modify a description, click on the down-arrow to the left of the entry for the column:
Hover over the word Description and click on the pen that appears:
Add or update the description and click on Update. Stemma automatically discovers all columns with the same name, and the description propagates to all linked columns. To see which columns are linked, and add and remove links, click on the three dots to the right of the column entry:
Click on the Edit linked columns pop-up. In the dialog window, you can search for and add columns to link, and remove existing links.
The Stemma UI makes it easy for you to discover this important information about your dashboards, and provides a convenient way to add and modify curated information.
Status
This can be Certified, Deprecated, Intermediate, or No Status
Description
A brief summary of the contents of the dashboard:
Owners
The person or people responsible for maintaining this dashboard:
This information is extracted automatically from the source and (unlike table ownership) can’t be edited in Stemma.
Tags
Descriptive tags for cross-referencing
Contents
Total Views
Total number of views over the life of the dashboard
Created
The date and time the dashboard was created
Last Updated
Date and time of the last update to the dashboard
Last successful run
Date and time the dashboard was last run successfully
Last run
Date and time the dashboard was last run (successfully or not)
Tables
Tables used by the dashboard
Charts
Charts available from the dashboard
Queries
Queries run from the dashboard. Click on “View Query” to see the code and copy it to your clipboard.
Open folder
Open the dashboard’s folder in the underlying application (you may need to log in to the application)
Open dashboard
Open the dashboard in the underlying application (you may need to log in to the application)
Using Admin Functions and Reports
API Keys
Allows you to add and delete keys for the GraphQL API.
Reports
These reports provide an account of your organization's interactions with the Stemma catalog, including curating work that needs to be done, as well as usage. They can be especially useful when you are rolling out Stemma to your users and scoping the work that remains to be done.
Customer Summary: Shows the number of data sources, tables, and dashboards that are integrated into the Stemma catalog.
Monthly product usage: Shows the number of users and searches by month.
Documentation summary: Shows the percentage of columns, dashboards, and tables missing documentation and/or ownership.
Assets without docs: Shows the specific tables and dashboards that lack documentation (click on the drop-down to switch between the two types of asset).
Assets without ownership: Shows the specific tables and dashboards that lack ownership (click on the drop-down to switch between the two types of asset).
How Stemma Captures Data: SQL Parsing Overview and Caveats
What Stemma captures
When you execute SQL against your warehouse, the database captures the raw SQL that was executed and saves it in a historical format that includes the time that the SQL command was executed as well as the user who executed it. Stemma reads these SQL logs, along with access history, and parses all of the SQL commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page:
Table and column lineage: any change to a table (via an insert, update, or delete from another table) will create the appropriate table and column lineage tracking.
User table usage: Stemma uses select statements to associate users to the table(s) being queried; this provides the Frequent Users information on the Table Details page, search rankings, and the user’s profile page.
Commonly Used With: any joins that occur within the SQL, whether on the top-level query or as part of a sub-query, are captured and presented on the Table Details page to help other users understand how two or more tables are commonly joined.
Caveats:
When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.
When user IDs in the database do not match the user portion of the user’s email address as recorded in Stemma, Stemma cannot associate the Stemma user to the user in the query logs. This prevents Stemma from creating any user-table usage information.
When system accounts are used to query the data, Stemma is often not able to see which user is executing the query. This is a common practice with BI tools when multiple team members share the same connection to a database and that connection uses a system account.
However, if you are using Mode, Stemma can parse the comments appended by Mode (which do contain the user ID) to find the user and associate that user to the query.
Commonly Used With does not capture joins where the left or right table is a dynamically generated select statement, for example: select * from table_a a join (select * from table_b b1 join table_c c1 on b1.id = c1.id) b on a.id = b.id.
Planning Changes to Tables or Columns
When you are planning to make a backward-incompatible change, you need to proceed in a way that minimizes the impact and gives downstream stakeholders as much warning as possible. Examples of backward-incompatible changes are:
Changing a schema (type changed, name changed)
Changing values (data fields will no longer be populated, or will be populated differently)
You need to understand and communicate the impact before you make such a change. This is usually done in the following steps:
Understand the impact: understand who uses or queries this data.
Provide heads-up notification: notify downstream owners and users that this data is going to change, and that they need to update their assets (downstream tables or dashboards) accordingly. Provide a time window after which the change will go into effect.
(Ideally) Understand the impact again: see whether downstream users have changed their assets, how many still need to be nudged, and so on.
Make the actual change and follow up to make sure that it’s complete.
Stemma can help you make the change as painless as possible. Use the UI to assess the effect on downstream users and on other assets. The following sections provide help:
Now warn all owners and frequent users about the change, and give them a deadline to make their adjustments.
Note
If you decide to change the name of a table to indicate that it’s deprecated (e.g., to something like deprecated_tablename), that will cause it, by default, to be removed from Stemma after ten days.
Using the External GraphQL API
Creating and Installing an API Key for the External GraphQL API
1. Create an API key in the Stemma UI:
a. Navigate to the Admin panel and choose Add API Key:
b. In the pop-up window, enter a name for your API key and click Submit. You can choose any name, but Stemma recommends using a name that describes the intended use of the key (e.g., github action integration):
Stemma generates an API key that appears in the window. Save this key somewhere secure. It is not recoverable.
Now click Docs in the upper right of the Altair UI; you should see the GraphQL schema on the right-hand side:
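Once your key is installed, calls to the API are ordinary GraphQL-over-HTTP requests. The sketch below uses the requests library; the endpoint path and the authorization header name are placeholders and assumptions, so confirm the exact values for your instance (for example, from the Altair UI) before using them:
import requests

# Both of these are placeholders; confirm the real endpoint and header for your instance.
ENDPOINT = "https://{SUBDOMAIN}.stemma.ai/<graphql-endpoint>"
API_KEY = "<YOUR_API_KEY>"

# A generic introspection query, useful only to confirm that authentication works.
query = """
{
  __schema {
    queryType {
      name
    }
  }
}
"""

response = requests.post(
    ENDPOINT,
    json={"query": query},
    headers={"Authorization": f"Bearer {API_KEY}"},  # header name is an assumption
)
response.raise_for_status()
print(response.json())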
Getting Started with the Stemma Business Glossary
What Is a Business Glossary?
A business glossary is a repository of business terms and their definitions that an organization uses to ensure the same definitions are used company-wide when analyzing data. It allows data analysts, business users, and data engineers to agree on the definition of real-world terms and how those are represented in the data, and serves as the source of truth for those definitions.
Used this way, the glossary produces a common vocabulary that everyone in the organization uses. This common vocabulary is a key component of data governance, ensuring that everyone means the same thing when they use the same term. For instance, one department may define the term “customer” as a company while another department may use the term to refer to an individual. A business glossary prevents such discrepancies.
In short, a business glossary provides canonical definitions of common terms and fosters shared understanding of those terms throughout the organization.
The Stemma Business Glossary
This first version of the Stemma Business Glossary allows you to define and group terms that are used in your organization and assign owners to those terms so that the definitions remain consistent and up to date. These terms can be specific to the data sources in the Stemma catalog, or they can be terms that are used throughout the organization or any part of it.
You can also link those terms to table columns in the Stemma catalog.
In future, you will also be able to link terms to other resources in the catalog, and to queries and search results.
Using the Glossary
Populating the Glossary
To begin populating the glossary, proceed as follows.
From the navigation bar of the Stemma UI, choose Glossary:
Choose + New Group or + New Term.
In the dialog window, add a name and a definition or description.
You can use Markdown syntax for the description or definition.
A group is a collection of related terms. For example, you might want to group terms by department or function: finance, marketing, etc.
You can add a term to an existing group, or create a new term without assigning it to a group. (You can’t create a group on the fly while adding or editing a term.)
You can add related columns and one or more owners: start typing to find candidates.
When you are finished, click Save.
Updating the Glossary
To update the glossary:
From the navigation bar of the Stemma UI, choose Glossary.
To modify a group, click on the group and then click Edit. You can change the group’s attributes and also add new terms to it.
To modify a term, search for it by name, or click the > next to the group and select the term from the list; then either:
Click Edit and make your changes to any or all attributes of the term. You can change the group the term belongs to, so long as the new group already exists.
OR
Click Delete to remove the term from the glossary.
When you are finished, click Save.
Frequently Asked Questions (FAQs)
What Information Does Stemma Extract Automatically?
The table below shows what information you can expect to see in Stemma after you have integrated all of your applications.
| Source | Metadata | User Usage | Lineage | Query Usage | Column Stats |
|---|---|---|---|---|---|
| Snowflake | ✅ | ✅ | ✅ | ✅ | ✅ |
| Redshift | ✅ | ❌ | ❌ | ❌ | ❌ |
| BigQuery | ✅ | ✅ | ✅ | ❌ | ❌ |
| Postgres | ✅ | ❌ | ❌ | ❌ | ✅ |
| Athena | ✅ | ❌ | ✅ | ❌ | ❌ |
| Delta Lake | ✅ | ❌ | ❌ | ❌ | ❌ |
Metadata: tables and columns, descriptions. Where available, the time that the table was last updated. See also Sharing Additional Metadata.
Usage: which users are frequently running queries on a given table.
Lineage: how data moves between tables, views, and dashboards. See also Tracking Data Lineage.
Column stats: statistics (mean, min, max, null count, etc.) for columns in a table.
How Does Stemma Discover Data Lineage?
When you execute SQL against your warehouse, the database captures the raw SQL and saves it in a historical format that includes the time the SQL command was executed and the user who executed it. Stemma reads these SQL logs and parses all of the commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page.
In the case of table and column lineage, any change to a table (via an insert, update, or delete from another table) will create the appropriate table and column lineage tracking.
Note
When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.
Apart from Administrator (see What Can an Administrator Do?), Stemma does not at present support a variety of user roles. This means that every user has edit capability; this fosters adoption and comprehensive use, allowing the broad community to get familiar with Stemma and learn its full range of capabilities.
As our user base becomes more expert with Stemma and starts to specialize, we expect to introduce user roles.
What Can an Administrator Do?
To become a Stemma administrator, or nominate one, contact Stemma support ([email protected]).
How Can I Tell What Stemma Automates and What I Need to Curate?
Automated documentation
Stemma provides out-of-the-box integration with applications such as Snowflake, dbt, and BI tools such as Looker, Mode and Tableau to capture rich metadata from each of these systems automatically. For example, in the case of Snowflake, Stemma reads both the information_schema and the query logs, and automatically generates important information such as when a table was last updated, table-to-table lineage, table-to-dashboard lineage, frequent queriers of the table, common join and filter conditions, etc.
The Stemma UI makes it easy for you to discover this important information about your tables and columns, and also provides a convenient way to add and modify curated information.
Stemma also allows linking of existing Slack conversations about data to the data catalog.
Curated documentation
Items you need to curate include editable descriptions, tags, and owners. All of these can also be modified by APIs. APIs can also ingest custom metadata.
In the Stemma UI, items that you can curate will show a pencil when you hover your cursor over them; for example:
How Do I Update an Existing Connection?
Updating a Connection
At present you need to add a new connection and delete the old one. You need Stemma’s help to do this.
First add a new connection as follows:
In the Stemma UI, choose Admin from the toolbar at the top.
Click the button for Add a New Connection.
Do not try to edit the existing connection.
In the pop-up window, choose the connection type and provide the information as prompted. See What We Need From You.
Click Submit.
This creates a new connection that is not yet active.
Now, contact Stemma support ([email protected]) with a request to activate the new connection. Stemma will let you know when the connection is active.
Once you have verified that the new connection is active and working properly, you can delete the old connection:
In the Stemma UI, choose Admin from the toolbar at the top.
Click the Delete button next to the connection you want to delete, and in the pop-up window confirm that you want to delete it.
How Do I Link a JIRA to Stemma?
To link a JIRA issue to a table in the Stemma catalog, follow these instructions.
What Effect Does Deprecating a Table Have?
You can set the status of a table or dashboard as deprecated or intermediate. When you do so, that asset remains in the Stemma catalog, but is removed from default searches (you can change this by means of filters in Advanced Search).
When you deprecate an asset or change its status to intermediate, you should modify its description accordingly, and you should also be careful to message all users and owners of the asset (including users who might be using the asset indirectly, such as dashboard users of a table) to warn them about the change in status. This allows you to make sure you hear from all stakeholders before you remove or move the asset.
How Does Stemma Calculate "Top 5 Columns", "Frequent Users", etc?
The table details page in the Stemma UI provides usage information, derived as follows:
To calculate the top five upstream and downstream columns, Stemma multiplies the number of unique users by the number of queries to get the ranking. The time frame is 30 days.
To identify tables commonly used with this one, Stemma counts the joins that were part of any query in the past 90 days.
To identify frequent users of a table, Stemma counts any user who has run a query on this table in the last 30 days.
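As a worked illustration of the Top 5 ranking described above (unique users multiplied by query count over the 30-day window), here is a small sketch; the column names and counts are made up:

```python
# Hypothetical 30-day usage counts for three columns of an "orders" table.
usage = {
    "orders.order_id":    {"unique_users": 12, "queries": 300},
    "orders.customer_id": {"unique_users": 7,  "queries": 450},
    "orders.discount":    {"unique_users": 2,  "queries": 40},
}

# Rank = unique users multiplied by number of queries, per the description above.
ranked = sorted(
    usage.items(),
    key=lambda item: item[1]["unique_users"] * item[1]["queries"],
    reverse=True,
)

for column, stats in ranked:
    print(column, stats["unique_users"] * stats["queries"])
# orders.order_id 3600, orders.customer_id 3150, orders.discount 80
```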
On the Azure Active Directory page, click Enterprise applications.
Click New application on the top menu.
Click Create your own application.
Enter a name for your application; we suggest “Stemma Data Catalog”.
Select Register an application to integrate with Azure AD (App you’re developing).
Click Create. This should redirect you to Register an application.
Under Supported account types, select Accounts in this organizational directory only (Single tenant).
Under Redirect URI (optional), select Web. If you are running a hosted version of Stemma, the redirect URI should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name. In case of on-premise deployment, please consult your Stemma representative for this.
Click Register.
Create a client secret
To create a client secret, follow these steps:
Go to Azure Active Directory and click App registrations. You must go to the top level in your Azure portal and then select Azure Active Directory. You can then select App registrations.
Select your newly added application. Make a note of the Application ID.
Click Certificates & secrets on the left-hand side and click New client secret.
Provide a description and an expiration length that follows your security organization’s guidelines.
Click Add.
Copy the Value corresponding to the newly-created client secret and client secret expiration. This is the Relying Party OAuth Client Secret. This secret value is never displayed again after you leave this page.
Set up application permissions
You need to set up two application permissions in your Azure portal so that you can use all of the Azure AD synchronization options in Stemma. You need to set up the following permissions:
Azure Active Directory Graph Directory.Read.All
Microsoft Graph Directory.Read.All
To set up permissions, do as follows:
Go to Azure Active Directory and click App registrations. You must go to the top level in your Azure portal and then select Azure Active Directory. You can then select App registrations.
Select your newly added application.
Click API permissions on the left-hand side and click Add a permission.
Click APIs my organization uses and click Windows Azure Active Directory.
Select the Azure Active Directory Graph > Directory.Read.All permission. a. Click Application permissions. b. Under Directory, click Directory.Read.All. c. Click Add permissions.
Select the Microsoft Graph > Directory.Read.All permission. a. Click Application permissions. b. Under Request API Permission, click Microsoft Graph. c. Under What type of permissions does your application require?, click Application permissions. d. Under Directory, click Directory.Read.All. e. Click Add permissions.
Click Grant admin consent for and then click Yes. You should see a message saying that you’ve granted consent for each of your permissions.
What We Need From You
Provide Stemma with the following information by email or Slack:
Application ID
Client Secret
Discovery Document Endpoint (Well Known OpenID Configurations)
Tenant ID – The ID of the tenant you created the application in.
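Before sending these values over, you may want to confirm that the Application ID, client secret, and tenant ID work together. The sketch below is a minimal check using the standard Microsoft identity platform client-credentials token endpoint; the placeholder values are assumptions you must replace with your own.

```python
import requests

# Placeholders: replace with the values collected above.
TENANT_ID = "<tenant-id>"
CLIENT_ID = "<application-id>"
CLIENT_SECRET = "<client-secret>"

# Standard Microsoft identity platform client-credentials token request.
token_url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
resp = requests.post(
    token_url,
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "https://graph.microsoft.com/.default",
    },
    timeout=30,
)
resp.raise_for_status()
print("Token acquired; the Application ID, secret, and tenant ID look valid.")
```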
Guidelines for a Successful Stemma Rollout
Every Stemma rollout is different, depending on your organization’s particular needs, your data storage and management applications, and the people who use them in various ways and with varying priorities. So the following are guidelines, not rules, but they are based on the accumulated experience of Stemma’s founders and engineers, and are a good place for you to start.
Criteria for Success
The two criteria that matter in measuring the success of a data catalog are
adoption; and
customer satisfaction
If your rollout achieves both of these to a high degree, you can declare victory in the knowledge that you have improved the lives of your users (in particular their productivity, efficiency, and awareness) and are contributing significantly to your organization’s success.
The following broad steps are designed to help you get there.
Steps for Rollout and Adoption
Step 1. Start with a persona and its use-cases
There are many user personas and use cases for a data catalog. Here’s a simplified view of the most common ones. It’s less important which persona you start with than that you choose a specific group of alpha users.
Step 2. Launch in phases
Phase 1: Identify a small set of tables to get alpha user feedback on.
This set can be your organization’s most commonly used tables (often referred to as “core” tables) or one domain within the company, such as marketing, growth or finance, etc. Core tables are often the best choice, because they have the most impact and there’s often a central data team which is responsible for maintaining them.
Phase 2: Populate MVP metadata on these tables.
Stemma can ingest the majority of this metadata automatically (see What Stemma Enables You To Do), leaving you and your users to curate only certain important items such as ownership, table descriptions, and table status. To glean tribal knowledge, it may also help to run a “docs jam session” with a group of data producers and consumers. (You might even offer a reward, such as a gift card, for those who enter the most documentation!)
Phase 3: Alpha launch to 5-20 alpha users.
From the persona you decided on earlier, choose highly vocal users and those with the most tribal knowledge. These users will become Stemma’s avid supporters when you launch to a broader audience. (Note: Make sure these alpha users have the access they need to do work in Stemma (see SSO integration under What We Need From You) and that Stemma user names and authorities map properly to users of integrated applications such as Snowflake.)
Incorporate feedback and iterate. Look particularly for feedback that indicates opportunities for productive metadata propagation and cross-referencing (for example, someone might say something like, “Oh, we already have this metadata in this spreadsheet — we should pull that in here, too.” )
Tasks your alpha users should become comfortable doing include:
Search – Search for tables and dashboards using a Page-Rank style search
Phase 4: Beta launch to all users of the prioritized persona.
Focus your beta launch on your chosen persona (data consumers, for example). Don’t blur the focus of your launch by opening it up to all personas. Those other personas will come on board at GA, after your alpha and beta users have thoroughly learned and exercised the product and can serve as its advocates and in-house experts.
Best practices for achieving the greatest adoption include:
Update Slack channel headers where people ask each other questions. Stemma’s Slack integration links these conversations to the catalog automatically.
Embed Stemma training in new hire training. Tagging data sets by domain (marketing, growth, etc.) can help new hires quickly onboard to their domains. Showcase Stemma as an entry point into your existing training, for example by having all technical new hires instrument a metric during onboarding.
Create linkages with other products. Create links between various data tools. For example, auto-populate a link between an Airflow DAG that populates a table and the table page in Stemma (and vice-versa). (See Sharing additional metadata). Another productive linkage is between the table page in the data catalog and a link to the code that is used to generate the table.
Showcase the catalog at a group or company meeting. Deliver a short five-minute demo at a meeting that includes your target users. Educate, answer questions, and thank your alpha users — this creates more awareness and provides an opportunity for everyone to learn.
Step 4. Measure success
1. Adoption
Track WAUs (Weekly Active Users).
Start with Weekly Active Users (WAUs) rather than Daily Active Users or Monthly Active Users. Common usage frequency is weekly, not daily or monthly.
Target Penetration rate: 80%. A great penetration rate is 80% of WAUs within your target persona.
2. Customer Satisfaction (CSAT)
Measure CSAT periodically.
Send out a survey every three to six months asking users to rate their experience with Stemma.
Other metrics that organizations often consider include documentation quality, search quality, etc. But it’s best to stick to the above core metrics at the outset. As your Stemma implementation matures and ingests more metadata over time, you can begin to broaden your criteria.
Can I Choose What Data Stemma Ingests?
If you don’t want Stemma to ingest all the available data in the applications you integrate (for example, if you want to ingest just one schema from a data lake), you can tell Stemma what to ignore: contact Stemma by email ([email protected]) or talk to your support rep.
Messaging Users
Stemma provides a quick and easy way to send messages to specific groups of users on the basis of their relationship to a table (e.g. are they an owner in Stemma, have they queried the table in the prior three months, etc.). If you need to announce a future change, for example, or notify users when significant events occur, such as when a table has been deprecated, you can use Stemma messaging to ensure that the affected users will receive the notification.
You can compose a custom message to users and teams related to a table by clicking the Message button on the top-right corner of the Table detail page
There are three types of notifications you can send in Stemma, depending on the receivers and your personal preferences. If a member of a related group is a user, they can receive either email notifications or Slack private messages. If a member of a related group is a team (Slack channel), they will receive a message in Slack.
Slack notifications
After you have completed our Slack Integration, you can send and receive messages from the Stemma bot.
Messages in channels
If a member of a related group is a team, they will receive a message from the Stemma bot in the corresponding Slack channel.
📘 Note
Make sure you add the bot to all channels that need to receive messages.
Slack private messages
You can choose to receive notifications from Stemma as Slack private messages. To receive messages from the bot, you need to add the Stemma application to your Slack configuration.
To modify your preferred notification method, contact your Stemma representative.
User groups
Each group contains a list of users and teams (Slack channels). In Stemma there are four different types of user groups related to a table:
Stemma will only show the options that apply to the given asset. For example, if a user is sending a custom message for a table that does not have any downstream lineage there will not be an option for Owners of derived tables.
If there's a Curated and an Ingested Description, Which Will I See in Stemma?
If Stemma ingests the description of a table or column (for example, from Snowflake) and you subsequently edit or replace that description in Stemma, Stemma users will see that curated description from then on, even if Stemma subsequently re-ingests the original Snowflake description (or any other version of it).
At present, Stemma does not provide a way for you to see both the Stemma description and the ingested description side by side.
How Do I Add a User?
At present you need to contact Stemma support ([email protected]) with the names of the users you want to add. Stemma will add them for you, and you then need to authenticate them, making sure they have the access they need to do work in Stemma (see the sections on integrating SSO apps listed here) and that Stemma user names and authorities map properly to users of integrated applications such as Snowflake.
What Happens when a Table or Dashboard Is Removed?
Removing a table or dashboard causes it, by default, to be removed from Stemma after ten days. Similarly, if the table or dashboard is renamed or moved, the original entry is removed from Stemma after ten days.
Before removing an asset, you should deprecate it to give stakeholders sufficient warning.
BigQuery Integration
Stemma extracts table and column metadata including database, schema, table name, table description, column name, and column description from a BigQuery database.
What We Need From You
You will need to create a service account for reading metadata and grant it “BigQuery Metadata Viewer” access to all of your datasets. This can all be done via the BigQuery UI.
Next, collect the information Stemma needs, as follows:
Authorization/Access Scopes: Stemma uses the following scopes to extract metadata and usage information:
https://www.googleapis.com/auth/bigquery.readonly
https://www.googleapis.com/auth/cloud-platform
A JSON Key File: Proceed as follows to create the JSON key file for a service account.
In the Service account name field, enter a name. The Cloud Console fills in the Service account ID field based on this name.
In the Service account description field, enter a description. For example, `Service account for Stemma BigQuery Integration`.
Click Create and continue.
Click the Select a role field, and make sure to select the following:
BigQuery Data Viewer (roles/bigquery.dataViewer) – to extract the metadata, watermarks, and usage information. Popular tables will be generated using this information.
BigQuery User (roles/bigquery.user) – needed to map the user’s information with the tables (Frequent Users, Table Owners, etc.).
Private Logs Viewer (roles/logging.privateLogViewer) – needed for enriching table usage.
Click Continue, and click Done to finish creating the service account.
Do not close your browser window. You will use it in the next step.
Create a service account Key:
In the Cloud Console, click the email address for the service account that you created.
Click Keys, click Add key, then click Create new key.
Click Create. A JSON key file is downloaded to your computer.
Click Close.
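Before handing the key file to Stemma, you can optionally confirm locally that it parses and that credentials can be built with the scopes listed earlier. This sketch uses the google-auth library; the file name is a placeholder.

```python
from google.oauth2 import service_account

# Placeholder file name: the JSON key file you just downloaded.
KEY_FILE = "stemma-bigquery-key.json"

SCOPES = [
    "https://www.googleapis.com/auth/bigquery.readonly",
    "https://www.googleapis.com/auth/cloud-platform",
]

# Builds credentials from the key file; raises an error if the file is malformed.
credentials = service_account.Credentials.from_service_account_file(
    KEY_FILE, scopes=SCOPES
)
print("Key file OK; service account:", credentials.service_account_email)
```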
Provide the information to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose Google Cloud as the Connection type, provide a Connection Name, and paste in the Service Account Key from the JSON file you have just downloaded.
When you are finished, click Save Connection.
Migrate from existing Amundsen
Migrating from Amundsen involves three main steps:
1 – Migrating existing metadata from Amundsen to Stemma
In this step, you take all the existing metadata from Amundsen and load it into Stemma. The metadata imported includes, but is not limited to:
Table and column descriptions
Tags
Ownership information
To import this metadata, upload a dump of your Amundsen neo4j metadata into Stemma:
Bring the Amundsen storage offline.
Use dump -> load, available on all neo4j installations, to dump and upload your data.
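A minimal sketch of that dump step is shown below, written as a Python wrapper around the neo4j-admin CLI; the database name and output path are placeholders that depend on your Neo4j version and layout.

```python
import subprocess

# Placeholders: "graph.db" is typical for Neo4j 3.x (use "neo4j" on 4.x),
# and the output path is up to you.
DATABASE = "graph.db"
DUMP_FILE = "/tmp/amundsen-metadata.dump"

# Stop Neo4j (bring the storage offline) before running the dump.
subprocess.run(
    ["neo4j-admin", "dump", f"--database={DATABASE}", f"--to={DUMP_FILE}"],
    check=True,
)
print(f"Dump written to {DUMP_FILE}; this is the file to upload to Stemma.")
```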
After import, your Stemma instance will be up to date with your Amundsen instance. However, Stemma has a significant number of additional features so the layout will appear different. Check out the quick start page here:
Multi-step column-level lineage on the Lineage tab of table details in Stemma
But your data won’t be periodically updated until you enable periodic updates; see the next step.
2 – Ingesting metadata continuously
You need to configure Stemma to ingest metadata updates on an ongoing basis. Once this configuration is done, you won’t need Python databuilder jobs to ingest data.
Information ingested in this step includes:
Table and column names
Linked issues (JIRA tickets)
Frequent users
Lineage, if it exists
In Stemma’s Admin interface, provide credentials to your data sources to allow Stemma access to extract metadata on an ongoing basis:
Stemma will now do the work to “upsert” metadata updates, as they occur, into the data you uploaded from Amundsen in the previous step.
3 – Cut Over
At this point, Amundsen and Stemma are running side-by-side, and Stemma recommends sharing access to Stemma with your power users and getting their feedback to ensure all of their use-cases are supported. Usually one week of overlap to obtain feedback and an additional week to incorporate feedback, if applicable, is sufficient.
When all looks good, simply redirect the URL of your internal Amundsen to the Stemma URL.
If you have any feedback or questions about moving from Amundsen to Stemma, please reach out to [email protected]!
dbt integration
Stemma integrates with your dbt Cloud or Core projects and is able to ingest and keep up to date all of the metadata in the table below.
Integrating with dbt Cloud requires minimal information and access to your organization’s dbt Cloud account, plus administrative access to Stemma.
The following 3 properties are needed from dbt Cloud:
Account ID
Usually visible within the browser URL on the account settings page: https://cloud.getdbt.com/next/settings/accounts/<account-id-number> or within any dbt Cloud project: https://cloud.getdbt.com/next/deploy/<account-id-number>/projects/<project-id>.
API Token
For organizations on the Team Plan, the account-wide permissions required are Metadata and Read-Only.
For organizations on the Enterprise Plan, the necessary permission set is Account Viewer and Job Viewer for all projects that should have metadata ingestion.
Host
Unless organization specific customization has taken place, the host is likely cloud.getdbt.com
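If you want to sanity-check the token and account ID before adding them to Stemma, the sketch below calls the dbt Cloud Administrative API; treat the exact URL as an assumption and adjust the host if your organization uses a custom dbt Cloud domain. Whether a given token can read the account also depends on its permission set.

```python
import requests

# Placeholders collected from dbt Cloud as described above.
HOST = "cloud.getdbt.com"
ACCOUNT_ID = "12345"
API_TOKEN = "<service-token>"

# Reads the account; a 200 response indicates the token authenticates.
resp = requests.get(
    f"https://{HOST}/api/v2/accounts/{ACCOUNT_ID}/",
    headers={"Authorization": f"Token {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("dbt Cloud credentials look valid.")
```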
With these credentials in hand, all that’s left is adding the connection information to Stemma:
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection.
In the pop-up window, choose dbt as the Connection type, provide a Connection Name, and fill in the API Token, Host and Account ID collected from dbt Cloud.
When you are finished, click Save Connection
dbt Cloud integration may take up to 24 hours for initial ingestion.
The S3 prefix path will be “dbt/{date}”, where “{date}” is the current day’s date in UTC in the format “YYYY-MM-DD”. For example, for a file delivery on 04/01/2022 we would expect to see the following files:
Stemma indexes Tableau dashboards along with their underlying tables, providing a one-stop search for all data assets. Stemma also tracks how many times a dashboard has been viewed, instantly highlighting which is most commonly used.
Is the right data being used?
Stemma automatically correlates your Tableau dashboards to the corresponding table asset by reconciling the table names extracted from the Tableau Metadata API with the existing table metadata Stemma has captured from your warehouse. This allows you to start your exploration by finding trustworthy reports and, in turn, using the lineage to discover which tables are useful to you.
You can also explore the dashboard’s lineage upstream to get an end-to-end view of where the data came from.
How can I use the metrics in this dashboard?
Tableau provides Stemma with the SQL that it creates to generate each sheet, and Stemma indexes this SQL so you can easily find the code. You can copy and paste the code into your IDE to jump-start the SQL exploration.
Best Practices for Data Ownership
👍 The three types of ownership that work well
It works well to break down ownership of the most important data into three distinct roles:
Delivery owner — Ensures that this particular data gets delivered on time, on a promised delivery SLA. Usually, this is a data engineer or analytics engineer responsible for developing and maintaining the pipeline that produces this data.
Domain owner — What does this particular value in a field (or column) mean? When does this particular event get triggered? Usually, this is a product engineer who created the event or the analysts and data scientists who use this data most frequently and understand the physical reality that the data represents.
Policy owner — Ensures that this data gets used according to the classification and terms associated with it. Sometimes you acquire data from a source that should not be used for a certain category of use-cases. For example, YouTube is permitted to show ads to kids, but not permitted to personalize them. Therefore, the personalization data can’t be used if the subject is a child. The person making these calls is usually not an engineer or data scientist, but someone on the Policy or Privacy team at the company.
👎 Types of ownership that often don’t work as well
Overall Quality owner: In practice it’s often hard to find people able and willing to take ownership of the end-to-end quality of a data set. This is because data engineers don’t consider themselves owners of the data that’s produced from the upstream application and don’t want to be responsible for hunting down a website bug that impacts the data in the warehouse. Product engineers, on the other hand, don’t have enough context about how data gets joined and transformed downstream to own the final derived data artifact. This may change as decentralized data management (or data mesh) is deployed more broadly.
Shared ownership: This often does not work in practice, though it can work well if the owner group shares a good understanding of the different types of ownership and the group can efficiently redirect each question to the right person.
🏥 The data engineer’s role
In practice, the data engineer plays the role that a triage nurse would play in an Emergency Room. When an issue arises (like a patient arriving in the ER), the data engineer triages to see what’s going on. Sometimes it’s a problem the engineer can fix, so they fix it and resolve the issue (like the nurse treating an injury on the spot). In other cases, the engineer redirects the issue to the appropriate owner (as in a referral to a doctor or other health practitioner). In short, the issue determines who the owner should be.
Using Stemma
Use Stemma to assign and document ownership. Divide ownership as follows:
Stemma’s advanced search helps you quickly narrow your criteria and find the data assets you want to see:
Search bar: The search bar allows you to enter free-form text to search across nearly every piece of metadata in your entire data ecosystem. This includes table names, descriptions, column names, column descriptions and more.
Filtering: You can use the following powerful filters, separately or together, to narrow your search to find the assets you need:
Resource filtering: allows you to choose what type of resource to return in the search results; options are Tables, Dashboards, and Users. By default, Tables and Dashboards are selected.
Asset status filtering: filters results to only include the relevant statuses. The options are Certified, No Status, Deprecated, and Intermediate. The default options are Certified and No Status.
Table filters: these apply only to the Tables results. If you have another resource type besides Tables selected in your resource filter, the values used in the table filters will not affect those other results. Table filters allow you to filter results by the data source, database, schema and table name. You can also filter by column; this will return any table that contains that column name.
Dashboard filters: these apply only to the Dashboard resource results. If you have another resource type besides Dashboards selected in your resource filter, the values used in the dashboard filters will not affect those other results. Dashboard filters can be used on the product, folder or dashboard name.
Table & Dashboard filters: These can be applied to both Tables and Dashboards resource types. You can filter both Tables and Dashboards by the tag that is associated with them.
Advanced filtering
By default, all filters look for an exact match for the value you provide. But the Table, Dashboard and Table & Dashboard filters allow for advanced wildcard searches.
To use a wildcard search, you can place one or more asterisks * in the filter box. Here are a few examples:
Filter to include only schemas that start with marketing_ (pattern: marketing_*)
Filter to only include schemas that end in _prod (pattern: *_prod)
Filter to only include schemas that have _user_ in them (pattern: *_user_*)
You can also combine multiple filters together!
For example, you might want to find all tables in the marketing schema that begin with marketing and end with raw, but only tables that include the column order_id.
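The filter examples above use glob-style asterisk patterns, where * matches any run of characters. Purely as an illustration of how such patterns match schema names (this is not Stemma code, and the schema names are made up):

```python
from fnmatch import fnmatch

schemas = ["marketing_events", "marketing_raw", "finance_prod", "core_user_facts"]

# Glob-style patterns mirroring the filter examples above.
patterns = ["marketing_*", "*_prod", "*_user_*"]

for pattern in patterns:
    print(pattern, "->", [s for s in schemas if fnmatch(s, pattern)])
# marketing_* -> ['marketing_events', 'marketing_raw']
# *_prod -> ['finance_prod']
# *_user_* -> ['core_user_facts']
```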
Understanding the Results
Stemma’s advanced search naturally ranks your search results based on their actual usage. This means the more users run queries on a table the higher the rank that table is likely to have in your search results. In addition, the result list shows you additional reasons why your assets were returned.
This example shows the search results for the search term “orders”:
Points to note:
Any words in the table or dashboard description that match the search term are highlighted
Any column names that have an exact match or a similar match are listed to the right
Any column description that matches the search term is displayed on the right
Delta Lake Integration
Proceed as follows to create a Delta Lake connection to Stemma.
Step 1: Generate a Personal Access Token
Follow these instructions to generate the Delta Lake personal access token that Stemma will need to extract your data into the catalog.
Step 2: Provide the Personal Access Token and Hostname to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose Delta Lake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows:
Provide the Personal Access Token you created in Step 1
Provide the fully-qualified domain name of the Delta Lake host; for example mydlhost.mydomain.com.
When you are finished, click Save Connection
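If you want to confirm the token works before saving the connection, the sketch below assumes the Delta Lake host is a Databricks workspace and uses its public REST API as a harmless read-only check; the host and token values are placeholders.

```python
import requests

# Placeholders: the host and personal access token from Steps 1 and 2.
HOST = "mydlhost.mydomain.com"
TOKEN = "<personal-access-token>"

# Assumes the Delta Lake host is a Databricks workspace; lists clusters
# as a read-only call to confirm the token is accepted.
resp = requests.get(
    f"https://{HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("Personal access token accepted by the host.")
```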
Assigning Ownership of Tables
You can assign ownership of a single table to individuals or teams. You can also assign owners to a set of tables in bulk, adding all the owners in a single transaction.
Note: You can assign owners to tables in Stemma, but not to dashboards. Stemma ingests ownership of dashboards from the underlying BI tool(s) and you need to make updates and modifications in those tools.
The importance of ownership
Ownership is one of the fundamental pillars of a data catalog. It defines responsibility by individual users and teams, adding tremendous value to the data in the catalog, including:
Delegation of duties: Which users are responsible for keeping this table properly documented?
Compliance with processes and procedures: Has the right team updated the documentation for this table before the newest pull request is merged?
Reporting and analytics: Which teams have the highest percentage of their assets documented? The lowest?
You can assign ownership of a table either to individuals or teams:
Individuals are the people in your company who use Stemma. You can find individuals by their email address or a Slack handle.
Teams are represented in Stemma by Slack channels.
To assign one or more owners to an individual table, choose the table in the Stemma UI and click on Add Owner at the bottom of the left panel.
But Stemma also allows you to bulk-assign table ownership across the catalog. You can do this in a single transaction, immediately allowing your users to identify the assets that belong to them.
Assigning owners in bulk
You may think about ownership as a hierarchy, or your teams may typically own all of the tables within a single database or schema, or perhaps your analysts use dbt to create their own workflows and create tables with specific naming conventions (e.g. marketing_sales, marketing_prospects). Regardless of how your company thinks about grouping assets, Stemma can help you to bulk-assign owners, using advanced search to find any of the assets you want to update and assigning the owners with a single command. To assign ownership of tables in bulk, proceed as follows:
From the home page of the Stemma UI, navigate to Advanced Search.
At the top of the left panel, make sure only the Tables resource is checked (you can’t bulk-assign owners to dashboards).
Optionally use the filters in the left panel to limit the search results to the set of tables to which you want to bulk assign ownership:
Stemma suggests you limit your search results using only the filters on the left; while you can also use the search bar at the top, doing so makes it more likely that tables that you don’t want included slip into the results
Click on Bulk Edit at the bottom right of the page and choose Add Owners.
In the dialog box, enter or select an email address or Slack channel and choose Add.
To add more owners, repeat Step 5 for each new owner.
When you are finished, click Add Owners and then Done.
Stemma automatically pulls in the teams you’ve defined as well as the list of users in your organization:
Note: When you assign ownership in bulk, this does not trigger automatic messages to the owners and users; that only happens when you assign ownership individually.
Displaying Data Lineage in the Stemma UI
Navigating the Lineage Graph
Column filtering
From the root node, search for columns of interest using the search bar in the right hand node panel
Click on columns that have a filter icon to prune the graph to only column-level relationships for the selected column
To examine columns in a related node, select that node
Below the search bar in the node panel you can select between “Filtered” and “All columns” to view the appropriate list of columns for the selected node
To filter by a different column, simply select the root node and then select any other column that has a filter icon
To remove the filter, simply select the root node and then re-select the filtered column
Focus mode
Select any node in the graph to initiate focus mode and make it easier to see tables and dashboards related to the selected node and the root node
When a collapsed node is part of the graph, it will automatically show the relevant table or dashboard with an expand icon
Click the expand icon to break the table or dashboard out of its collapsed group
Dashboards
When the selected node is a dashboard, the right-hand node panel will show charts and queries instead of columns
If the dashboard is the root of the graph, any chart with lineage will display the filter icon, which can be clicked to show only upstream data sources for that chart
Finding Tabular Lineage for Columns
End-to-end lineage
Navigate to the details page for the table
Click on the Lineage tab near the top of the page
In the Filter by column dropdown menu near the top of the Upstream or Downstream section, select the column with lineage that you are interested in
The table will now show related tables and columns alongside their distance from the current table
Simple
Navigate to the details page for the table.
Click on About tab near the top of the page.
Click on the up-arrow next to the column whose lineage you want to see
The row will expand to show column details, including the number of immediate upstream and downstream relatives of the column
Click the upstream or downstream links to see a pop-up table of the related columns
Evaluate data – For example, if you are not seeing the results you expected from a given table or dashboard, the problem may be one of timing: use the UI to check all upstream tables and make sure they ran when they should. To make this even easier, you can filter the graph to trace a specific column’s lineage.
Plan changes: The interconnected nature of data products– with multiple products often chained together to build upon each other– means that even the smallest change in a pipeline can ripple downstream and disrupt your business. These small but potentially disruptive changes are frequent and necessary:
An analytics engineer builds a table using dbt; that table becomes popular and several dashboards are built on top of it. It becomes well trusted, and the logic becomes a standard for calculating certain metrics. But the table is missing several important entity IDs (foreign key columns) which makes it difficult to join to other tables. The analytics engineer and a data engineer work together to complete a “micro migration” to a new, more performant and reusable table.
A machine-learning model is used to predict churn for a customer. Each day after the model runs, users that are predicted to churn are identified in a table in Snowflake and then sent an email to re-engage them. The product team is launching the second major product and wants to measure churn on a per-product basis. The engineering team updates their data pipeline to add a new column to the output table that identifies the product that churn is being predicted for.
An analyst created several views to consolidate the common logic that the team were using to query tables in their dashboards. But the performance of the views has deteriorated as the amount of data has grown. The analyst and the engineer work together to create materialized versions and to update all of the dashboards using the prior views.
Use Data Lineage in Stemma to Identify Affected Assets
The challenge with such changes is that it is hard to accurately curate a list of all assets that will be affected. If you cannot be certain of what will need to be changed or what could break, you cannot do risk analysis or even make a time estimate. Stemma enables you to do the analysis and create the appropriate messaging.
The key question for this type of change is “What tables and dashboards will I need to update if I rename this view?”
Stemma provides a lineage graph for each table and dashboard. From here you can see who and what is utilizing your resources downstream, and how popular your assets are. Data owners can use this in conjunction with their own knowledge to build roadmaps and plan how data should evolve.
Use Stemma Messaging to Communicate Expected Changes
Everyone affected needs to know what is going to change and when. After building your list of affected assets and their users, use Stemma messaging to provide advance notice and instructions.
How Stemma Discovers Data Lineage
When you execute SQL against your warehouse, the database captures the raw SQL and saves it in a historical format that includes the time the SQL command was executed and the user who executed it. Stemma reads these SQL logs and parses all of the commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page.
In the case of table and column lineage, any change to a table (via an insert, update, or delete from another table) will create the appropriate table and column lineage tracking.
Note
When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.
Hive Integration
What we need from you
Stemma needs certain information and credentials to extract Hive metadata into the catalog. Contact Stemma by email or Slack, and provide the following:
Metastore Host: the hostname or IP address of the Hive metastore server to which you are connecting.
Username and Password: You will need to provide a username and password for Stemma to use to access the Hive schema.
Metastore Port: Server port used for accessing metadata about hive tables and partitions. The default Hive metastore port is 9083.
List of Databases: Stemma whitelists databases, and so you will need to provide a list of the databases we will be importing.
Metadata extracted
The metadata Stemma extracts includes:
TBLS – stores basic information about Hive tables, views, and index tables.
DBS – stores the basic information of all databases in Hive.
PARTITION_KEYS – the field information of the table storage partition.
TABLE_PARAMS – stores the attribute information of the table/view.
SDS – saves the basic information of file storage, such as INPUT_FORMAT, OUTPUT_FORMAT, whether it is compressed or not.
COLUMNS_V2 – stores the field information corresponding to the table.
PARTITIONS – stores the basic information of table partitions.
Sharing additional metadata via CSV files
Along with a direct connection to various data sources, Stemma can ingest data using extracts in various file formats. One of the most commonly used file formats is CSV.
Below are some of the CSV formats Stemma ingests today (along with the sample files), with the column names and values we expect in each file. Note that you need the column header literals (for example, db_name, etc.) as well as the values.
Github Links
Used to provide a link to a Github source from Stemma’s Table detail page.
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
source: the Github URL for the repository.
source_type: static value, “github”.
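As an illustration, the snippet below writes a Github Links file with the expected header literals and one made-up row; the same pattern can be adapted for the other CSV formats on this page.

```python
import csv

# Column header literals expected for the Github Links format.
FIELDS = ["db_name", "cluster", "schema", "table_name", "source", "source_type"]

# The row values below are made up for illustration only.
row = {
    "db_name": "snowflake",
    "cluster": "prod",
    "schema": "analytics",
    "table_name": "orders",
    "source": "https://github.com/my-org/my-repo/blob/main/models/orders.sql",
    "source_type": "github",
}

with open("github_links.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(row)
```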
Airflow
Used to generate a link between a table and an Airflow task. Once the data is ingested, you will see an “Airflow” button on the Stemma detail page for the table.
task_id: the Airflow ID for the task.
dag_id: the Airflow ID for the DAG.
exec_date: the most recent execution timestamp for the task.
application_url_template: the URL for the DAG in your Airflow instance.
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
cluster: the name of the database within the data source.
Dagster
Used to link a Dagster op and/or job to a table in Stemma. Once the data is ingested, you will see a “Dagster” button on the Stemma detail page for the table.
application_url_template: the URL for the job in your Dagster instance.
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
Table Tags
Stemma can create and assign tags via a CSV file, with the following format:
db_name,cluster,schema,table_name,tags
hive,gold,test_schema,test_table1,"tag1,tag2"
hive,gold,test_schema_2,test_table2,"tag1,tag3"
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
tags: the tags to be created. Tags are additive: existing tags on tables will not be modified.
Table Description
db_name,cluster,schema,table_name,description
hive,gold,test_schema,test_table1,"Description for table, optionally markdown."
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
description: the table description. Note that descriptions are destructive: if a description already exists, it will be overwritten.
Table Owners
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
owners: email addresses or Slack channels. Owners are additive: existing owners of tables will not be modified.
Column Descriptions
db_name,cluster,schema,table_name,col_name,description
hive,gold,test_schema,test_table,test_column,This is an example column description.
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
col_name: the column described.
description: the column description. Note that column descriptions are destructive: if an existing description is present, it will be overwritten.
JIRA Integration
This document walks you through the steps you need to take to integrate Stemma with your JIRA installation. You must be a JIRA Admin to do this.
Steps
Step 1 – Collect information about your JIRA environment
Collect the following information to provide to Stemma (see Step 3):
The URL for your JIRA environment (e.g., <https://jira.net>)
The project ID for the JIRA project you would like Stemma to create tickets in. See this page for guidance from Atlassian on how to obtain the project ID for a JIRA project.
If using JIRA Server or Cloud, the ID for the JIRA Bug issue type
Note: Skip this step if using JIRA on-premise
See this page for guidance from Atlassian on how to obtain the ID for the Bug issue type
Optional: The name of one or more labels you would like set on tickets created by Stemma
Step 2 – Create a service account
Stemma will need an email and API token to authenticate to your JIRA environment. Our recommendation is to create a JIRA service account so that the authentication is not tied to an actual user’s account.
a. Go to Administration > User management > Users
b. Click Create User and fill out the required information
c. Create an API token from the service account
For JIRA on-premise, API tokens are not currently supported, so you will provide the password instead.
d. Grant permission to the service account
The service account will minimally need access to create and view tickets in the JIRA project you intend to use
You can either grant permissions directly to the service account or you can grant the service account a role
Stemma needs:
The email address of the service account
The API token or password
See Step 3.
Step 3 – Provide the information to Stemma
Provide the information from Steps 1 and 2 to Stemma by email or Slack.
Linking JIRA Issues to Stemma
To link a JIRA issue to a table in the Stemma catalog, you need to include a Table Key tag in the issue description. The tag format is Table Key: {SOURCE}://{DATABASE}.{SCHEMA}/{TABLE} [PLEASE DO NOT REMOVE].
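For example, for a Snowflake table (the names below are illustrative), the tag might read:
Table Key: snowflake://gold.test_schema/test_table1 [PLEASE DO NOT REMOVE]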
How Do I Link a JIRA to Stemma?
To link a JIRA issue to a table in the Stemma catalog, follow these instructions.
Finding Important Information about Tables and Columns
Stemma provides out-of-the-box integration with applications such as Snowflake, dbt, and BI tools such as Looker, Mode and Tableau to capture rich metadata from each of these systems automatically. For example, in the case of Snowflake, Stemma reads both the information_schema and the query logs, and automatically generates important information such as when a table was last updated, column-to-column lineage, table-to-table lineage, table-to-dashboard lineage, frequent queriers of the table, common join and filter conditions, etc.
The Stemma UI makes it easy for you to discover this important information about your tables and columns, and also provides a convenient way to add and modify curated information.
The table details page provides the following information:
About
Tags
Descriptive tags for cross-referencing, such as covid, cases, etc.
Asset status
This can be Certified, Deprecated, Intermediate, or No Status
Owners
The person or people responsible for maintaining this table
Click on an owner to see other assets owned by that user, as well as their bookmarks and frequently used tables. You can also email the user via this profile page.
Last Updated
The date and time of the last update to the table (this can be different from the most recent date covered by the data itself)
Date Range
The date range covered by the data in the table.
Columns
Descriptions are automatically ingested from a source (like dbt or Snowflake). If there is no description to ingest, Stemma will automatically use column-level lineage to search upstream for a description in an upstream column with the exact same name. This documentation is shown as “Autodescribed.”
Column descriptions can be created, and autodescribed documentation overwritten, within Stemma. Also available in this table are the data type, related glossary terms, and badges.
Lineage
dbt Compiled Query
For dbt users, see the query that formed the table, as well as the dbt model type.
Upstream, Downstream
Upstream and downstream tables and dashboards, sorted by popularity.
Where possible, Stemma captures the information directly from the source. This table shows which attributes are populated from the source and which you can edit:
Table Metadata                          Automatically captured?    Users can edit?
Table description                       ✅                          ✅
Asset status                            ❌                          ✅
Owners                                  ✅ (dbt)                    ✅
Tags                                    ✅ (dbt)                    ✅
Jira issues                             ❌                          ✅
Last updated time                       ✅                          ❌
Data date range                         ✅                          ❌
Commonly used with tables               ✅                          ❌
Frequent users                          ✅                          ❌
Upstream & downstream table lineage     ✅                          ❌
Dashboards that use the table           ✅                          ❌
Slack threads with table mentions       ✅                          ❌
Looker Integration
This document walks through the steps required to create a service account in your Looker, which will be connected to your Stemma Instance for dashboard metadata extraction.
Stemma uses the Looker APIs (via the Looker Python SDK) to extract the metadata of Folders, Dashboards, Dashboard Elements, and Explore Queries, and the data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Looker. Read Looker’s docs: https://docs.looker.com/reference/api-and-integration
Stemma uses API3 keys (Client ID & Client Secret) to make a secure connection between your Stemma and Looker instances. Below are the steps required to create new API3 keys for this integration.
What is extracted?
Stemma extracts information about Folders, Dashboards, Dashboard Elements, and Explore Queries, and the data sources associated with them.
Stemma will not extract Dashboards that are deleted, hidden or in a user’s personal folder.
Steps
Step 1: Make sure you have a Looker Admin role
The Looker → Stemma integration requires an Admin role to fetch the correct data and build the relationships between entities such as table lineage and dashboard owners. So, as a first step, make sure you have an Admin role.
Step 2: Generate API3 keys
The next step is to create an API3 Key. From the Admin menu, navigate to the Users page.
Find the user you want to use to generate the API3 key, and click the “Edit” button under the Actions column.
Once you are on the “Edit User” page, find the API3 Keys settings, and click “Edit Keys“
Click “New API3 Key” to generate a new Client ID and Client Secret.
Step 3: Provide the details to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Looker as the Connection type, provide a Connection Name, and then fill out the fields in the dialog box as follows:
Base URL Example: <https://stemma.cloud.looker.com/>
Deprecating Tables and Dashboards
You can set the status of a table or dashboard as deprecated or intermediate. When you do so, that asset remains in the Stemma catalog, but is removed from default searches (you can change this by means of filters in Advanced Search).
When you deprecate an asset or change its status to intermediate, you should modify its description accordingly, and you should also be careful to message all users and owners of the asset (including users who might be using the asset indirectly, such as dashboard users of a table) to warn them about the change in status. This allows you to make sure you hear from all stakeholders before you remove or move the asset.
What Happens when a Table or Dashboard Is Removed?
When a table or dashboard is removed at the source, it is removed from Stemma after ten days by default. Similarly, if the table or dashboard is renamed or moved, the original entry is removed from Stemma after ten days.
Before removing an asset, you should deprecate it to give stakeholders sufficient warning.
Advanced Description Editing
The Stemma UI provides a WYSIWYG editor for descriptions of tables and columns.
Using the editor
To begin editing, navigate to the table or column whose description you want to edit, and click on the pencil icon next to Description:
This opens up the text box for editing:
Add or change the description.
Descriptions can include links and images as well as text.
You can choose font styles as indicated by the buttons.
Use the pull-down menu to toggle the text format from the default (Normal) to common formatting elements such as headings, lists, quotes, and code blocks.
When you choose Code Block, you can choose the type of code from the secondary pull-down menu (the default is JavaScript).
You can add hyperlinks: enter and highlight the text to display, then click the 🔗 button and enter the URL (and edit it if necessary) in the text box next to the button.
The editor also supports Markdown syntax, as well as any shortcuts your system supports (such as ⌘-b for bold on a Mac)
When you are finished, click Update to save your changes.
Linking to other tables from the description
You can link from a table’s description to other tables in the catalog. This is useful, for example, if you want to give other users more information about the relationship between the tables.
To link to another table from the description, do the following:
In the Stemma UI, navigate to the table whose description you want to edit.
Click on the pencil icon next to Description.
Enter the new description, providing the related table’s <schema_name>.<table_name> anywhere in the text.
Click Update to save your changes.
The new description will contain the link to the related table:
Sharing column descriptions
You can share a column description with all columns of the same name across the catalog. To do this:
In the Stemma UI, navigate to the table whose column description you want to share.
To add or modify a description, click on the down-arrow to the left of the entry for the column:
Hover over the word Description and click on the pen that appears:
Add or update the description and click on Update. Stemma automatically discovers all columns with the same name, and the description propagates to all linked columns. To see which columns are linked, and add and remove links, click on the three dots to the right of the column entry:
Click Edit linked columns in the pop-up menu. In the dialog window, you can search for and add columns to link, and remove existing links.
When you have finished, click Done.
Mode Integration
The Stemma Mode integration accesses the Mode Discovery API. This API is distinct from the regular Mode API: it is much more performant but only contains a subset of the available endpoints. This means Stemma is able to access only metadata about reports, not the report runs/results. Stemma also accesses the regular REST API to augment the information retrieved from the Discovery API. In all, there are four values required by Stemma:
Your organization name in Mode
The access token to access the REST API
The password token to access the REST API
A signature token to access the Discovery API
Optionally, if you would like Stemma to only extract metadata from specific Mode spaces you may also provide Stemma a list of the spaces to include.
The access token and password token allow Stemma to augment the information within the Discovery API with the regular REST API to unlock two additional capabilities:
The ability to link your dashboards to the underlying tables in your warehouse which are used to serve the dashboards.
Stemma uses the general Mode API for data sources to retrieve your data source information: the name of the database (e.g., PUBLIC, mktg, etc.) as well as the type of data source (e.g., Snowflake, Redshift, etc.), in order to apply the proper SQL parsing semantics.
The ability to rank your dashboards by the number of times the dashboard has been viewed in the past 90 days.
Stemma uses the general Mode API for reports to retrieve an accurate view count.
Creating the Signature Token
To authenticate with the Mode Discovery API, a signature token must be created. This signature token can be created by running a curl command using the account’s REST API access token and password secret. The following code is a snippet from Mode that has been slightly adjusted to create a signature token that Stemma can use. Make sure to replace {organization}, {your-access-token}, and {your-access-secret} with the appropriate values (organization can be found in the URL when you visit Mode, e.g. https://app.mode.com/<organization>/spaces). Visit the Mode documentation for information on creating the api-key and api-secret.
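The original snippet is not reproduced here. As a rough, unverified sketch only (the endpoint path and request body below are assumptions based on Mode’s public documentation and must be confirmed there), the request generally takes this shape:
# Sketch only: create a read-only Discovery API signature token (verify against Mode's docs)
curl --request POST \
  --url 'https://app.mode.com/api/{organization}/signature_tokens' \
  --user '{your-access-token}:{your-access-secret}' \
  --header 'Content-Type: application/json' \
  --data '{"signature_token": {"name": "stemma-discovery", "auth_scope": {"authentication_for": "mode_discovery", "authorization_type": "read-only"}}}'
The response should include the signature token value, which is what you provide to Stemma.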
Please note, if this does not work it likely means you do not have permissions. Please contact Mode support for help creating a read-only signature token.
APIs used
The APIs used are listed in the table below. Data Extracted is the data we actually store from the API (and may change as we add functionality). Risks highlights any potentially sensitive information in the API (whether or not we actually store it).
Please note that the “Report Stats” endpoint is not enabled by default in Mode, so the functionality that depends on it may not be available until the endpoint is enabled.
Low: contains access to information about the data source. Stemma should separately have access to this information, but this API access is required to link a Mode dashboard to a specific database.
Notes
Here are some additional notes about possibly sensitive information we may receive from Mode:
Dashboard execution times are non-sensitive on their own, although it is possible to infer NPMI from who is running which dashboard queries at precise times. Stemma only displays aggregated forms of this information, but stores it disaggregated.
Query names may contain NPMI (e.g. a query titled “EMEA Expansion Potential”, when the company hasn’t announced its intention to expand into EMEA).
Raw SQL queries may contain sensitive information (PII), typically in the form of filter clauses. This is generally not allowed under most security policies, especially under Open Collections; however, it is a possibility. By default, these SQL statements will be shown to users in Stemma.
Configuring Your Stemma Connection
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, optionally add a name for the connection, and choose Mode from the Connection Type drop-down menu.
How Does Stemma Calculate "Top 5 Columns", "Frequent Users", etc?
The table details page in the Stemma UI provides usage information, derived as follows:
To calculate the top five upstream and downstream columns, Stemma multiplies the number of unique users by the number of queries to get the ranking. The time frame is 30 days (see the sketch after this list).
To identify tables commonly used with this one, Stemma counts the joins that were part of any query in the past 90 days.
To identify frequent users of a table, Stemma counts any user who has run a query on this table in the last 30 days.
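As a rough illustration of the ranking idea only (the log table and columns below are hypothetical, not Stemma’s actual schema), the “unique users × queries” calculation corresponds to a query along these lines:
-- Hypothetical sketch: rank by unique users multiplied by query count over 30 days
select column_name,
       count(distinct user_id) * count(*) as rank_score
from query_log
where query_time >= dateadd(day, -30, current_timestamp)
group by column_name
order by rank_score desc
limit 5;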
Finding Important Information about Dashboards
The Stemma UI makes it easy for you to discover this important information about your dashboards, and provides a convenient way to add and modify curated information.
Status
This can be Certified, Deprecated, Intermediate, or No Status
Description
A brief summary of the contents of the dashboard:
Owners
The person or people responsible for maintaining this dashboard:
This information is extracted automatically from the source and (unlike table ownership) can’t be edited in Stemma.
Tags
Descriptive tags for cross-referencing
Contents
Total Views
Total number of views over the life of the dashboard
Created
The date and time the dashboard was created
Last Updated
Date and time of the last update to the dashboard
Last successful run
Date and time the dashboard was last run successfully
Last run
Date and time the dashboard was last run (successfully or not)
Tables
Tables used by the dashboard
Charts
Charts available from the dashboard
Queries
Queries run from the dashboard. Click on “View Query” to see the code and copy it to your clipboard.
Open folder
Open the dashboard’s folder in the underlying application (you may need to log in to the application)
Open dashboard
Open the dashboard in the underlying application (you may need to log in to the application)
Okta Integration
This document walks you through the steps to create and configure an Okta application that Stemma can use to authenticate users.
Note: You must be an Okta administrator to do this.
Steps
Log into Okta admin portal
In the left frame, select Applications:
Select Create App Integration:
Select the options for OIDC – OpenID Connect and Web Application.
In the New Web App Integration, enter the following values:
App Integration Name: Your app name; we suggest Stemma Data Catalog
Grant Type: select the following –
Authorization code
Client Credentials (Client acting on behalf of itself)
Refresh token
Sign-in redirect URI:
Use the fully qualified hostname and add the endpoint:
/auth
If you are running a hosted version of Stemma, this should be: https://<customer>.stemma.ai/auth, where <customer> should be replaced with your company name
Sign-out redirect URI This should be the base Stemma URL. Example:
https://<customer>.stemma.ai
Trusted Origin:
If you are self-hosting Okta, provide the Base URI for your Okta Sign domain, otherwise, leave this blank
Assignments:
Select the assignment access you would like within your organization
Stemma will require access to certain APIs to retrieve user information from your organization’s directory.
Navigate to Okta API Scopes:
Make sure to grant permission for the following scopes:
okta.users.read
okta.users.read.self
Provide the following values to Stemma by email or Slack so that authentication can be built into your Stemma deployment:
Okta Discovery endpoint (well-known configs)
Client ID
Client Secret
Using Admin Functions and Reports
API Keys
Allows you to add and delete keys for the GraphQL API.
These reports provide an account of your organization’s interactions with the Stemma catalog, including curation work that needs to be done, as well as usage. They can be especially useful when you are rolling out Stemma to your users and scoping the remaining work.
Customer Summary Shows the number of data sources, tables, and dashboards that are integrated into the Stemma catalog.
Monthly product usage Shows the number of users and searches by month.
Documentation summary Shows the percentage of columns, dashboards, and tables missing documentation and/or ownership.
Assets without docs Shows the specific tables and dashboards that lack documentation (click on the drop-down to switch between the two types of asset).
Assets without ownership Shows the specific tables and dashboards that lack ownership (click on the drop-down to switch between the two types of asset).
Redshift Integration
What we need from you
Stemma needs certain information and credentials to extract Redshift metadata into the catalog. To provide these, proceed as follows.
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Redshift as the Connection type and then provide the following information in the dialog box:
A username and password for Stemma to use to access the INFORMATION_SCHEMA
The hostname of the Amazon Redshift cluster. You can find it in the Properties tab of your Redshift Cluster Details page. Under the Connection details section, copy the Endpoint up to but not including the colon.
A port number. The default port number for Redshift is 5439. From your Redshift Cluster Details page, you can find the port number under the Database configurations section of the Properties tab.
A list of databases: Stemma whitelists databases, so you will need to provide a list of the databases we will be importing.
Access
The Stemma Redshift integration requires access to two pieces of information within the Redshift install:
information_schema metadata
Query logs
information_schema
The specific tables and views we access include:
PG_GET_LATE_BINDING_VIEW_COLS
svv_external_columns
COLUMNS
Because of the way permissions work in Redshift, these tables are only populated with data to which the user has SELECT access. If you don’t want to use column stats, and therefore want to restrict access for this credential, Stemma will give you the SQL code required to create a materialized view. You create that materialized view as a superuser, and grant our credential the ability to read and refresh that materialized view.
Query logs
Stemma Redshift requires query logs. We can extract these in two ways: push or pull.
Push
The recommended integration involves configuring Redshift to dump user activity logs into S3.
There are multiple methods to achieve this, either using CloudTrail or directly from Redshift. The right approach differs between organizations (many already have audit logging configured for other purposes).
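One possible sketch, using the AWS CLI (the cluster, bucket, and parameter group names are placeholders, and your organization may already have equivalent logging in place):
# Ship Redshift audit logs to an S3 bucket (placeholder names)
aws redshift enable-logging \
  --cluster-identifier my-redshift-cluster \
  --bucket-name my-audit-log-bucket \
  --s3-key-prefix redshift-logs/
# User activity logging also requires this parameter in the cluster's parameter group
aws redshift modify-cluster-parameter-group \
  --parameter-group-name my-parameter-group \
  --parameters ParameterName=enable_user_activity_logging,ParameterValue=true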
Pull
Stemma can extract logs from STL_QUERY. However, because this option requires the credentials to belong to a superuser, it is not the recommended integration method.
Column stats – Optional
The Stemma integration user must have full SELECT access to any tables or views for which you want to see column stats.
How Stemma Captures Data: SQL Parsing Overview and Caveats
What Stemma captures
When you execute SQL against your warehouse, the database captures the raw SQL that was executed and saves it in a historical format that includes the time that the SQL command was executed as well as the user who executed it. Stemma reads these SQL logs, along with access history, and parses all of the SQL commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page:
Table and column lineage: any change to a table (via an insert, update, or delete from another table) will create the appropriate table and column lineage tracking.
User table usage: Stemma uses select statements to associate users to the table(s) being queried; this provides the Frequent Users information on the Table Details page, search rankings, and the user’s profile page.
Commonly Used With: any joins that occur within the SQL, whether on the top-level query or as part of a sub-query, are captured and presented on the Table Details page to help other users understand how two or more tables are commonly joined (see the example after this list).
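For example (the table and column names below are hypothetical), a single statement like the following would create table and column lineage from raw.orders and raw.customers into analytics.daily_revenue, associate the querying user with those tables, and record the orders-to-customers join for Commonly Used With:
-- Hypothetical example: produces lineage, usage, and a recorded join
insert into analytics.daily_revenue (order_date, region, revenue)
select o.order_date, c.region, sum(o.amount)
from raw.orders o
join raw.customers c on o.customer_id = c.id
group by o.order_date, c.region;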
Caveats:
When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.
When user IDs in the database do not match the user portion of the user’s email address as recorded in Stemma, Stemma cannot associate the Stemma user to the user in the query logs. This prevents Stemma from creating any user-table usage information.
When system accounts are used to query the data, Stemma is often not able to see which user is executing the query. This is a common practice with BI tools when multiple team members share the same connection to a database and that connection uses a system account.
However, if you are using Mode, Stemma can parse the comments appended by Mode (which do contain the user ID) to find the user and associate that user to the query.
Commonly Used With does not capture joins where the left or right table is a dynamically generated select statement, for example: select * from table_a a join (select * from table_b b1 join table_c c1 on b1.id = c1.id) b on a.id = b.id.
Slack Integration
This document walks through the steps required to install a Stemma Slack application in your workspace.
Prerequisite
Stemma does not publish the Stemma Slack application to the Slack Marketplace. To install this application for your workspace, you need to create a new application using an application manifest file.
If you have an application manifest file, make sure you have it handy before you begin. If you don’t have a manifest file, you can copy and paste the following into a file:
_metadata:
  major_version: 1
  minor_version: 1
display_information:
  name: Stemma
  description: Help you enrich your metadata directly via Slack
  background_color: "#0a1930"
features:
  bot_user:
    display_name: Stemma
    always_online: true
  slash_commands:
    - command: /stemma
      url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/slash/help
      ## Replace {SUBDOMAIN} with your Stemma subdomain.
      description: Learn how StemmaBot works
      usage_hint: help
      should_escape: false
oauth_config:
  scopes:
    bot:
      - app_mentions:read
      - channels:history
      - channels:read
      - chat:write
      - commands
      - groups:history
      - groups:read
      - links:read
      - links:write
      - users:read
      - users:read.email
settings:
  event_subscriptions:
    request_url: https://{SUBDOMAIN}.stemma.ai/stemma/slack/events
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
    bot_events:
      - app_mention
      - link_shared
      - member_joined_channel
      - message.channels
      - message.groups
      - team_join
  interactivity:
    is_enabled: true
    request_url: https://{SUBDOMAIN}.stemma.ai/api/stemma/slack/v1/interactivity
    ## Replace {SUBDOMAIN} with your Stemma subdomain.
  org_deploy_enabled: false
  socket_mode_enabled: false
  token_rotation_enabled: false
Save the file with any prefix and in any location you like, but make sure:
The filename has the suffix .yaml.
You replace {SUBDOMAIN} with your Stemma subdomain.
Steps
Step 1: Verify your application manifest file
Make sure that at least the following properties in the application manifest file match your Stemma endpoint:
features → slash_commands → url:
settings → event_subscriptions → request_url
settings → interactivity → request_url
Step 2: Create and install your Stemma Slack application
a. Create a new application using the application manifest file.
b. You will be prompted to select a method to create a Slack Application. Select From an app manifest, and choose the workspace in which you want to install the Stemma Slack application.
c. Paste in your manifest file. Make sure you select the YAML format, and paste in the contents of the file (see Prerequisite above).
d. Click Next and verify the scopes and events the Stemma Bot is requesting. Finally, click Create to install the application to your workspace.
Step 3: Provide Signing Secret and Bot Token to Stemma
To secure the communication between Slack and Stemma, you need to provide Stemma with the Signing Secret and Bot User OAuth Token of your newly installed Stemma Slack application:
Signing Secret:
You can find the Signing Secret on the Basic Information page, in the App Credentials section.
Bot User OAuth Token:
You can find this token on the OAuth & Permissions page, in the OAuth Tokens for Your Workspace section.
Now provide the Signing Secret and Bot User OAuth Token to Stemma:
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Slack as the Connection type, provide a Connection Name, and then in the dialog box enter the Bot Token and Signing Secret.
When you are finished, click Save Connection.
Step 4: Receive confirmation from Stemma
Stemma will use the information you provide to integrate your Slack workspace with your Stemma Instance. Once you get the confirmation, you should be able to type /Stemma help in Slack to see usage help for the Stemma Slack application.
Step 5: Make sure to add an app icon 😊
Under the Stemma (Slack application) Basic Information tab, add the Stemma icon in the “Display Information” section. Users will be much more likely to use the bot if it’s easy to identify.
Right-click and “Save as…” and upload this icon to use as the Slack App icon
(Nice to have) Please invite Stemma team member(s) to a shared test channel
At this point, you have the Stemma Slack application installed within your Slack workspace. To verify that everything is working correctly, you can create a temporary shared channel and invite a couple of members from the Stemma team to that channel.
Making a Backward-Incompatible Change
When you are planning to make a backward-incompatible change, you need to proceed in a way that minimizes the impact and gives downstream stakeholders as much warning as possible. Examples of backward-incompatible changes are:
Changing a schema (type changed, name changed)
Changing values (data fields will no longer be populated, or will be populated differently)
You need to understand and communicate the impact before you make such a change. This is usually done in the following steps:
Understand the impact: understand who uses or queries this data.
Provide heads-up notification: notify downstream owners and users that this data is going to change, and that they need to update their assets (downstream tables or dashboards) accordingly. Provide a time window after which the change will go into effect.
(Ideally) Understand the impact again: see whether downstream users have changed their assets, how many still need to be nudged, etc.
Make the actual change and follow up to make sure that it’s complete.
Stemma can help you make the change as painless as possible. Use the UI to assess the effect on downstream users and on other assets. The following sections provide help:
Now warn all owners and frequent users about the change, and give them a deadline to make their adjustments.
Note
If you decide to change the name of a table to indicate that it’s deprecated (e.g., to something like deprecated_tablename), that will cause it, by default, to be removed from Stemma after ten days.
Snowflake Integration
What we need from you
Stemma needs certain information and credentials to extract Snowflake metadata into the catalog. To provide these, proceed as follows.
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection:
In the pop-up window, choose Snowflake as the Connection type, provide a Connection Name, and then in the dialog box fill out the remaining fields as follows:
Credentials We support authentication with a username and either a password or a private key.
Username and Password: You need to provide a username and password for Stemma to use to access the INFORMATION_SCHEMA as described above.
Username and Private Key: If you set up authentication via key pairs you will need to provide Stemma with a private key to use when accessing INFORMATION_SCHEMA as described above. If your private key is encrypted, you will also need to provide Stemma with the passphrase.
Snowflake Account Name: If you use the Snowflake web interface, the URL you use should look something like abc.snowflakecomputing.com. Everything that appears before .snowflakecomputing.com is your account name.
Warehouse: The name of the warehouse (also called Virtual Warehouse) Stemma will connect to.
Role: The role Stemma will use once connected. If Stemma is extracting query logs, this role will need to have MONITOR access on the warehouse.
When you are finished, click Save Connection.
Access
The Stemma Snowflake integration requires read access to all tables that will be included in the catalog, plus query logs.
Metadata
The specific tables and views we access include:
TABLES
COLUMNS
Due to the permissions model of Snowflake, these tables are only populated with data to which the role has SELECT access. Metadata will be extracted one database at a time since each Snowflake database has its own INFORMATION_SCHEMA.
Query logs
Stemma extracts logs from the QUERY_HISTORY table function in the INFORMATION_SCHEMA. Logs are used to inform lineage, usage, common usage patterns, and more. Due to the permissions model of Snowflake, this table is only populated with queries executed in virtual warehouses that the role has MONITOR access on.
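For example (the warehouse name below is a placeholder; stemma_role matches the role used in the Access History grant below), the MONITOR grant looks like this:
grant monitor on warehouse my_warehouse to role stemma_role;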
Access History
Stemma uses ACCESS_HISTORY to enrich lineage derived from query logs. To grant Stemma access, Stemma’s role will need IMPORTED PRIVILEGES on the SNOWFLAKE database:
grant imported privileges on database snowflake to role stemma_role;
If Stemma cannot read Access History, lineage will still work; however, accuracy will be reduced.
Note: If you try giving the Stemma role access only to the required view, without giving access to the SNOWFLAKE database and ACCOUNT_USAGE schema before that, Snowflake will probably reject your request with the “Grant not executed: Insufficient privileges.” error.
Using the External GraphQL API
Creating and Installing an API Key for the External GraphQL API
1. Create an API key in the Stemma UI:
a. Navigate to the Admin panel and choose Add API Key:
b. In the pop-up window, enter a name for your API key and click Submit. You can choose any name, but Stemma recommends using a name that describes the intended use of the key (e.g., github action integration):
Stemma generates an API key that appears in the window. Save this key somewhere secure. It is not recoverable.
Now click Docs in the upper right of the Altair UI; you should see the GraphQL schema on the right-hand side:
Tableau Integration
Usage information
The Stemma Tableau integration will sum view counts for all workbooks in a particular dashboard over all time.
Installation Instructions
This document walks through the steps required to create a service account in your Tableau, which will be connected to your Stemma Instance for metadata extraction.
Stemma uses the Tableau Metadata API (GraphQL) to extract the metadata of projects, workbooks, dashboards, and data sources associated with them. The Stemma integration is purely read-only and doesn’t modify any assets in Tableau. Read Tableau’s docs: https://help.tableau.com/current/api/metadata_api/en-us/index.html
Connection Type:
We support both Online and Extract Connections with no differences.
Steps
Step 1: Decide your authentication strategy
The Tableau Metadata API is authenticated using a Personal Access Token (PAT) attached to a specific user within your organization. You will always share with us the PAT, not the username/password of the user itself. The PAT has the same permissions as the user to which it is attached.
Most users will simply create a PAT attached to their admin account. This is the easiest to set up and simplest to manage longer term, and we recommend it. For this option, skip to Step 4.
However, because this means the PAT has write access to the instance and Stemma doesn’t use write access at all, this doesn’t strictly follow the principle of least privilege. In cases where restricting such access is critical, we recommend creating a read-only service account and then generating a PAT for that limited account. Continue to Step 2 for this option.
Step 2: Create a new Viewer user
If you decide to create a new user for this integration (otherwise skip to Step 4 to use your current user to generate the Personal Access Token), navigate to the Users menu from the side navigation.
Click on “Add Users” and select “Add Users by Email“. A popup will appear with a Form to create new Users.
Select the “Tableau” radio button, and enter the email address you want. You will not share this email address with Stemma; you will use it to sign in to the Tableau instance in order to create a PAT.
Email Tip: If you use Google email, plus addressing (for example, [your_current_user]+stemma@[your-domain]; any suffix after the + works) will allow you to receive emails at your current email address and satisfy the uniqueness requirement.
Select “Viewer” as the Site role, which will give this user minimum access to Tableau, and click “Add Users” to finalize the user creation.
Step 3: Set up permissions for the user
The next step is to make sure the user has permission to read your externally connected data sources. This will be used to link the workbooks with the tables in Stemma. By default, a “Viewer” user will not have access to any data source.
Click on “External Assets” in the side navigation, then click “Select All“. A new dropdown will appear. Click on the “Actions” dropdown and select “Permissions“.
A Permissions dialog will appear under the “Permission Rules” section. By default, you will be on the “Database” tab.
Click “+ Add Group/User Rule” and select the user you created in Step 2 of this document, set the Template “View” and Save the rule.
Click on the “Tables” tab, and repeat the steps to add the rule for the user with the “View” template.
Stemma uses information about the upstream data sources in order to automatically create data lineage from your dashboard to your upstream table. In order to enable this, Stemma needs access to read the data source metadata. To enable the Stemma user permissions required by the Tableau Metadata API, you will need to provide “View” and “Connect” access to each data source and workbook that Stemma will read from.
Step 4: Create a Personal Access Token
Log in to your Tableau instance with the user you want to use for Stemma integration. If you created a new user for this integration, make sure you log in as that user.
Select My Account Settings from the dropdown of your profile.
Scroll down to the “Personal Access Tokens” section, enter the token name “stemmaViewer” and click “Create new token”.
A new pop-up will appear with the personal access token and secret. Make sure you copy these somewhere safe as they will not be displayed again.
Step 5: Provide the details to Stemma
Navigate to the Admin tab of the Stemma UI and choose Connections and Add New Connection
In the pop-up window, choose Tableau as the Connection type, provide a Connection Name, and then in the dialog box fill out the fields as follows:
Base URL Example: <https://sample.online.tableau.com>
Optional: You can opt to exclude or include some items by providing the following to Stemma by email or Slack:
Name of Projects to Exclude: If you want to exclude any sample projects or test projects from Stemma, provide the names of those projects.
External Asset Types to Include (Connection type): This is uncommon, but if you want to include and link only certain types of external assets with your Tableau workbooks in Stemma, provide the names of those assets you want to include and link. Note that this may limit the user experience of Stemma, and is not recommended.
Getting Started with the Stemma Business Glossary
What Is a Business Glossary?
A business glossary is a repository of business terms and their definitions that an organization uses to ensure the same definitions are used company-wide when analyzing data. It allows data analysts, business users, and data engineers to agree on the definitions of real-world terms, and how those are represented in the data; it serves as the source of truth for those definitions.
Used this way, the glossary produces a common vocabulary that everyone in the organization uses. This common vocabulary is a key component of data governance, ensuring that everyone means the same thing when they use the same term. For instance, one department may define the term “customer” as a company while another department may use the term to refer to an individual. A business glossary prevents such discrepancies.
In short, a business glossary provides canonical definitions of common terms and fosters shared understanding of those terms throughout the organization.
The Stemma Business Glossary
This first version of the Stemma Business Glossary allows you to define and group terms that are used in your organization and assign owners to those terms so that the definitions remain consistent and up to date. These terms can be specific to the data sources in the Stemma catalog, or they can be terms that are used throughout the organization or any part of it.
You can also link those terms to table columns in the Stemma catalog.
In the future, you will also be able to link terms to other resources in the catalog, and to queries and search results.
Using the Glossary
Populating the Glossary
To begin populating the glossary, proceed as follows.
From the navigation bar of the Stemma UI, choose Glossary:
Choose + New Group or + New Term.
In the dialog window, add a name and a definition or description.
You can use Markdown syntax for the description or definition.
A group is a collection of related terms. For example, you might want to group terms by department or function: finance, marketing, etc.
You can add a term to an existing group, or create a new term without assigning it to a group. (You can’t create a group on the fly while adding or editing a term.)
You can add related columns and one or more owners: start typing to find candidates.
When you are finished, click Save.
Updating the Glossary
To update the glossary:
From the navigation bar of the Stemma UI, choose Glossary.
To modify a group, click on the group and then click Edit. You can change the group’s attributes and also add new terms to it.
To modify a term, search for it by name, or click the > next to the group and select the term from the list. Then either:
Click Edit and make your changes to any or all attributes of the term. You can change the group the term belongs to, so long as the new group already exists.
OR
Click Delete to remove the term from the glossary.
When you are finished, click Save.
Connecting to AWS S3
Stemma supports two methods of transferring files:
Stemma creates an S3 bucket: you provide Stemma the ARN for a role/group that should have access to read from and write to the bucket. You then provide the role/group to the users or resources that need access to the bucket. Stemma recommends this method.
You create the S3 bucket: Stemma provides you the ARN of the role that the Stemma app uses. You will need to add the following policy to the bucket to allow Stemma access. Make sure you replace <BUCKET_NAME> and <ARN_FROM_STEMMA> with the appropriate values.