Sharing additional metadata via CSV files

Along with a direct connection to various data sources, Stemma can ingest data using extracts in various file formats. One of the most commonly used file formats is CSV.

Below are some of the CSV formats Stemma ingests today, each with a sample file showing the column names and values we expect. Note that every file needs the column header literals (for example, db_name) as its first row, followed by the values.

GitHub Links

Used to provide a link to a GitHub source from Stemma’s Table detail page.

| db_name | cluster | schema | table_name | source | source_type |
|---|---|---|---|---|---|
| hive | gold | test_schema | test_table1 | https://github.com/amundsen-io/amundsen/ | github |

db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
source: the GitHub URL for the repository.
source_type: static value, “github”
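
As a rough illustration of the format above, the following sketch builds a GitHub Links file with Python's standard csv module. The function name build_github_links_csv and the constant GITHUB_LINK_COLUMNS are hypothetical names chosen for this example; only the column literals and the sample values come from the table above.

```python
import csv
import io

# Column header literals from the GitHub Links format, in the documented order.
GITHUB_LINK_COLUMNS = ["db_name", "cluster", "schema", "table_name", "source", "source_type"]


def build_github_links_csv(rows):
    """Return CSV text: the header row followed by one line per table."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=GITHUB_LINK_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # source_type is a static value for this format, so it is filled in here.
        writer.writerow({**row, "source_type": "github"})
    return buffer.getvalue()


csv_text = build_github_links_csv([
    {
        "db_name": "hive",
        "cluster": "gold",
        "schema": "test_schema",
        "table_name": "test_table1",
        "source": "https://github.com/amundsen-io/amundsen/",
    }
])
print(csv_text)
```

Writing the header from a single list of literals keeps the header row and the data rows in the same column order.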

Airflow

Used to generate a link between a table and an Airflow task. Once the data is ingested, you will see an “Airflow” button on the Stemma detail page for the table.

| task_id | dag_id | exec_date | application_url_template | db_name | schema | table_name | cluster |
|---|---|---|---|---|---|---|---|
| hive.test_schema.test_table1 | event_test | 2018-05-31T00:00:00 | https://airflow_host.net/admin/airflow/tree?dag_id=SUPER_AWESOME_DAG | hive | test_schema | test_table1 | gold |

task_id: the Airflow ID for the task
dag_id: the Airflow ID for the DAG
exec_date: the most recent execution timestamp for the task
application_url_template: the URL for the DAG in your Airflow instance
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
cluster: the name of the database within the data source.
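
The sketch below shows one way to assemble an Airflow row in Python, assuming that exec_date should follow the timestamp shape of the sample value (ISO 8601, second precision, no time zone). The helper name airflow_row and the constant AIRFLOW_COLUMNS are hypothetical; the column literals and sample values are taken from the table above.

```python
import csv
import io
from datetime import datetime

# Column header literals from the Airflow format, in the documented order.
AIRFLOW_COLUMNS = [
    "task_id", "dag_id", "exec_date", "application_url_template",
    "db_name", "schema", "table_name", "cluster",
]


def airflow_row(task_id, dag_id, last_run, dag_url, db_name, schema, table_name, cluster):
    """Build one record; last_run is a datetime for the most recent execution."""
    return {
        "task_id": task_id,
        "dag_id": dag_id,
        # Assumption: match the sample's shape, e.g. 2018-05-31T00:00:00.
        "exec_date": last_run.replace(microsecond=0).isoformat(),
        "application_url_template": dag_url,
        "db_name": db_name,
        "schema": schema,
        "table_name": table_name,
        "cluster": cluster,
    }


row = airflow_row(
    "hive.test_schema.test_table1", "event_test", datetime(2018, 5, 31),
    "https://airflow_host.net/admin/airflow/tree?dag_id=SUPER_AWESOME_DAG",
    "hive", "test_schema", "test_table1", "gold",
)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=AIRFLOW_COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerow(row)
airflow_csv = buffer.getvalue()
print(airflow_csv)
```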

Dagster

Used to link a Dagster op and/or job to a table in Stemma. Once the data is ingested, you will see a “Dagster” button on the Stemma detail page for the table.

| application_url_template | db_name | cluster | schema | table_name |
|---|---|---|---|---|
| https://dagster_host.net/dagster/tree?job=dagster_job | snowflake | prod | test_schema | test_table1 |

application_url_template: the URL for the job in your Dagster instance
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.

Dbt Cloud

Used to link a Dbt Cloud job to a table in Stemma. Once the data is ingested, you will see a “Dbt” button on the Stemma detail page for the table.

| application_url_template | db_name | cluster | schema | table_name |
|---|---|---|---|---|
| https://dbtcloud_host.net/dbtcloud/tree?job=dbtcloud_job | snowflake | prod | test_schema | test_table1 |

application_url_template: the URL for the job in your Dbt Cloud instance
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.

Table Tags

Stemma can create and assign tags via a CSV file, with the following format:

| db_name | cluster | schema | table_name | tags |
|---|---|---|---|---|
| hive | gold | test_schema | test_table1 | “tag1,tag2” |
| hive | gold | test_schema_2 | test_table2 | “tag1,tag3” |

db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
tags: the tags to be created. Tags are additive: existing tags on tables will not be modified.
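
Because several tags share one field and are separated by commas, that field has to be quoted in the CSV file. The sketch below assumes the quotation marks in the sample rows stand for ordinary CSV double quotes, which Python's csv writer adds automatically whenever a value contains a comma. The names build_tags_csv and TAG_COLUMNS are hypothetical.

```python
import csv
import io

# Column header literals from the Table Tags format.
TAG_COLUMNS = ["db_name", "cluster", "schema", "table_name", "tags"]


def build_tags_csv(table_tags):
    """table_tags maps (db_name, cluster, schema, table_name) to a list of tags."""
    buffer = io.StringIO()
    # The default quoting mode quotes only fields that contain a comma or a quote.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(TAG_COLUMNS)
    for (db_name, cluster, schema, table_name), tags in table_tags.items():
        writer.writerow([db_name, cluster, schema, table_name, ",".join(tags)])
    return buffer.getvalue()


tags_csv = build_tags_csv({
    ("hive", "gold", "test_schema", "test_table1"): ["tag1", "tag2"],
    ("hive", "gold", "test_schema_2", "test_table2"): ["tag1", "tag3"],
})
print(tags_csv)
```

With this approach a row holding a single tag is written without quotes, which standard CSV parsers read the same way.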

Table Description

| db_name | cluster | schema | table_name | description |
|---|---|---|---|---|
| hive | gold | test_schema | test_table1 | Description for table, optionally markdown. |

db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
description: the table description. Note that descriptions are destructive: if a description already exists, it will be overwritten.

Table Owners

| db_name | cluster | schema | table_name | owners |
|---|---|---|---|---|
| hive | gold | test_schema | test_table1 | [email protected],[email protected] |
| hive | gold | test_schema_2 | test_table2 | “#test-channel,#test-channel-2” |

db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
owners: email addresses or Slack channels. Owners are additive: existing owners of tables will not be modified.
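
Before uploading an owners file, it can help to read it back and confirm that each owners field splits the way you expect. The sketch below is a local check only and says nothing about how Stemma parses the file. It assumes Slack channels start with "#", as in the sample row; the addresses alice@example.com and bob@example.com are placeholders invented for this example, and split_owners is a hypothetical helper.

```python
import csv
import io

# Illustrative file contents; the email addresses are made-up placeholders.
sample = (
    "db_name,cluster,schema,table_name,owners\n"
    'hive,gold,test_schema,test_table1,"alice@example.com,bob@example.com"\n'
    'hive,gold,test_schema_2,test_table2,"#test-channel,#test-channel-2"\n'
)


def split_owners(field):
    """Split an owners field into (emails, slack_channels)."""
    owners = [owner.strip() for owner in field.split(",") if owner.strip()]
    # Assumption: anything starting with "#" is a Slack channel.
    channels = [owner for owner in owners if owner.startswith("#")]
    emails = [owner for owner in owners if not owner.startswith("#")]
    return emails, channels


for record in csv.DictReader(io.StringIO(sample)):
    emails, channels = split_owners(record["owners"])
    print(record["table_name"], emails, channels)
```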

Column Descriptions

| db_name | cluster | schema | table_name | col_name | description |
|---|---|---|---|---|---|
| hive | gold | test_schema | test_table | test_column | This is an example column description. |

db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
col_name: the column described.
description: the column description. Note that column descriptions are destructive: if an existing description is present, it will be overwritten.
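
Since every format depends on exact column header literals, a quick local check of the header row can catch typos such as col-name instead of col_name. The sketch below is not part of Stemma: the format keys (github_links, airflow, and so on), EXPECTED_COLUMNS, and check_header are hypothetical names for this example, and the check ignores column order because this page does not say whether order matters.

```python
import csv
import io

# Header literals for each format on this page; the dictionary keys are made-up labels.
EXPECTED_COLUMNS = {
    "github_links": ["db_name", "cluster", "schema", "table_name", "source", "source_type"],
    "airflow": ["task_id", "dag_id", "exec_date", "application_url_template",
                "db_name", "schema", "table_name", "cluster"],
    "dagster": ["application_url_template", "db_name", "cluster", "schema", "table_name"],
    "dbt_cloud": ["application_url_template", "db_name", "cluster", "schema", "table_name"],
    "table_tags": ["db_name", "cluster", "schema", "table_name", "tags"],
    "table_description": ["db_name", "cluster", "schema", "table_name", "description"],
    "table_owners": ["db_name", "cluster", "schema", "table_name", "owners"],
    "column_descriptions": ["db_name", "cluster", "schema", "table_name",
                            "col_name", "description"],
}


def check_header(format_name, csv_text):
    """Return (missing, unexpected) column names found in the first row of csv_text."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    expected = EXPECTED_COLUMNS[format_name]
    missing = [column for column in expected if column not in header]
    unexpected = [column for column in header if column not in expected]
    return missing, unexpected


bad_file = "db_name,cluster,schema,table_name,col-name,description\n"
print(check_header("column_descriptions", bad_file))
```

An empty pair of lists means the header row contains exactly the expected literals.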