Sharing additional metadata via CSV files
Along with a direct connection to various data sources, Stemma can ingest data using extracts in various file formats. One of the most commonly used file formats is CSV.
Below are some of the CSV formats Stemma ingests today (along with the sample files), with the column names and values we expect in each file. Note that each file needs the column header literals (for example, db_name) as well as the values.
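Because the header literals are required, it can help to check a file's header row before handing it to Stemma. The following is a minimal sketch, not part of Stemma itself: the `EXPECTED_HEADERS` keys and the `missing_headers` helper are hypothetical names, the column lists are copied from three of the tables on this page, and the check assumes column order does not matter.

```python
import csv
import io

# Expected header literals per file type, copied from the tables on this page.
# The dictionary keys are illustrative labels, not names Stemma defines.
EXPECTED_HEADERS = {
    "github_links": ["db_name", "cluster", "schema", "table_name", "source", "source_type"],
    "table_tags": ["db_name", "cluster", "schema", "table_name", "tags"],
    "column_descriptions": ["db_name", "cluster", "schema", "table_name", "col_name", "description"],
}


def missing_headers(csv_text, file_type):
    """Return the expected column names that are absent from the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in EXPECTED_HEADERS[file_type] if name not in present]


sample = 'db_name,cluster,schema,table_name,tags\nhive,gold,test_schema,test_table1,"tag1,tag2"\n'
print(missing_headers(sample, "table_tags"))    # nothing missing for a tags file
print(missing_headers(sample, "github_links"))  # lists the columns a GitHub links file would still need
```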
GitHub Links
Used to provide a link to a GitHub source from Stemma’s Table detail page.
db_name | cluster | schema | table_name | source | source_type |
---|---|---|---|---|---|
hive | gold | test_schema | test_table1 | https://github.com/amundsen-io/amundsen/ | github |
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
source: the GitHub URL for the repository.
source_type: static value, “github”.
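One way to produce a file in this format is with Python's standard `csv` module. This is a sketch under assumptions: `build_github_links_csv` is a hypothetical helper, the row reuses the sample values from the table above, and how the resulting file is delivered to Stemma is outside its scope.

```python
import csv
import io

# Column header literals for the GitHub links file, as listed above.
FIELDNAMES = ["db_name", "cluster", "schema", "table_name", "source", "source_type"]


def build_github_links_csv(rows):
    """Serialize table-to-repository links, filling in the static source_type."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()  # the header row is required, not optional
    for row in rows:
        writer.writerow({**row, "source_type": "github"})
    return buffer.getvalue()


csv_text = build_github_links_csv([
    {
        "db_name": "hive",
        "cluster": "gold",
        "schema": "test_schema",
        "table_name": "test_table1",
        "source": "https://github.com/amundsen-io/amundsen/",
    },
])
print(csv_text)
```

To write to disk instead of a string, the same `DictWriter` can wrap a file opened with `open(path, "w", newline="")`.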
Airflow
Used to generate a link between a table and an Airflow task. Once the data is ingested, you will see an “Airflow” button on the Stemma detail page for the table.
task_id | dag_id | exec_date | application_url_template | db_name | schema | table_name | cluster |
---|---|---|---|---|---|---|---|
hive.test_schema.test_table1 | event_test | 2018-05-31T00:00:00 | https://airflow_host.net/admin/airflow/tree?dag_id=SUPER_AWESOME_DAG | hive | test_schema | test_table1 | gold |
task_id: the Airflow ID for the task
dag_id: the Airflow ID for the DAG
exec_date: the most recent execution timestamp for the task
application_url_template: the URL for the DAG in your Airflow instance
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
cluster: the name of the database within the data source.
Dagster
Used to link a Dagster op and/or job to a table in Stemma. Once the data is ingested, you will see a “Dagster” button on the Stemma detail page for the table.
application_url_template | db_name | cluster | schema | table_name |
---|---|---|---|---|
https://dagster_host.net/dagster/tree?job=dagster_job | snowflake | prod | test_schema | test_table1 |
application_url_template: the URL for the job in your Dagster instance
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
Table Tags
Stemma can create and assign tags via a CSV file, with the following format:
db_name | cluster | schema | table_name | tags |
---|---|---|---|---|
hive | gold | test_schema | test_table1 | "tag1,tag2" |
hive | gold | test_schema_2 | test_table2 | "tag1,tag3" |
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
tags: the tags to be created. Tags are additive: existing tags on tables will not be modified.
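Since several tags share one comma-separated field, that field has to be quoted so the commas are not read as column separators. The sketch below assumes standard CSV quoting with straight double quotes; `tags_row` is a hypothetical helper. Python's `csv.writer` adds the quotes automatically when a field contains a comma.

```python
import csv
import io


def tags_row(db_name, cluster, schema, table_name, tags):
    """Build one CSV data row, joining the tags into a single quoted field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The default QUOTE_MINIMAL setting quotes only fields that need it,
    # such as the joined tags field, which contains a comma.
    writer.writerow([db_name, cluster, schema, table_name, ",".join(tags)])
    return buffer.getvalue()


line = tags_row("hive", "gold", "test_schema", "test_table1", ["tag1", "tag2"])
print(line)

# Reading the row back yields five fields; the last one splits into the tags.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed[4].split(","))
```

The same approach should apply to any other multi-valued column that uses commas inside a single field.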
Table Description
db_name | cluster | schema | table_name | description |
---|---|---|---|---|
hive | gold | test_schema | test_table1 | Description for table, optionally markdown. |
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
description: the table description. Note that descriptions are destructive: if a description already exists, it will be overwritten.
Table Owners
db_name | cluster | schema | table_name | owners |
---|---|---|---|---|
hive | gold | test_schema | test_table1 | "[email protected],[email protected]" |
hive | gold | test_schema_2 | test_table2 | "#test-channel,#test-channel-2" |
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
owners: email addresses or Slack channels. Owners are additive: existing owners of tables will not be modified.
Column Descriptions
db_name | cluster | schema | table_name | col_name | description |
---|---|---|---|---|---|
hive | gold | test_schema | test_table | test_column | This is an example column description. |
db_name: the data source name, for example, hive, snowflake, athena, redshift, etc.
cluster: the name of the database within the data source.
schema: the name of the schema within the database.
table_name: the name of the table within the schema.
col_name: the name of the column being described.
description: the column description. Note that column descriptions are destructive: if an existing description is present, it will be overwritten.