Docly

How Stemma Captures Data: SQL Parsing Overview and Caveats

Estimated reading: 3 minutes

What Stemma captures

When you execute SQL against your warehouse, the database captures the raw SQL that was executed and saves it in a historical format that includes the time that the SQL command was executed as well as the user who executed it. Stemma reads these SQL logs and parses all of the SQL commands that have been run, extracting several distinct pieces of information which appear in the Stemma UI on the Table Details page:

  • Table and column lineage: any change to a table– via an insert, update or delete from another table– will create the appropriate table and column lineage tracking.
  • User table usage: Stemma uses select statements to associate users to the table(s) being queried; this provides the Frequent Users information on the Table Details page, search rankings, and the user’s profile page.
  • Commonly Used With: any joins that occur within the SQL, whether on the top-level query or as part of a sub-query, are captured and presented on the Table Details page to help other users understand how two or more tables are commonly joined.

Caveats:

  • When lineage is created via insert into table (select * from ...) statements, Stemma does not capture column level lineage.
  • When user IDs in the database do not match the user portion of the user’s email address as recorded in Stemma, Stemma cannot associate the Stemma user to the user in the query logs. This prevents Stemma from creating any user-table usage information.
  • When system accounts are used to query the data, Stemma is often not able to see which user is executing the query. This is a common practice with BI tools when multiple team members share the same connection to a database and that connection uses a system account.
    • However, if you are using Mode, Stemma can parse the comments appended by Mode (which do contain the user ID) to find the user and associate that user to the query.
  • Commonly Used With does not capture joins where the left or right table is a dynamically generated select statement, for example: select * from table_a a join (select * from table_b b1 join table_c c1 on b1.id = c1.id) b on a.id = b.id.
  • Stemma currently supports Snowflake out of the box. Stemma can also parse RedshiftAthena, Postgres and other databases, but does not provide turnkey support for those databases at present.