Skip to content

Graph-Based Lineage

SQL Glider's graph feature lets you build, merge, and query column-level lineage across multiple SQL files. This is the key feature for understanding data flow at scale — across an entire SQL codebase rather than a single query.

Why Graph Lineage?

The basic sqlglider lineage command analyzes one SQL file at a time. That works for understanding a single query, but real data pipelines span many files: staging tables feed into dimensions, dimensions feed into reports, and so on.

The graph commands connect the dots across files, so you can answer questions like:

  • "What raw sources ultimately feed into this report column?"
  • "If I change this column in the staging table, what downstream reports break?"

Tutorial: Your First Lineage Graph

This walkthrough uses a simple three-file pipeline to demonstrate the full workflow.

Step 1: Set Up Example SQL Files

Create a queries/ directory with three files representing a typical data pipeline:

queries/customers.sql — Staging layer

-- Customer dimension table
SELECT
    customer_id,
    customer_name,
    email,
    created_at
FROM raw_customers;

queries/orders.sql — Staging layer

-- Orders fact table
SELECT
    order_id,
    customer_id,
    order_total,
    order_date,
    status
FROM raw_orders;

queries/reports.sql — Reporting layer (joins the two staging tables)

-- Customer orders summary report
SELECT
    c.customer_name,
    c.email,
    COUNT(o.order_id) AS total_orders,
    SUM(o.order_total) AS total_spent,
    MAX(o.order_date) AS last_order_date
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name, c.email;

Step 2: Build the Graph

Build a lineage graph from the entire directory:

sqlglider graph build queries/ -o graph.json

This recursively scans queries/ for .sql files, parses each one, and writes the combined lineage graph to graph.json.

Useful build options

# Non-recursive (current directory only)
sqlglider graph build queries/ -o graph.json --no-recursive

# Custom file pattern
sqlglider graph build queries/ -g "*.spark.sql" -o graph.json

# Specific dialect
sqlglider graph build queries/ -o graph.json --dialect postgres

Step 3: Query Upstream Lineage

Find out where a report column gets its data from:

sqlglider graph query graph.json --upstream total_spent

Example output:

                              Sources for 'total_spent'
+---------------------------------------------------------------------------------------------+
| Column      | Table  | Hops | Root | Leaf | Paths                             | File        |
|-------------+--------+------+------+------+-----------------------------------+-------------|
| order_total | orders |    1 |  Y   |  N   | orders.order_total -> total_spent | reports.sql |
+---------------------------------------------------------------------------------------------+

Total: 1 column(s)

Here's how to read each column:

Field Description
Column The source column name
Table The table that column belongs to
Hops How many steps away from the queried column (1 = direct source)
Root Y if this is an ultimate source with no further upstream dependencies
Leaf Y if this is a final output with no further downstream consumers
Paths The full data flow path, read left to right with -> separating each step
File Which SQL file defines this relationship

In this example, total_spent has one source: orders.order_total from reports.sql. It's 1 hop away, and it's a root node (Y) meaning orders.order_total is an ultimate source — there's nothing further upstream in this graph. It's not a leaf (N) because it feeds into total_spent.

Step 4: Query Downstream Lineage

Now ask the reverse question — if orders.order_total changes, what's affected?

sqlglider graph query graph.json --downstream orders.order_total

Example output:

                          Affected Columns for 'orders.order_total'
+--------------------------------------------------------------------------------------------+
| Column      | Table | Hops | Root | Leaf | Paths                             | File        |
|-------------+-------+------+------+------+-----------------------------------+-------------|
| total_spent |       |    1 |  N   |  Y   | orders.order_total -> total_spent | reports.sql |
+--------------------------------------------------------------------------------------------+

Total: 1 column(s)

This tells you that changing orders.order_total would impact total_spent, which is a leaf node (Y) — a final output with no further downstream consumers.

Impact analysis

Downstream queries are the foundation of impact analysis. Before changing a source column, run a downstream query to understand the blast radius.

Step 5: Export Results

Query results support multiple output formats:

# JSON for programmatic use
sqlglider graph query graph.json --upstream total_spent -f json

# CSV for spreadsheets
sqlglider graph query graph.json --upstream total_spent -f csv

Example: Upstream Lineage in JSON

JSON output is especially useful when you want to feed lineage data into other tools — CI checks, data catalogs, documentation generators, or custom dashboards. Here's what the upstream query for total_spent looks like in JSON:

sqlglider graph query graph.json --upstream total_spent -f json
{
  "query_column": "total_spent",
  "direction": "upstream",
  "count": 1,
  "columns": [
    {
      "identifier": "orders.order_total",
      "file_path": "queries/reports.sql",
      "query_index": 0,
      "schema_name": null,
      "table": "orders",
      "column": "order_total",
      "hops": 1,
      "output_column": "total_spent",
      "is_root": true,
      "is_leaf": false,
      "paths": [
        ["orders.order_total", "total_spent"]
      ]
    }
  ]
}

Each entry in columns tells you:

Field Description
identifier Fully qualified source column (table.column)
file_path The SQL file that defines this relationship
query_index Which query in the file (0-based, for multi-statement files)
table / column Parsed components of the identifier
hops Distance from the queried column (1 = direct source)
is_root true if this column has no further upstream sources
is_leaf true if this column has no downstream consumers
paths Every path from this source to the queried column, as ordered node lists

Tracing lineage across files

The file_path field is particularly valuable in large graphs spanning dozens or hundreds of SQL files. It tells you exactly which file defines each relationship, so you can trace a column's journey across your entire pipeline — from raw ingestion scripts through staging layers to final reports — without having to search through files manually.

Using JSON output in CI

JSON output pairs well with tools like jq for scripted checks. For example, to list all root source tables feeding a report column:

sqlglider graph query graph.json --upstream total_spent -f json \
  | jq -r '.columns[] | select(.is_root) | .table'

Visualizing Lineage

SQL Glider can generate diagrams from lineage graphs in Mermaid, DOT (Graphviz), and Plotly formats. Mermaid and DOT are text-based diagram languages that render in many tools — Mermaid works natively in GitHub Markdown, GitLab, Notion, and more; DOT can be rendered with Graphviz into SVG, PNG, or PDF. Plotly outputs JSON that can be loaded into interactive Plotly/Dash applications.

Visualize an Entire Graph

The graph visualize command renders every node and edge in a graph file:

# Mermaid (default)
sqlglider graph visualize graph.json

# DOT (Graphviz)
sqlglider graph visualize graph.json -f dot

# Plotly JSON (for interactive visualization)
sqlglider graph visualize graph.json -f plotly

# Save to file
sqlglider graph visualize graph.json -o lineage.mmd
sqlglider graph visualize graph.json -f dot -o lineage.dot
sqlglider graph visualize graph.json -f plotly -o lineage.json

Diagram Output from Queries

The graph query command also supports diagram formats. This is useful for visualizing just the upstream or downstream subgraph for a specific column:

# Mermaid diagram of upstream sources
sqlglider graph query graph.json --upstream total_spent -f mermaid

# DOT diagram of downstream impact
sqlglider graph query graph.json --downstream orders.order_total -f dot

# Plotly JSON for interactive exploration
sqlglider graph query graph.json --upstream total_spent -f plotly

Query diagrams include color-coded nodes, edge labels showing the source SQL file, and a legend:

Color Meaning
Amber The queried column
Teal Root node (no upstream dependencies)
Violet Leaf node (no downstream consumers)

Each edge in the diagram is labeled with the SQL filename that defines that relationship, making it easy to trace data flow back to the source code.

Example: Imagine a pipeline where a revenue report column draws from multiple sources through several transformation layers. Querying --upstream revenue would produce a diagram like this:

flowchart TD
    raw_orders_amount["raw_orders.amount"]
    raw_orders_tax["raw_orders.tax"]
    raw_exchange_rates_rate["raw_exchange_rates.rate"]
    staging_orders_subtotal["staging_orders.subtotal"]
    staging_orders_tax_amount["staging_orders.tax_amount"]
    mart_orders_total_usd["mart_orders.total_usd"]
    revenue["revenue"]
    raw_orders_amount -->|staging_orders.sql| staging_orders_subtotal
    raw_orders_tax -->|staging_orders.sql| staging_orders_tax_amount
    staging_orders_subtotal -->|mart_orders.sql| mart_orders_total_usd
    staging_orders_tax_amount -->|mart_orders.sql| mart_orders_total_usd
    raw_exchange_rates_rate -->|mart_orders.sql| mart_orders_total_usd
    mart_orders_total_usd -->|reports.sql| revenue

    style revenue fill:#e6a843,stroke:#b8860b,stroke-width:3px
    style raw_orders_amount fill:#4ecdc4,stroke:#2b9e96
    style raw_orders_tax fill:#4ecdc4,stroke:#2b9e96
    style raw_exchange_rates_rate fill:#4ecdc4,stroke:#2b9e96

    subgraph Legend
        legend_queried["Queried Column"]
        legend_root["Root (no upstream)"]
        legend_leaf["Leaf (no downstream)"]
    end
    style legend_queried fill:#e6a843,stroke:#b8860b,stroke-width:3px
    style legend_root fill:#4ecdc4,stroke:#2b9e96
    style legend_leaf fill:#c084fc,stroke:#7c3aed

The amber node is the column you queried (revenue), teal nodes are ultimate root sources with no further upstream dependencies (raw_orders.amount, raw_orders.tax, raw_exchange_rates.rate), and intermediate nodes (staging_orders.*, mart_orders.*) appear in the default style. Each edge is labeled with the SQL file that defines that relationship. The legend is included automatically.

Mermaid Markdown Format

The mermaid-markdown format wraps the diagram in a fenced code block, ready to paste directly into a .md file:

# Output a markdown-ready Mermaid block
sqlglider graph query graph.json --upstream total_spent -f mermaid-markdown

# Works with visualize too
sqlglider graph visualize graph.json -f mermaid-markdown -o lineage.md

This produces output with the ```mermaid fence included, so the diagram renders automatically when viewed in GitHub, GitLab, or any markdown tool with Mermaid support.

Plotly Interactive Visualization

The plotly format outputs a JSON figure specification that can be loaded into Plotly or Dash applications for interactive visualization with zooming, panning, and hover details.

Optional Dependency

Plotly output requires an optional dependency. Install it with:

pip install sql-glider[plotly]

Better Layout with Graphviz

For complex graphs with many nodes and edges, installing Graphviz on your system enables an optimized layout algorithm that minimizes edge crossings. Without Graphviz, a simpler layout is used that may result in overlapping edges.

# Install Graphviz first (system package)
# Windows: choco install graphviz
# macOS: brew install graphviz
# Linux: apt install graphviz graphviz-dev

# Then install sql-glider with plotly extra
pip install sql-glider[plotly]

Example usage with Dash:

import plotly.io as pio
from dash import Dash, dcc, html

# Load the JSON output from sqlglider
with open("lineage.json") as f:
    fig = pio.from_json(f.read())

# Create an interactive Dash app
app = Dash(__name__)
app.layout = html.Div([
    html.H1("Lineage Graph"),
    dcc.Graph(figure=fig, style={"height": "80vh"})
])

if __name__ == "__main__":
    app.run(debug=True)

The Plotly output uses the same color scheme as Mermaid and DOT diagrams:

Color Meaning
Amber The queried column
Teal Root node (no upstream dependencies)
Violet Leaf node (no downstream consumers)

See an example of a plot loaded into Dash below:

Plotly-Dash Example

Rendering Diagrams

Mermaid:

  • Paste into any GitHub Markdown file or issue — it renders automatically
  • Use the Mermaid Live Editor for interactive previewing
  • Install mermaid-cli for local PNG/SVG export: mmdc -i lineage.mmd -o lineage.png

DOT (Graphviz):

  • Install Graphviz and render with: dot -Tpng lineage.dot -o lineage.png
  • Use -Tsvg for scalable vector output
  • Many IDEs have Graphviz preview extensions

Plotly:

  • Load the JSON into any Plotly-compatible environment (Python, Dash, Jupyter notebooks)
  • SQL Glider includes a viewer script: python examples/plotly_viewer.py lineage.json
  • Export to static images with fig.write_image("lineage.png")

Building Graphs from Multiple Sources

Explicit File List

sqlglider graph build file1.sql file2.sql file3.sql -o graph.json

Manifest File

For large projects with mixed dialects, use a manifest CSV:

file_path,dialect
queries/spark/orders.sql,spark
queries/postgres/users.sql,postgres
queries/snowflake/reports.sql,snowflake
sqlglider graph build --manifest manifest.csv -o graph.json

Directory with Custom Glob

# Only process files matching a pattern
sqlglider graph build ./sql/ -g "*.spark.sql" -o graph.json

Merging Graphs

If you build graphs separately (e.g., different teams or CI jobs), you can merge them:

# Merge specific files
sqlglider graph merge team_a.json team_b.json -o combined.json

# Merge with a glob pattern
sqlglider graph merge --glob "graphs/*.json" -o combined.json

Merging deduplicates nodes and edges automatically — if the same column appears in multiple graphs, it's stored once in the merged result.

Schema Resolution

By default, SQL Glider resolves lineage based on what it can infer from the SQL alone. For queries with SELECT *, this may not be enough. Schema resolution enables a two-pass approach:

# Pass 1 extracts schema from your SQL, Pass 2 uses it for resolution
sqlglider graph build queries/ -o graph.json --resolve-schema

You can also provide a schema file directly:

sqlglider graph build queries/ -o graph.json --provide-schema schema.json

Or dump the extracted schema for inspection:

sqlglider graph build queries/ -o graph.json --resolve-schema --dump-schema schema.json

Templating Integration

If your SQL files use Jinja2 templates, pass templating options during the build:

sqlglider graph build queries/ -o graph.json \
    --templater jinja \
    --var schema=production \
    --var env=prod