Fabric Lakehouse Health Checks Make Optimization Practical. Here’s the Runbook.

Fabric Lakehouse table health loop

Microsoft Fabric just added a small feature that can change how teams maintain Lakehouse tables.

The new sp_get_table_health_metrics stored procedure gives SQL analytics endpoint users a T-SQL way to inspect Lakehouse table health before deciding whether Spark maintenance is needed.

That sounds narrow. It is not.

For teams serving Power BI, SQL users, downstream data products, or AI workflows from Lakehouse tables, this closes an annoying operational gap: the place where users feel the slowdown is often SQL, but the maintenance action usually happens in Spark.

Until now, a lot of teams handled that gap with guesswork.

Run OPTIMIZE every night. Compact everything on a schedule. Wait until dashboards get slow. Open a notebook. Inspect Delta files. Ask support. Hope the maintenance job was worth the compute.

The better pattern is simple:

Check table health first. Optimize only when the evidence says to.

That is the practical win here.

The problem this solves

Lakehouse tables can look fine logically while becoming less efficient physically.

The schema is still valid. The row counts still make sense. The reports still refresh. But over time the physical layout can drift:

too many small files
too many deleted rows
stale or missing checkpoints
uneven row distribution
invalid or weak file statistics
fragmented table layout after frequent writes, deletes, or merges

Users usually experience this as slower SQL queries or lagging Power BI reports.

Data engineers experience it as a vague support problem:

The dashboard is slow. Can you check the Lakehouse?

That is a bad starting point. It pushes the team into reactive troubleshooting instead of evidence-based maintenance.

sp_get_table_health_metrics gives the SQL side a diagnostic step. It does not replace Spark maintenance, but it gives teams a better way to decide when Spark maintenance is actually justified.

What the stored procedure gives you

Microsoft’s announcement describes a built-in stored procedure for the SQL analytics endpoint that returns table health signals for Lakehouse tables.

The useful part is not just one metric. It is the mix of signals:

PotentialAnomalyType
PotentialAnomalyDescription
snapshot and checkpoint versions
physical row counts
deleted row counts
file size distribution
row count distribution
deleted row distribution

That gives you a much better conversation than “the table feels slow”.

You can ask more specific questions:

Is this a small-file problem?
Are deleted rows accumulating?
Is the table missing a recent checkpoint?
Are file statistics valid?
Is this table actually healthy and the issue is somewhere else?

That last point matters.

A health check can save capacity by proving that a maintenance job is not needed.

The runbook I would use

I would not treat this as a one-off troubleshooting command. I would turn it into a small operational runbook.

Fabric Lakehouse table health decision matrix

1. Start with the critical tables

Do not begin by checking every table in the Lakehouse.

Start with tables that actually matter:

tables behind important Power BI semantic models
tables queried heavily through the SQL analytics endpoint
fact tables with frequent incremental writes
tables touched by merge, delete, or update patterns
tables used as context for AI agents or downstream apps
tables with known performance complaints

This keeps the first version focused.

A table nobody queries does not need the same operational attention as the table behind the CFO dashboard.

2. Run the health check before maintenance

The basic pattern is straightforward.

EXEC sp_get_table_health_metrics @table_name = 'schema.YourTable';

In a real runbook, I would capture the output instead of only looking at it manually.

For example, create a control table that stores:

run timestamp
workspace or environment
Lakehouse name
table name
anomaly type
anomaly description
selected file distribution metrics
maintenance decision
action taken
post-check result

The goal is not to create bureaucracy. The goal is to make maintenance reviewable.

If someone asks why a table was optimized yesterday, the answer should not be “because the schedule said so”.

The answer should be tied to the health output.

3. Classify the result

Use the anomaly fields as the first decision point.

A simple classification model can work well:

None
  No maintenance action by default.
  Log the result and continue monitoring.

Too many small files
  Candidate for compaction or OPTIMIZE.
  Check whether this table is written frequently in small batches.

Too many deleted rows
  Candidate for maintenance.
  Also review the upstream write, delete, or merge pattern.

No recent checkpoint
  Review checkpoint behavior and table activity.
  Decide whether maintenance should include checkpoint handling.

Invalid file statistics
  Investigate before routine optimization.
  Do not assume compaction is the only answer.

The exact action should depend on your workload, table size, freshness needs, and Fabric capacity behavior. The important change is that the action comes after diagnosis.

4. Decide, then act

The SQL analytics endpoint can diagnose table health. It is read-only, so it cannot perform the maintenance itself.

That means the runbook needs a handoff:

SQL health check identifies the condition
orchestration layer records the result
if action is needed, a Spark notebook or Lakehouse maintenance process runs the fix
a post-check confirms the result
the control table records what happened

This is the bridge that matters.

SQL sees the pain. Spark applies the fix. The pipeline connects the two.

Fabric Lakehouse health based optimization pipeline

A practical pipeline pattern

I would implement the first version with a small scheduled pipeline.

Not fancy. Just useful.

Step 1: Table list

Maintain a small configuration table:

CREATE TABLE dbo.LakehouseMaintenanceTargets
(
    TableName varchar(256) NOT NULL,
    Priority varchar(20) NOT NULL,
    Enabled bit NOT NULL,
    Owner varchar(100) NULL,
    Notes varchar(1000) NULL
);

Start with five to ten important tables. Add more only after the pattern works.

Step 2: Health check activity

For each enabled table, run the stored procedure and capture the result.

The exact mechanics will depend on how you orchestrate SQL activity in your Fabric environment, but the operating idea is the same: store the health output, not just the final action.

A simple log table might look like this:

CREATE TABLE dbo.LakehouseTableHealthLog
(
    RunId varchar(64) NOT NULL,
    CheckedAt datetime2 NOT NULL,
    TableName varchar(256) NOT NULL,
    PotentialAnomalyType int NULL,
    PotentialAnomalyDescription varchar(4000) NULL,
    Decision varchar(50) NOT NULL,
    ActionTaken varchar(100) NULL,
    Notes varchar(4000) NULL
);

The specific metric columns can be expanded after you inspect the procedure output in your environment.

Step 3: Decision rule

Keep the first decision rule conservative.

Example:

If no anomaly is detected:
  Log HEALTHY
  Skip Spark maintenance

If a known maintenance anomaly is detected:
  Log ACTION_REQUIRED
  Trigger Spark maintenance for that table

If the anomaly is unclear:
  Log REVIEW_REQUIRED
  Notify the owner instead of running automatic maintenance

That last branch is important.

Automation should not turn every warning into a compute job. Some signals need human review, especially early in the rollout.

Step 4: Spark maintenance

When the decision is ACTION_REQUIRED, run the maintenance action through Spark or the Lakehouse engine.

For tables with a clear small-file problem, that may mean running OPTIMIZE through a notebook.

I would keep this notebook parameterized:

# Parameters supplied by the pipeline
lakehouse_table = "schema.YourTable"

spark.sql(f"OPTIMIZE {lakehouse_table}")

Do not hard-code table names in five different notebooks. Pass the table name in, log the run, and keep the action traceable.

Step 5: Post-check

After maintenance, run the health check again.

This is the part teams often skip.

If a job consumed capacity, it should produce evidence that the table health improved or at least that the expected action completed.

The post-check does three useful things:

proves the maintenance job had an effect
catches cases where optimization did not solve the issue
gives you history for future threshold decisions

What I would measure

I would track both technical health and operational impact.

Technical health:

number of tables checked
number of anomalies detected
anomaly type frequency
maintenance actions triggered
post-check status
repeated anomalies on the same table

Operational impact:

query duration before and after maintenance for key SQL queries
Power BI refresh or report interaction patterns where available
Fabric capacity consumed by maintenance jobs
skipped maintenance runs because tables were healthy
incidents or user complaints tied to Lakehouse table performance

The skipped jobs are easy to overlook, but they matter.

If the health check prevents unnecessary optimization work, that is a real platform win.

Where this fits in a Fabric operating model

This feature belongs in the same conversation as monitoring, FinOps, data product ownership, and semantic model reliability.

A Lakehouse table that feeds several important reports is not just storage. It is part of the production data path.

For those tables, I would define:

owner
expected freshness
expected query pattern
health check frequency
maintenance decision rule
escalation path
last known health status

That sounds heavier than “run optimize nightly”, but it is actually lighter over time.

The team stops paying for blind maintenance and stops waiting for users to discover performance problems.

What I would avoid

I would avoid four traps.

Trap 1: Optimizing everything because you can

A health check is useful because it lets you avoid unnecessary work.

If every signal still leads to OPTIMIZE, the runbook has failed.

Trap 2: Treating the anomaly description as the whole diagnosis

The anomaly is a starting point. Pair it with workload knowledge.

A table written every few minutes will behave differently from a monthly snapshot table. A small-file pattern may be expected in one stage and unacceptable in another.

Trap 3: Ignoring the upstream write pattern

If a table keeps accumulating small files, compaction is only part of the answer.

Look upstream:

batch size
write frequency
partitioning choices
merge patterns
delete patterns
source system behavior

Maintenance cleans up the symptom. The upstream pattern often explains why it keeps coming back.

Trap 4: Not logging the decision

If the runbook cannot explain what it did, it is not a runbook. It is another black box.

Keep a small audit trail.

Table health, decision, action, result.

That is enough for a first version.

A simple first-week rollout

If I were adding this to a Fabric environment, I would do it in this order.

Day 1: Identify targets

Pick five important Lakehouse tables.

For each one, document the owner, main consumers, refresh pattern, and why it matters.

Day 2: Run manual checks

Run sp_get_table_health_metrics manually and review the output.

Do not automate yet. First understand what healthy and unhealthy look like in your own environment.

Day 3: Create the log table

Create the health log table and start storing results.

Even if the first version is manual, logging gives you history.

Day 4: Add a conservative pipeline

Automate the health check. Let the first version notify or log, not automatically optimize every table.

Day 5: Add one maintenance action

Choose one clear condition, such as a small-file anomaly on a high-value table, and trigger a parameterized Spark maintenance notebook.

Then run the post-check.

That is enough for a useful pilot.

The practical takeaway

This is a good Fabric update because it moves Lakehouse maintenance closer to how teams actually operate.

SQL users and Power BI users usually feel the performance issue first. Spark usually fixes the physical layout. sp_get_table_health_metrics gives teams a diagnostic bridge between those two worlds.

The feature is useful on its own. It becomes much more valuable when you turn it into a runbook:

pick critical tables
check health before maintenance
classify the anomaly
act only when needed
log the decision
run a post-check
adjust upstream write patterns when the same issue returns

That is the difference between scheduled guesswork and data engineering operations.

Good maintenance starts with evidence.

Source

Microsoft Fabric Updates Blog: Know before you optimize: Diagnose Lakehouse table health with a single T-SQL command (Generally Available)