Skip to content

""" Performance Optimization Implementation Guide

This document explains the performance optimizations added to the Purview CLI and how to use them in your codebase. """

============================================================================

1. LAZY CLI MODULE LOADING

============================================================================

What it does:

  • Defers importing CLI modules until they are first used
  • Reduces startup time for help/version-only commands
  • Modules are loaded on-demand via LazyGroup class

How it works:

  • CLI modules are registered in _MODULE_MAP in purviewcli/cli/cli.py
  • LazyGroup intercepts command access and loads modules on first use
  • Subsequent accesses use cached module

Usage (as a feature author):

  1. Create your new CLI module (e.g., purviewcli/cli/mymodule.py)
  2. Add it to _MODULE_MAP in cli.py: _MODULE_MAP = { "mymodule": "mymodule", # Simple case "mycommand": ("mymodule", "my_command"), # If module/command names differ }

Performance impact:

  • ~200-500ms faster startup for pvw --help or pvw --version
  • More noticeable on slow systems or with many modules

============================================================================

2. CLIENT SINGLETON CACHING

============================================================================

What it does:

  • Reuses client instances (Entity, Glossary, etc.) across commands
  • Eliminates repeated credential initialization
  • Scoped by authentication profile to prevent cross-profile contamination
  • Uses weak references to allow garbage collection

How it works:

  • Each client (Entity, Glossary, etc.) is created once per profile
  • Cached under a weak reference so it can be garbage collected if needed
  • Cache key: "{profile}:{ClassName}"

Usage (as a feature author):

Replace: from purviewcli.client._entity import Entity entity_client = Entity()

With: from purviewcli.client._entity import Entity from purviewcli.client.client_cache import get_cached_client entity_client = get_cached_client(Entity, profile=ctx.obj.get("profile", "default"))

Diagnostics:

pvw diagnostics cache-stats        # View cache statistics
pvw diagnostics clear-cache        # Clear cached clients

Performance impact:

  • ~500-1500ms saved per command (DefaultAzureCredential init cost)
  • More significant in scripts with many sequential commands
  • Negligible memory overhead due to weak references

============================================================================

3. LAZY CREDENTIAL LOADING

============================================================================

What it does:

  • Defers DefaultAzureCredential initialization until first API call
  • Skips auth overhead for help/version-only commands

How it works:

  • Credentials are NOT loaded at client instantiation
  • Credentials are loaded on first _make_request() call
  • Single global credential instance shared across clients

Already implemented in:

  • PurviewClient._initialize_session() - loads creds on first request
  • All derived client classes inherit this behavior

Usage (as a feature author):

No special changes needed! The system handles this automatically. Your client automatically uses lazy credential loading.

Performance impact:

  • ~100-300ms per command (credential init deferred until needed)
  • Impacts help/version significantly

============================================================================

4. READ-QUERY CACHING

============================================================================

What it does:

  • Caches results from READ operations (search, list, read)
  • Configurable TTL (time-to-live) per entry
  • Invalidates on mutations (write, delete)
  • Cache key includes filter parameters

How it works:

  • Simple in-memory dict with expiry tracking
  • Cache key = MD5(method_name + sorted_params)
  • Excludes auth fields (--token, --profile, etc.)
  • Thread-safe with locking

Usage (as a feature author):

For simple caching:

from purviewcli.client.query_cache import get_read_query_cache

cache = get_read_query_cache()

Get with automatic fetch:

result = cache.get( method_name="glossaryListTerms", params={"--glossaryGuid": "123"}, fetch_fn=lambda p: my_api_call(p), # Called on cache miss ttl_seconds=60, # Optional custom TTL )

Or manage manually:

result = cache.get("glossaryListTerms", {"--glossaryGuid": "123"}) if result is None: result = my_api_call(...) cache.put("glossaryListTerms", {"--glossaryGuid": "123"}, result)

Invalidate on mutations:

cache.invalidate("glossaryListTerms") # Clear specific method cache.invalidate() # Clear all

Configuration:

  • Default TTL: 60 seconds (configurable in client/query_cache.py)
  • Can override per-call with ttl_seconds parameter
  • TTL of 0 = no expiry

Diagnostics:

pvw diagnostics cache-stats        # See cache hit rate
pvw diagnostics clear-cache        # Reset cache

Performance impact:

  • 10-50ms per cached hit vs 500-3000ms per API call
  • Most beneficial for bulk operations or scripts with repeated queries
  • Hit rate depends on query patterns

============================================================================

5. TABLE RENDERING CACHE

============================================================================

What it does:

  • Pre-computes and reuses Rich table schemas
  • Avoids regenerating column definitions and styles

How it works:

  • TableSchemaCache stores column definitions by schema name
  • New tables are created from cached schema templates
  • Only data rows differ between instances

Pre-registered schemas:

  • "entity_summary": GUID, Type, Name, Qualified Name
  • "entity_list": GUID, Type, Name, Status
  • "glossary_terms": Term GUID, Name, Glossary, Status, Created
  • "classifications": Classification, Count, Percentage
  • "lineage_graph": Entity, Type, Level, Direction
  • "search_results": GUID, Type, Name, Owner, Score

Usage (as a feature author):

from purviewcli.cli.table_cache import create_cached_table

Create table from cached schema

table = create_cached_table("entity_list", title="My Entities")

Add data rows (only these vary, headers are cached)

table.add_row("guid-123", "DataSet", "MyDataset", "ACTIVE") table.add_row("guid-456", "Process", "MyProcess", "ACTIVE")

console.print(table)

Register new custom schema:

from purviewcli.cli.table_cache import get_table_cache

cache = get_table_cache() cache.register_schema("my_schema", [ {"header": "ID", "style": "cyan"}, {"header": "Name", "style": "bold white"}, {"header": "Status", "style": "yellow"}, ])

Use it:

table = create_cached_table("my_schema", title="Custom Report")

Performance impact:

  • ~10-50ms per table render vs ~50-100ms from scratch
  • Minimal memory overhead (schemas are shared)
  • More noticeable when generating many similar tables

============================================================================

6. BATCH API REQUEST SUPPORT (Not yet implemented)

============================================================================

Current status: PLANNED

This optimization would: - Aggregate sequential API calls into batch requests (where supported) - Respect 200ms throttling while coalescing requests - Reduce round-trip latency for bulk operations

Implementation requires: - Identifying which Purview endpoints support batching - Detecting bulk request patterns - Request coalescing in the API client layer

This is a future enhancement that can be implemented incrementally.


============================================================================

BEST PRACTICES

============================================================================

  1. Use the caches:
  2. Always use get_cached_client() instead of direct instantiation
  3. Use read-query caching for frequently-called operations
  4. Use table cache for report generation

  5. Invalidate smartly:

  6. Call cache.invalidate() after mutations (create, update, delete)
  7. Don't cache frequently changing data
  8. Use appropriate TTLs (60s for stable data, shorter for volatile)

  9. Profile-aware:

  10. Client caches are profile-scoped (no cross-profile contamination)
  11. Clear cache when switching profiles: diagnostics clear-cache

  12. Monitor cache health:

  13. Run pvw diagnostics cache-stats to check hit rates
  14. Low hit rate = re-tune TTLs or caching strategy
  15. High memory usage = shorten TTLs or invalidate more often

  16. Avoid anti-patterns:

  17. Don't use disk-based caching (stateless CLI principle)
  18. Don't cache mutable objects without deep copying
  19. Don't cache across different subscriptions without scoping

============================================================================

MEASURING PERFORMANCE IMPROVEMENTS

============================================================================

Before optimizations:

Measure-Command { pvw entity read --guid <guid> } | Select TotalMilliseconds

After optimizations:

Measure-Command { pvw entity read --guid <guid> } | Select TotalMilliseconds

Expected improvements: - First command: ~10-15% faster (lazy loading benefits) - Second identical command: ~40-60% faster (client caching benefits) - Repeated commands in script: ~70-80% faster (combined caching benefits)

Bulk operation profiling:

$sw = [Diagnostics.Stopwatch]::StartNew()
pvw entity delete --guid <guid1>, <guid2>, <guid3>
$sw.Stop()
Write-Host "Elapsed: $($sw.ElapsedMilliseconds)ms"

============================================================================

TROUBLESHOOTING

============================================================================

Q: Cache seems outdated A: Run pvw diagnostics clear-cache to reset

Q: Cross-profile data contamination A: Verify cache is profile-scoped. Should be automatic via LazyGroup.

Q: Memory usage increasing A: Check cache stats with pvw diagnostics cache-stats Reduce TTL or invalidate more aggressively

Q: Commands slower after optimization A: Check for credential initialization issues Enable debug: pvw --debug <command>

"""

if name == "main": print(doc)