clearview/docs/TECHNICAL.md
Ivo Oskamp b8446c0665 Initial commit — Clearview v0.1.0
Full application including FastAPI backend, PostgreSQL data model,
background scan worker, multi-tenant support, certificate authentication,
SharePoint REST scanner with hierarchical deduplication, SharingLinks
classification and post-scan resolve, Excel export, site filter in job
details, role name normalisation, and updated documentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:50:41 +02:00


TECHNICAL

Scope

Clearview scans SharePoint sites for permissions that deviate from the site root permission baseline. It is designed to monitor multiple customer tenants from a single instance.

Runtime Architecture

  • Single clearview application container (no separate API container).
  • postgres service for persistent job and result storage.
  • adminer service for direct database inspection.

All services are defined in stack/docker-compose.yml for Portainer deployment.

Application Layout

  • containers/clearview/site/
    • Frontend UI (tenant management, manual URL input, CSV import, jobs, deviations)
  • containers/clearview/src/clearview_app/
    • FastAPI backend
    • SQLAlchemy models
    • CSV parser
    • Default-site filtering
    • Background worker for long-running scans

Multi-Tenant Model

Clearview uses Tenant Profiles to manage multiple customer tenants from one instance.

Tenant Profiles

A tenant profile stores the Azure app credentials for one customer tenant:

| Field | Description |
| --- | --- |
| name | Label for internal reference (e.g. "Contoso") |
| tenant_id | Azure Directory (tenant) ID |
| client_id | Azure App (client) ID |
| client_secret | Azure App client secret |

Profiles are managed via the Tenants panel in the UI or directly via the API. When starting a scan, you select a profile from the dropdown — no manual credential entry needed.

Ad-hoc scans without a saved profile are still supported via Manual credentials in the scan form.

API Endpoints — Tenants

GET    /api/tenants                              List all profiles (client_secret not returned)
POST   /api/tenants                              Create a new profile
DELETE /api/tenants/{id}                         Delete a profile (jobs are retained, tenant link is cleared)
POST   /api/tenants/{id}/generate-certificate    Generate a self-signed certificate for this tenant

Certificate Authentication

Clearview supports app-only authentication via a self-signed certificate (recommended) or a client secret.

Generating a certificate:

  1. Click Certificate in the Tenants table.
  2. Clearview generates a self-signed RSA-2048 certificate valid for 2 years.
  3. Download the .cer file and upload it in Azure Portal → App registration → Certificates & secrets → Certificates.
  4. The private key is stored internally; Clearview uses it automatically when starting a scan.

The scanner authenticates with the certificate when cert_thumbprint is present on the tenant profile; otherwise it uses the client secret.

TenantProfile authentication fields:

| Field | Description |
| --- | --- |
| client_secret | Azure client secret (optional when a certificate is available) |
| cert_private_key | PEM-encoded private key (internal, never exposed via API) |
| cert_thumbprint | SHA-1 thumbprint (used by MSAL) |
| cert_expires_at | Certificate expiry date |
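
The selection rule above can be sketched as a small helper. This is a minimal illustration, not Clearview's actual code: `build_client_credential` is a hypothetical name, and the profile is modelled as a plain dict using the field names from the table. The return value follows the two shapes that MSAL for Python accepts for `client_credential` (a secret string, or a dict with `private_key` and `thumbprint`).

```python
from typing import Mapping, Optional, Union


def build_client_credential(
    profile: Mapping[str, Optional[str]],
) -> Union[str, dict]:
    """Pick certificate auth when a thumbprint is stored, else the secret.

    Hypothetical helper: the real scanner may structure this differently.
    """
    thumbprint = profile.get("cert_thumbprint")
    if thumbprint:
        private_key = profile.get("cert_private_key")
        if not private_key:
            raise ValueError("cert_thumbprint is set but cert_private_key is missing")
        # Dict form understood by MSAL's ConfidentialClientApplication.
        return {"private_key": private_key, "thumbprint": thumbprint}

    secret = profile.get("client_secret")
    if not secret:
        raise ValueError("profile has neither a certificate nor a client secret")
    return secret
```

The result would typically be passed as `client_credential=` when constructing an MSAL confidential client for the tenant.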

Scan Processing Model

Scans run asynchronously through a DB-backed job queue:

  1. User selects a tenant profile (or enters manual credentials) and submits URLs or a CSV.
  2. API validates and normalizes URLs.
  3. Default sites are skipped by rule (tenant root and app catalog).
  4. A scan job is queued in PostgreSQL, linked to the tenant profile when applicable.
  5. Background worker processes targets with retries and per-target timeout.
  6. API/UI expose progress and deviations per job.
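
Steps 2 and 3 could look roughly like the sketch below. The function names are hypothetical, and the exact normalisation rules are an assumption. The app catalog path `/sites/appcatalog` is the conventional location, but a tenant may use a different one, so treat it as a placeholder.

```python
from urllib.parse import urlsplit

# Assumption: the app catalog sits at its conventional path.
APP_CATALOG_PATH = "/sites/appcatalog"


def normalize_site_url(raw: str) -> str:
    """Lower-case the host and drop query, fragment and trailing slashes."""
    parts = urlsplit(raw.strip())
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"not an https URL: {raw!r}")
    return f"https://{parts.netloc.lower()}{parts.path.rstrip('/')}"


def is_default_site(url: str) -> bool:
    """True for the tenant root site (empty path) and the app catalog."""
    path = urlsplit(url).path.rstrip("/").lower()
    return path == "" or path == APP_CATALOG_PATH


def select_targets(raw_urls):
    """Normalise, skip default sites, and drop duplicates (order preserved)."""
    seen, targets = set(), []
    for raw in raw_urls:
        url = normalize_site_url(raw)
        key = url.lower()
        if is_default_site(url) or key in seen:
            continue
        seen.add(key)
        targets.append(url)
    return targets
```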

Timeout and Retry Controls

Configured through environment variables (defaults shown):

| Variable | Default | Description |
| --- | --- | --- |
| SCAN_TARGET_TIMEOUT_SEC | 3600 | Max seconds per target before it is marked failed |
| SCAN_TARGET_MAX_RETRIES | 2 | Number of retries on transient failure |
| SCAN_RETRY_BASE_DELAY_SEC | 2 | Base delay for exponential back-off between retries |
| SCAN_JOB_POLL_INTERVAL_SEC | 3 | How often the worker polls for new queued jobs |
| SCAN_HTTP_TIMEOUT_SEC | 30 | Per-request HTTP timeout toward SharePoint |
| SCAN_HTTP_MAX_RETRIES | 3 | Retries on HTTP 429/503 or connection errors |
| SCAN_LIST_PAGE_SIZE | 200 | Items per page when listing library contents |
| SCAN_MAX_ITEMS_PER_LIST | 10000 | Cap on items with unique permissions per library |
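
As an illustration of how these variables might be consumed, the sketch below reads them with the documented defaults and computes a back-off delay. The helper names are hypothetical, and the doubling formula (`base * 2^(attempt-1)`) is an assumption: the document only states that the back-off is exponential.

```python
import os
from typing import Mapping

# Defaults copied from the table above.
DEFAULTS = {
    "SCAN_TARGET_TIMEOUT_SEC": 3600,
    "SCAN_TARGET_MAX_RETRIES": 2,
    "SCAN_RETRY_BASE_DELAY_SEC": 2,
    "SCAN_JOB_POLL_INTERVAL_SEC": 3,
    "SCAN_HTTP_TIMEOUT_SEC": 30,
    "SCAN_HTTP_MAX_RETRIES": 3,
    "SCAN_LIST_PAGE_SIZE": 200,
    "SCAN_MAX_ITEMS_PER_LIST": 10000,
}


def load_scan_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return integer settings, falling back to the default when unset or empty."""
    settings = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name, "").strip()
        settings[name] = int(raw) if raw else default
    return settings


def retry_delay(attempt: int, base_delay_sec: int) -> int:
    """Delay before retry number `attempt` (1-based), doubling each time."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return base_delay_sec * 2 ** (attempt - 1)
```

With the default base delay of 2 seconds, this formula would wait 2 seconds before the first retry and 4 seconds before the second.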

Deviation Detection

The scanner retrieves SharePoint REST role assignments at four levels:

  • Site root
  • Document library
  • Folder
  • File

Only permissions added relative to the site root are stored as deviations (delta_type=added). No filesystem/NTFS permission model is used.
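
Conceptually, the comparison is a set difference against the site root baseline. The sketch below assumes role assignments have already been flattened into `(principal, role)` pairs; the function name and that data shape are illustrative, not taken from the Clearview source.

```python
def added_permissions(root_assignments, object_assignments):
    """Return the (principal, role) pairs present on the object but not on the root.

    Pairs that exist only on the site root are ignored, matching the
    "added only" rule described above.
    """
    baseline = set(root_assignments)
    return [pair for pair in object_assignments if pair not in baseline]


root = [("HR Owners", "Full Control"), ("HR Members", "Edit")]
folder = [("HR Owners", "Full Control"), ("alice@contoso.com", "Contribute")]

deviations = [
    {"principal": p, "role_name": r, "delta_type": "added"}
    for p, r in added_permissions(root, folder)
]
```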

Hierarchical Deduplication

After all deviations for a target are collected, they are post-processed: if a (principal, role) deviation is already reported at a parent URL (library or folder), the same deviation on child items is suppressed. This prevents an explosion of results when a single folder grant propagates to thousands of files.

Deduplication is pure in-memory post-processing — no additional API calls are made.
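
One way to implement this kind of suppression is sketched below. It assumes each deviation is a dict with `object_url`, `principal` and `role_name` keys and that parent/child relationships can be derived from URL prefixes; the actual implementation may track the hierarchy differently.

```python
def dedupe_deviations(deviations):
    """Drop a deviation when the same (principal, role) is kept on a parent URL."""
    kept = []
    reported = {}  # (principal, role) -> URLs where the deviation was kept

    # A parent URL is always shorter than its children, so sorting by
    # length guarantees parents are examined first.
    for dev in sorted(deviations, key=lambda d: len(d["object_url"])):
        key = (dev["principal"], dev["role_name"])
        url = dev["object_url"].rstrip("/")
        parents = reported.setdefault(key, [])
        # The trailing "/" avoids treating /Docs/A2 as a child of /Docs/A.
        if any(url.startswith(parent + "/") for parent in parents):
            continue
        parents.append(url)
        kept.append(dev)
    return kept
```

The function works only on the collected list, so it adds no SharePoint requests.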

Role Name Normalisation

SharePoint returns role names in the language configured for the tenant. Clearview normalises common Dutch role names to their English equivalents before storing them:

| Dutch | English |
| --- | --- |
| Volledig beheer | Full Control |
| Bijdragen | Contribute |
| Lezen | Read |
| Bewerken | Edit |
| Ontwerpen | Design |
| Beperkte toegang | Limited Access |
| Goedkeuren | Approve |
| Hiërarchieën beheren | Manage Hierarchy |
| Weergeven alleen | View Only |
| Beperkt lezen | Restricted Read |

Unknown role names are stored as-is.
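
A lookup table is enough to express this mapping. The sketch below is illustrative: `normalise_role` is a hypothetical name, and matching on the exact (whitespace-trimmed) Dutch name is an assumption about how strict the comparison is.

```python
DUTCH_TO_ENGLISH = {
    "Volledig beheer": "Full Control",
    "Bijdragen": "Contribute",
    "Lezen": "Read",
    "Bewerken": "Edit",
    "Ontwerpen": "Design",
    "Beperkte toegang": "Limited Access",
    "Goedkeuren": "Approve",
    "Hiërarchieën beheren": "Manage Hierarchy",
    "Weergeven alleen": "View Only",
    "Beperkt lezen": "Restricted Read",
}


def normalise_role(name: str) -> str:
    """Map a known Dutch role name to English; return anything else unchanged."""
    return DUTCH_TO_ENGLISH.get(name.strip(), name)
```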

SharingLinks

SharePoint creates internal groups named SharingLinks.{guid}.{LinkType}.{guid} whenever a user shares a file or folder via a sharing link. Clearview detects these and classifies them by risk:

| Link type | Risk | UI colour |
| --- | --- | --- |
| Anonymous* | Critical | Red |
| Flexible | High | Orange |
| Organization* | Low | Blue |
| Direct* | Low | Green |
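
The classification can be sketched with a regular expression over the group name. This is a hedged example: it reads the trailing `*` in the table as a prefix wildcard (so a link type such as AnonymousView would match Anonymous*), which is an assumption, and the "Unknown" fallback for unlisted link types is not taken from the source.

```python
import re

GROUP_RE = re.compile(
    r"^SharingLinks\."
    r"(?P<item_id>[0-9a-fA-F-]{36})\."
    r"(?P<link_type>[A-Za-z]+)\."
    r"(?P<share_id>[0-9a-fA-F-]{36})$"
)

# Prefix -> risk, following the table above.
RISK_BY_PREFIX = (
    ("Anonymous", "Critical"),
    ("Flexible", "High"),
    ("Organization", "Low"),
    ("Direct", "Low"),
)


def classify_sharing_link(principal: str):
    """Return (link_type, risk) for a SharingLinks group, or None otherwise."""
    match = GROUP_RE.match(principal)
    if match is None:
        return None
    link_type = match.group("link_type")
    for prefix, risk in RISK_BY_PREFIX:
        if link_type.startswith(prefix):
            return link_type, risk
    return link_type, "Unknown"  # assumption: unlisted types are flagged, not dropped
```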

Resolve Sharing Links: after a scan completes, the Job Details panel shows a Resolve Sharing Links section listing all SharingLinks types found in the job. The user selects which types to resolve and clicks Resolve. For each unique group, Clearview calls /_api/web/sitegroups/getbyname('{name}')/users with the job's stored credentials and writes the member list to permission_deviations.resolved_members. Anonymous links have no resolvable members; their resolved_members field is stored as an empty string and displayed as (public link) in the UI.

Anonymous and Flexible types are pre-selected by default. Organization and Direct types are available but unchecked by default.
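
To make the resolve call concrete, the sketch below builds the REST URL for one group and shows the display rule for empty member lists. Both function names are hypothetical. Doubling single quotes follows the OData convention for string literals; SharingLinks group names normally contain only letters, digits, dots and hyphens, so the escaping is defensive.

```python
from urllib.parse import quote


def group_users_url(site_url: str, group_name: str) -> str:
    """Build the sitegroups/getbyname(...)/users URL for one group."""
    # OData string literals escape a single quote by doubling it.
    literal = quote(group_name.replace("'", "''"), safe="")
    base = site_url.rstrip("/")
    return f"{base}/_api/web/sitegroups/getbyname('{literal}')/users"


def display_members(resolved_members: str) -> str:
    """An empty stored value (anonymous links) is shown as "(public link)"."""
    return "(public link)" if resolved_members == "" else resolved_members
```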

API Endpoints — Scan Jobs

GET    /api/scan-jobs                                   List jobs (optional ?tenant_profile_id=)
POST   /api/scan-jobs                                   Create job from URLs
POST   /api/scan-jobs/import-csv                        Create job from CSV upload
GET    /api/scan-jobs/{id}                              Get job detail (targets + deviations)
POST   /api/scan-jobs/{id}/cancel                       Cancel a queued or running job
DELETE /api/scan-jobs/{id}                              Delete a completed job and all its data
POST   /api/scan-jobs/{id}/resolve-sharing-links        Resolve SharingLinks group members post-scan
GET    /api/scan-jobs/{id}/export                       Download deviations as .xlsx (optional ?site_url=)
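
The optional query parameters listed above can be attached with the standard library. The helpers below are a small sketch with hypothetical names; the base URL is a placeholder, and the job ID is passed through as-is because its type is not specified here.

```python
from typing import Optional
from urllib.parse import urlencode


def scan_jobs_url(base_url: str, tenant_profile_id: Optional[int] = None) -> str:
    """URL for listing jobs, optionally filtered by tenant profile."""
    url = f"{base_url.rstrip('/')}/api/scan-jobs"
    if tenant_profile_id is not None:
        url += "?" + urlencode({"tenant_profile_id": tenant_profile_id})
    return url


def export_url(base_url: str, job_id, site_url: Optional[str] = None) -> str:
    """URL for the .xlsx export, optionally limited to one site."""
    url = f"{base_url.rstrip('/')}/api/scan-jobs/{job_id}/export"
    if site_url:
        # urlencode percent-encodes the nested URL so it survives as one value.
        url += "?" + urlencode({"site_url": site_url})
    return url
```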

Job Details UI

The Selected Job Details panel provides:

  • Site filter — dropdown populated from the job's targets; filters both the Targets and Deviations tables client-side without a new API call.
  • Export Excel — downloads a .xlsx with two sheets:
    • Targets: URL, status, attempts, error, timestamps
    • Deviations: Site URL, Object URL (relative to site), Object Type, Principal, Link Risk (colour-coded), Resolved Members, Role, Delta — sorted by Site URL → Object URL → Principal
  • Resolve Sharing Links — see SharingLinks section above.

CSV Import

The expected input is a CSV file in the Microsoft Sites export format.

  • URL column is auto-detected (URL / Site URL / SiteUrl).
  • UTF-8 BOM is supported.
  • Duplicate URLs are de-duplicated.
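
The three rules above fit in a short parser. The following is a sketch rather than the actual Clearview CSV parser: the function name is hypothetical, and matching headers case-insensitively and treating URLs that differ only by case or a trailing slash as duplicates are assumptions.

```python
import csv
import io

# Accepted header names, compared case-insensitively (assumption).
URL_COLUMNS = ("url", "site url", "siteurl")


def extract_site_urls(data: bytes) -> list:
    """Return unique site URLs from an uploaded CSV, in file order."""
    # "utf-8-sig" strips a UTF-8 BOM when present and is harmless otherwise.
    text = data.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text, newline=""))
    column = next(
        (f for f in (reader.fieldnames or []) if f.strip().lower() in URL_COLUMNS),
        None,
    )
    if column is None:
        raise ValueError("no URL column found (expected URL, Site URL or SiteUrl)")

    seen, urls = set(), []
    for row in reader:
        url = (row.get(column) or "").strip()
        key = url.rstrip("/").lower()
        if url and key not in seen:
            seen.add(key)
            urls.append(url)
    return urls
```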

Data Model

Main tables:

| Table | Key columns |
| --- | --- |
| tenant_profiles | credentials, cert_private_key, cert_thumbprint, cert_expires_at |
| scan_jobs | status, tenant_profile_id, progress counters, auth credentials |
| scan_targets | job_id, site_url, status, attempts, error_message |
| permission_deviations | job_id, site_url, object_url, object_type, principal, role_name, delta_type, resolved_members |

Scan jobs, targets, and deviations are cascade-deleted when a job is removed via DELETE /api/scan-jobs/{id}. Jobs with status queued or running cannot be deleted.

Schema migrations for new columns are applied automatically on startup via _ensure_schema_columns() in main.py.
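
The general idea of an "add the column if it is missing" startup step is illustrated below. This is not the body of `_ensure_schema_columns()`: it uses an in-memory SQLite database as a stand-in so the example runs anywhere, whereas Clearview targets PostgreSQL, where the existing columns would be read from the database catalog instead of `PRAGMA table_info`. The function name is hypothetical.

```python
import sqlite3


def ensure_columns(conn: sqlite3.Connection, table: str, columns: dict) -> list:
    """Add each missing column to `table`; return the names that were added.

    `table` and `columns` must come from trusted constants in code,
    because identifiers cannot be passed as bound parameters.
    """
    # Each PRAGMA table_info row is (cid, name, type, notnull, default, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, ddl_type in columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {ddl_type}")
            added.append(name)
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permission_deviations (id INTEGER PRIMARY KEY, principal TEXT)")
first_run = ensure_columns(conn, "permission_deviations", {"resolved_members": "TEXT"})
second_run = ensure_columns(conn, "permission_deviations", {"resolved_members": "TEXT"})
```

Running the step twice is safe: the second call finds the column and changes nothing.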

Build and Release

Use ./build-and-push.sh from repo root.

  • ./build-and-push.sh t for a test build (:dev tag only)
  • ./build-and-push.sh 1 for a patch release
  • ./build-and-push.sh 2 for a minor release
  • ./build-and-push.sh 3 for a major release

Current Scan Mode

SHAREPOINT_SCAN_MODE=sharepoint_app_only is active by default.

Azure app-only credentials are resolved per scan job from the linked tenant profile, or from the raw credentials submitted with the job when no profile is used.

Entra App Registration — two modes

The UI automatically detects which mode is active via GET /api/onboarding/status. The onboarding flow is accessed from the Add Tenant form in the Tenants panel.

Mode A — Automated (platform app configured)

Requires a pre-registered Clearview platform app in Microsoft Entra ID (formerly Azure AD) with permission to create apps in customer tenants (Application.ReadWrite.All on Microsoft Graph).

Set the following in stack/.env:

ONBOARDING_CLIENT_ID=<platform-app-client-id>
ONBOARDING_CLIENT_SECRET=<platform-app-client-secret>
ONBOARDING_REDIRECT_URI=https://<your-clearview-domain>/api/onboarding/microsoft/callback

Flow per customer tenant:

  1. Click Add Tenant in the UI and then Connect Microsoft.
  2. Approve admin consent in the customer's Microsoft tenant.
  3. UI receives tenant context from the OAuth callback and pre-fills the tenant ID.
  4. Click Create Scan App Automatically to create a tenant-local scan app via Graph API.
  5. Clearview assigns SharePoint Sites.FullControl.All and generates a client secret.
  6. Enter a name and click Save Tenant to store the profile.

Mode B — Manual (no platform app configured)

When the ONBOARDING_* env vars are empty, the UI shows step-by-step instructions to create the scan app manually per customer tenant:

  1. Azure Portal → Entra ID → App registrations → New registration (Single tenant).
  2. Copy Directory (tenant) ID and Application (client) ID from the Overview page.
  3. API permissions → Add → SharePoint → Application permissions → Sites.FullControl.All → Grant admin consent.
  4. Certificates & secrets → New client secret → copy the value (shown once).
  5. Enter the details in Add Tenant and click Save Tenant.
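
The choice between Mode A and Mode B could be derived from the environment along the lines below. This is a sketch under assumptions: the function name and the `"automated"` / `"manual"` labels are hypothetical, the actual response shape of GET /api/onboarding/status is not documented here, and requiring all three ONBOARDING_* variables for Mode A is a guess at the rule.

```python
import os
from typing import Mapping

ONBOARDING_VARS = (
    "ONBOARDING_CLIENT_ID",
    "ONBOARDING_CLIENT_SECRET",
    "ONBOARDING_REDIRECT_URI",
)


def onboarding_mode(env: Mapping[str, str] = os.environ) -> str:
    """Return "automated" when the platform app is fully configured, else "manual"."""
    configured = all(env.get(name, "").strip() for name in ONBOARDING_VARS)
    return "automated" if configured else "manual"
```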