TECHNICAL

Scope

Clearview scans Microsoft 365 for permission deviations across two domains:

  1. SharePoint sites — deviations relative to the site root permission baseline (libraries, folders, files).
  2. Exchange Online mailboxes — non-default permissions: Full Access, Send As, Send on Behalf, and folder delegations (Calendar, Inbox).

Clearview is designed to monitor multiple customer tenants from a single instance.

Runtime Architecture

  • Single clearview application container (no separate API container).
  • postgres service for persistent job and result storage.
  • adminer service for direct database inspection.

All services are defined in stack/docker-compose.yml for Portainer deployment.

Application Layout

  • containers/clearview/site/
    • Frontend UI: vanilla HTML/JS/CSS with a fixed sidebar and hash-based routing.
    • Routes: #/dashboard, #/jobs, #/scan/sharepoint, #/scan/mailbox, #/tenants, #/settings.
  • containers/clearview/src/clearview_app/
    • FastAPI backend
    • SQLAlchemy models
    • CSV parser (SharePoint URLs and mailbox UPNs)
    • Default-site filtering (SharePoint only)
    • Background worker for long-running scans
  • containers/clearview/src/clearview_app/scanners/
    • common.py — AuthConfig, DeviationRecord, ScanResult, ProbeResult, shared helpers.
    • sharepoint.py — SharePoint REST scanner, MSAL token cache, hierarchical dedup, SharingLinks helpers.
    • mailbox.py — Exchange Online scanner; spawns pwsh with the EXO scripts.
    • exo_scripts/ — PowerShell scripts (probe.ps1, get-permissions.ps1).
    • Dispatcher: scanners.scan(scan_type, target, auth, progress) and scanners.probe(scan_type, target, auth).
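
A dispatcher with this signature might be organised as a lookup table keyed by scan type. The sketch below is illustrative only: the stub scanner functions, the reduced AuthConfig and ScanResult fields, and the ValueError on an unknown type are assumptions, not the actual contents of the scanners package.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class AuthConfig:
    # Reduced, hypothetical field set; the real AuthConfig lives in common.py.
    tenant_id: str
    client_id: str
    client_secret: Optional[str] = None


@dataclass
class ScanResult:
    # Hypothetical shape: the scanned target plus collected deviations.
    target: str
    deviations: List[dict] = field(default_factory=list)


ProgressCallback = Callable[[str], None]
ScanFunc = Callable[[str, AuthConfig, ProgressCallback], ScanResult]


def _scan_sharepoint(target: str, auth: AuthConfig, progress: ProgressCallback) -> ScanResult:
    # Stub standing in for the SharePoint REST scanner.
    progress(f"scanning site {target}")
    return ScanResult(target=target)


def _scan_mailbox(target: str, auth: AuthConfig, progress: ProgressCallback) -> ScanResult:
    # Stub standing in for the Exchange Online scanner.
    progress(f"scanning mailbox {target}")
    return ScanResult(target=target)


_SCANNERS: Dict[str, ScanFunc] = {
    "sharepoint": _scan_sharepoint,
    "mailbox": _scan_mailbox,
}


def scan(scan_type: str, target: str, auth: AuthConfig, progress: ProgressCallback) -> ScanResult:
    """Route a target to the scanner registered for scan_type."""
    try:
        scanner = _SCANNERS[scan_type]
    except KeyError:
        raise ValueError(f"unknown scan_type: {scan_type!r}") from None
    return scanner(target, auth, progress)
```

A probe dispatcher could follow the same pattern with a second lookup table.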

Multi-Tenant Model

Clearview uses Tenant Profiles to manage multiple customer tenants from one instance.

Tenant Profiles

A tenant profile stores the Azure app credentials for one customer tenant:

| Field | Description |
| --- | --- |
| name | Label for internal reference (e.g. "Contoso") |
| tenant_id | Azure Directory (tenant) ID |
| client_id | Azure App (client) ID |
| client_secret | Azure App client secret |

Profiles are managed via the Tenants panel in the UI or directly via the API. When starting a scan, you select a profile from the dropdown — no manual credential entry needed.

Ad-hoc scans without a saved profile are still supported via Manual credentials in the scan form.

API Endpoints — Tenants

GET    /api/tenants                              List all profiles (client_secret not returned)
POST   /api/tenants                              Create a new profile
DELETE /api/tenants/{id}                         Delete a profile (jobs are retained, tenant link is cleared)
POST   /api/tenants/{id}/generate-certificate    Generate a self-signed certificate for this tenant
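
Since the list endpoint does not return client_secret, the response has to be built from a filtered view of the stored profile. A minimal sketch of such a filter follows; the function name public_view and the dictionary representation are assumptions made for illustration, and cert_private_key is included because this document states it is never exposed via the API.

```python
from typing import Any, Dict

# Fields that should never leave the backend.
SECRET_FIELDS = frozenset({"client_secret", "cert_private_key"})


def public_view(profile: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the profile without secret material."""
    return {key: value for key, value in profile.items() if key not in SECRET_FIELDS}
```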

Certificate Authentication

Clearview supports app-only authentication via a self-signed certificate (recommended) or a client secret.

Generating a certificate:

  1. Click Certificate in the Tenants table.
  2. Clearview generates a self-signed RSA-2048 certificate valid for 2 years.
  3. Download the .cer file and upload it in Azure Portal → App registration → Certificates & secrets → Certificates.
  4. The private key is stored internally; Clearview uses it automatically when starting a scan.

The scanner uses the certificate path when cert_thumbprint is present on the tenant profile; otherwise the client secret is used.
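
The selection rule above can be sketched as a small function. The name select_auth_method and the error raised when neither credential is present are assumptions; only the precedence (certificate first, then client secret) comes from this document.

```python
from typing import Mapping, Optional


def select_auth_method(profile: Mapping[str, Optional[str]]) -> str:
    """Pick the credential type for a scan: certificate wins over secret."""
    if profile.get("cert_thumbprint"):
        return "certificate"
    if profile.get("client_secret"):
        return "client_secret"
    # Assumption: a profile with neither credential is treated as an error.
    raise ValueError("tenant profile has neither a certificate nor a client secret")
```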

TenantProfile authentication fields:

| Field | Description |
| --- | --- |
| client_secret | Azure client secret (optional when a certificate is available) |
| cert_private_key | PEM-encoded private key (internal, never exposed via API) |
| cert_public_pem | PEM-encoded public certificate (used to build a PFX for Exchange Online PowerShell) |
| cert_thumbprint | SHA-1 thumbprint (used by MSAL) |
| cert_expires_at | Certificate expiry date |
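
A certificate's SHA-1 thumbprint is the SHA-1 digest of its DER encoding, so it can be derived from cert_public_pem with the standard library alone. The sketch below shows one way to compute it; the function name and the uppercase hex formatting are assumptions, and Clearview may well compute the value differently.

```python
import hashlib
import ssl


def sha1_thumbprint(cert_public_pem: str) -> str:
    """Return the SHA-1 thumbprint of a PEM certificate as uppercase hex."""
    # PEM is base64-wrapped DER; the thumbprint is taken over the DER bytes.
    der = ssl.PEM_cert_to_DER_cert(cert_public_pem)
    return hashlib.sha1(der).hexdigest().upper()
```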

Scan Processing Model

Scans run asynchronously through a DB-backed job queue:

  1. User selects a tenant profile (or enters manual credentials) and submits URLs or a CSV.
  2. API validates and normalizes URLs.
  3. Default sites are skipped by rule (tenant root and app catalog).
  4. A scan job is queued in PostgreSQL, linked to the tenant profile when applicable.
  5. Background worker processes targets with retries and per-target timeout.
  6. API/UI expose progress and deviations per job.
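
Steps 2 and 3 could look roughly like the following. This is a sketch under stated assumptions: the exact normalisation rules are not documented here, and the app catalog path /sites/appcatalog is a common default rather than a value confirmed by this document.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumption: the app catalog lives at the conventional default path.
APP_CATALOG_PATH = "/sites/appcatalog"


def normalize_site_url(raw: str) -> Optional[str]:
    """Return a canonical https URL, or None if the input is not usable."""
    parts = urlsplit(raw.strip())
    if parts.scheme != "https" or not parts.hostname:
        return None
    # hostname is already lowercased by urlsplit; drop any trailing slash.
    return f"https://{parts.hostname}{parts.path.rstrip('/')}"


def is_default_site(url: str) -> bool:
    """True for the tenant root and the app catalog, which are skipped."""
    path = urlsplit(url).path.rstrip("/").lower()
    return path == "" or path == APP_CATALOG_PATH
```

Comparing paths case-insensitively is a design choice in this sketch, since SharePoint site paths are not case-sensitive.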

Connection Preflight

Before the full scan of a target runs, the worker performs a lightweight probe to verify that the configured credentials can actually reach the site and read role assignments. This catches the common setup errors (missing admin consent, certificate not yet uploaded to Azure, wrong tenant/client ID) early and with a clear message, instead of producing a silent 401 during the full scan.

The probe issues two calls:

  1. GET /_api/web?$select=Title — validates token + tenant + site URL.
  2. GET /_api/web/roleassignments?$top=1&$select=PrincipalId — validates that the app actually has permission to read role assignments (not only basic read).

The result is persisted per target in last_probe_at, last_probe_ok, and last_probe_message. If the probe fails, the target is marked failed with error_message = "Preflight: <hint>" and the full scan is skipped. Hints interpret common HTTP codes:

| Code | Hint |
| --- | --- |
| 401 on /_api/web | Certificate not uploaded in Azure, or wrong tenant/client ID |
| 401 on /roleassignments | Admin consent missing, or granted permission too low |
| 403 | App has no access to this site (e.g. Sites.Selected without a per-site grant) |
| 404 | Site not found |
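
The hint table translates directly into a lookup on status code and request path. The following sketch is illustrative; the function names and the fallback text for unlisted status codes are assumptions.

```python
def preflight_hint(status: int, path: str) -> str:
    """Map a failed probe call to the hint shown to the user."""
    # Ignore the query string when deciding which probe call failed.
    bare_path = path.split("?", 1)[0].rstrip("/").lower()
    on_role_assignments = bare_path.endswith("/roleassignments")
    if status == 401:
        if on_role_assignments:
            return "Admin consent missing, or granted permission too low"
        return "Certificate not uploaded in Azure, or wrong tenant/client ID"
    if status == 403:
        return "App has no access to this site (e.g. Sites.Selected without a per-site grant)"
    if status == 404:
        return "Site not found"
    # Assumption: codes outside the table get a generic message.
    return f"Unexpected HTTP {status}"


def preflight_error_message(status: int, path: str) -> str:
    """Build the error_message stored on a target whose probe failed."""
    return f"Preflight: {preflight_hint(status, path)}"
```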

The same probe is exposed as an on-demand Test connection action on each target in the Job Details UI (see API Endpoints below). The action is blocked while the job is still queued or running.

Timeout and Retry Controls

Configured through environment variables (defaults shown):

| Variable | Default | Description |
| --- | --- | --- |
| SCAN_TARGET_TIMEOUT_SEC | 3600 | Max seconds per target before it is marked failed |
| SCAN_TARGET_MAX_RETRIES | 2 | Number of retries on transient failure |
| SCAN_RETRY_BASE_DELAY_SEC | 2 | Base delay for exponential back-off between retries |
| SCAN_JOB_POLL_INTERVAL_SEC | 3 | How often the worker polls for new queued jobs |
| SCAN_HTTP_TIMEOUT_SEC | 30 | Per-request HTTP timeout toward SharePoint |
| SCAN_HTTP_MAX_RETRIES | 3 | Retries on HTTP 429/503 or connection errors |
| SCAN_LIST_PAGE_SIZE | 200 | Items per page when listing library contents |
| SCAN_MAX_ITEMS_PER_LIST | 10000 | Cap on items with unique permissions per library |
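
To illustrate how the retry settings interact, the sketch below reads an integer setting with a default and computes the back-off schedule. It assumes the delay doubles with each retry (base × 2^attempt); the document only says the back-off is exponential, so the exact formula and both helper names are assumptions.

```python
import os
from typing import List, Mapping


def env_int(name: str, default: int, env: Mapping[str, str] = os.environ) -> int:
    """Read an integer setting, falling back to the default when unset or blank."""
    raw = env.get(name, "")
    return int(raw) if raw.strip() else default


def retry_delays(base_delay_sec: int, max_retries: int) -> List[int]:
    """Delay before each retry, assuming the delay doubles every attempt."""
    return [base_delay_sec * 2 ** attempt for attempt in range(max_retries)]
```

Under this assumption, the defaults (base delay 2, two retries) yield waits of 2 and 4 seconds.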

Deviation Detection

The scanner retrieves SharePoint REST role assignments at four levels:

  • Site root
  • Document library
  • Folder
  • File

Only permissions added relative to the site root are stored as deviations (delta_type=added). No filesystem/NTFS permission model is used.

Hierarchical Deduplication

After all deviations for a target are collected they are post-processed: if a (principal, role) deviation is already reported at a parent URL (library or folder), the same deviation on child items is suppressed. This prevents an explosion of results when a single folder grant propagates to thousands of files.

Deduplication is pure in-memory post-processing — no additional API calls are made.
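
One way to implement this suppression is to process deviations from the shallowest URL to the deepest and drop any entry whose (principal, role) pair is already reported on an ancestor path. The sketch below assumes deviations are (object_url, principal, role_name) tuples; the real DeviationRecord has more fields, and the actual algorithm may differ.

```python
from typing import Dict, Iterable, List, Tuple

# Simplified record for illustration: (object_url, principal, role_name).
Deviation = Tuple[str, str, str]


def _depth(url: str) -> int:
    """Number of path separators, used to order parents before children."""
    return url.rstrip("/").count("/")


def dedupe_hierarchical(deviations: Iterable[Deviation]) -> List[Deviation]:
    """Drop deviations already reported on a parent library or folder."""
    kept: List[Deviation] = []
    reported: Dict[Tuple[str, str], List[str]] = {}
    for url, principal, role in sorted(deviations, key=lambda d: _depth(d[0])):
        parents = reported.setdefault((principal, role), [])
        # The trailing "/" keeps /Docs/Payroll from matching /Docs/PayrollArchive.
        if any(url.startswith(parent.rstrip("/") + "/") for parent in parents):
            continue
        parents.append(url)
        kept.append((url, principal, role))
    return kept
```

The result is ordered by depth rather than by input order, which is usually acceptable because results are sorted again for display and export.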

Role Name Normalisation

SharePoint returns role names in the language configured for the tenant. Clearview normalises common Dutch role names to their English equivalents before storing them:

| Dutch | English |
| --- | --- |
| Volledig beheer | Full Control |
| Bijdragen | Contribute |
| Lezen | Read |
| Bewerken | Edit |
| Ontwerpen | Design |
| Beperkte toegang | Limited Access |
| Goedkeuren | Approve |
| Hiërarchieën beheren | Manage Hierarchy |
| Weergeven alleen | View Only |
| Beperkt lezen | Restricted Read |

Unknown role names are stored as-is.
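
The mapping amounts to a dictionary lookup with a pass-through default. A minimal sketch follows; the names ROLE_NAME_MAP and normalise_role are hypothetical, and exact-match lookup (no trimming or case folding) is an assumption.

```python
# Dutch role names and their English equivalents, as listed above.
ROLE_NAME_MAP = {
    "Volledig beheer": "Full Control",
    "Bijdragen": "Contribute",
    "Lezen": "Read",
    "Bewerken": "Edit",
    "Ontwerpen": "Design",
    "Beperkte toegang": "Limited Access",
    "Goedkeuren": "Approve",
    "Hiërarchieën beheren": "Manage Hierarchy",
    "Weergeven alleen": "View Only",
    "Beperkt lezen": "Restricted Read",
}


def normalise_role(role_name: str) -> str:
    """Translate a known Dutch role name; return anything else unchanged."""
    return ROLE_NAME_MAP.get(role_name, role_name)
```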

SharingLinks

SharePoint creates internal groups named SharingLinks.{guid}.{LinkType}.{guid} whenever a user shares a file or folder via a sharing link. Clearview detects these and classifies them by risk:

| Link type | Risk | UI colour |
| --- | --- | --- |
| Anonymous* | Critical | Red |
| Flexible | High | Orange |
| Organization* | Low | Blue |
| Direct* | Low | Green |
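
Detection and classification can be sketched with a regular expression over the group name followed by a prefix match on the link type. The regular expression, the function name, and the "Unknown" fallback for unlisted link types are assumptions; the risk levels come from the table above.

```python
import re
from typing import Optional, Tuple

# SharingLinks.{guid}.{LinkType}.{guid}
_SHARING_LINK_RE = re.compile(
    r"^SharingLinks\.[0-9a-fA-F-]{36}\.(?P<link_type>[A-Za-z]+)\.[0-9a-fA-F-]{36}$"
)

# Checked in order; the table's trailing * is treated as a prefix match.
_RISK_BY_PREFIX = (
    ("Anonymous", "Critical"),
    ("Flexible", "High"),
    ("Organization", "Low"),
    ("Direct", "Low"),
)


def classify_sharing_link(principal: str) -> Optional[Tuple[str, str]]:
    """Return (link_type, risk) for a SharingLinks group, else None."""
    match = _SHARING_LINK_RE.match(principal)
    if match is None:
        return None
    link_type = match.group("link_type")
    for prefix, risk in _RISK_BY_PREFIX:
        if link_type.startswith(prefix):
            return link_type, risk
    # Assumption: link types outside the table are flagged rather than dropped.
    return link_type, "Unknown"
```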

Resolve Sharing Links — after a scan completes, the Job Details panel shows a Resolve Sharing Links section listing all SharingLinks types found in the job. The user selects which types to resolve and clicks Resolve. Clearview calls /_api/web/sitegroups/getbyname('{name}')/users for each unique group using the job's stored credentials and writes the member list to permission_deviations.resolved_members. Anonymous links have no resolvable members; their resolved_members field is stored as an empty string, displayed as (public link) in the UI.

Anonymous and Flexible types are pre-selected by default. Organization and Direct types are available but unchecked by default.

API Endpoints — Scan Jobs

GET    /api/scan-jobs                                   List jobs (optional ?tenant_profile_id=)
POST   /api/scan-jobs                                   Create job from URLs
POST   /api/scan-jobs/import-csv                        Create job from CSV upload
GET    /api/scan-jobs/{id}                              Get job detail (targets + deviations)
POST   /api/scan-jobs/{id}/cancel                       Cancel a queued or running job
DELETE /api/scan-jobs/{id}                              Delete a completed job and all its data
POST   /api/scan-jobs/{id}/resolve-sharing-links        Resolve SharingLinks group members post-scan
POST   /api/scan-jobs/{id}/targets/{tid}/test-connection  Re-run the connection preflight for one target
GET    /api/scan-jobs/{id}/export                       Download deviations as .xlsx (optional ?site_url=)

Job Details UI

The Selected Job Details panel provides:

  • Site filter — dropdown populated from the job's targets; filters both the Targets and Deviations tables client-side without a new API call.
  • Export Excel — downloads a .xlsx with two sheets:
    • Targets: URL, status, attempts, error, timestamps
    • Deviations: Site URL, Object URL (relative to site), Object Type, Principal, Link Risk (colour-coded), Resolved Members, Role, Delta — sorted by Site URL → Object URL → Principal
  • Resolve Sharing Links — see SharingLinks section above.

CSV Import

Expected input is Microsoft Sites export format.

  • URL column is auto-detected (URL / Site URL / SiteUrl).
  • UTF-8 BOM is supported.
  • Duplicate URLs are de-duplicated.
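
The import rules above can be sketched with the standard csv module. The function name is hypothetical, and two details are assumptions: header matching is done case-insensitively, and duplicates are detected on the trimmed URL string exactly as written.

```python
import csv
import io
from typing import List

# Accepted header names, compared in lowercase.
URL_COLUMNS = ("url", "site url", "siteurl")


def extract_site_urls(data: bytes) -> List[str]:
    """Parse an uploaded CSV and return unique site URLs in file order."""
    # utf-8-sig strips a leading byte-order mark when one is present.
    text = data.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text, newline=""))
    column = next(
        (name for name in (reader.fieldnames or []) if name.strip().lower() in URL_COLUMNS),
        None,
    )
    if column is None:
        raise ValueError("no URL column found (expected URL, Site URL or SiteUrl)")
    seen = set()
    urls: List[str] = []
    for row in reader:
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```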

Data Model

Main tables:

| Table | Key columns |
| --- | --- |
| tenant_profiles | credentials, cert_private_key, cert_public_pem, cert_thumbprint, cert_expires_at |
| scan_jobs | status, scan_type (sharepoint/mailbox), tenant_profile_id, progress counters, auth credentials |
| scan_targets | job_id, site_url (holds UPN for mailbox jobs), status, attempts, error_message, last_probe_at, last_probe_ok, last_probe_message |
| permission_deviations | job_id, site_url, object_url, object_type, principal, role_name, delta_type, permission_type, resolved_members |

Scan jobs, targets, and deviations are cascade-deleted when a job is removed via DELETE /api/scan-jobs/{id}. Jobs with status queued or running cannot be deleted.

Schema migrations for new columns are applied automatically on startup via _ensure_schema_columns() in main.py.
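
The general idea behind an additive startup migration is to compare the columns a table has with the columns the code expects, and add whatever is missing. The sketch below demonstrates the pattern against SQLite from the standard library purely so that it runs without dependencies; Clearview uses PostgreSQL with SQLAlchemy, so the real _ensure_schema_columns() will use different introspection queries and column types. The column list here is an illustrative subset.

```python
import sqlite3
from typing import Dict, List

# Illustrative subset of expected columns, with SQLite types as stand-ins.
EXPECTED_COLUMNS: Dict[str, Dict[str, str]] = {
    "scan_targets": {
        "last_probe_at": "TEXT",
        "last_probe_ok": "INTEGER",
        "last_probe_message": "TEXT",
    },
}


def ensure_schema_columns(
    conn: sqlite3.Connection,
    expected: Dict[str, Dict[str, str]] = EXPECTED_COLUMNS,
) -> List[str]:
    """Add any missing columns and return the names of those added."""
    added: List[str] = []
    for table, columns in expected.items():
        # PRAGMA table_info yields one row per column; index 1 is its name.
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        for name, sql_type in columns.items():
            if name not in existing:
                conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
                added.append(f"{table}.{name}")
    return added
```

Because the function only adds columns that are absent, running it on every startup is safe: a second call finds nothing to do.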

Mailbox Scanning

Mailbox scans use Exchange Online PowerShell with certificate-based app-only auth.

What is collected

| Permission | PowerShell source | permission_type value |
| --- | --- | --- |
| Full Access (and other mailbox-level rights) | Get-MailboxPermission | FullAccess |
| Send As | Get-RecipientPermission (AccessControlType=Allow) | SendAs |
| Send on Behalf | mailbox property GrantSendOnBehalfTo | SendOnBehalf |
| Folder delegation — Calendar | Get-MailboxFolderPermission "<upn>:\Calendar" | Folder:Calendar |
| Folder delegation — Inbox | Get-MailboxFolderPermission "<upn>:\Inbox" | Folder:Inbox |

The scanner filters out NT AUTHORITY\SELF, S-1-5-* SIDs, inherited mailbox permissions, and the default folder principals (Default, Anonymous with None rights). What remains is stored as deviations on the job — there is no SharePoint-style root baseline; every non-default principal counts.
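
The filter rules can be expressed as a single predicate over each permission entry. In the sketch below the entry keys (principal, is_inherited, permission_type, rights) are hypothetical, and the folder rule is read as "Default or Anonymous, with None rights"; the wording above leaves room for other readings.

```python
from typing import Dict, List

DEFAULT_FOLDER_PRINCIPALS = ("Default", "Anonymous")


def is_default_entry(entry: Dict[str, object]) -> bool:
    """True when an entry matches one of the documented exclusions."""
    principal = str(entry.get("principal", ""))
    upper = principal.upper()
    if upper == "NT AUTHORITY\\SELF" or upper.startswith("S-1-5-"):
        return True
    if entry.get("is_inherited"):
        return True
    is_folder = str(entry.get("permission_type", "")).startswith("Folder:")
    if is_folder and principal in DEFAULT_FOLDER_PRINCIPALS and entry.get("rights") == "None":
        return True
    return False


def filter_deviations(entries: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep only the entries that count as deviations."""
    return [entry for entry in entries if not is_default_entry(entry)]
```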

Authentication

Mailbox scanning uses the same tenant certificate as SharePoint, but Exchange Online requires a .pfx rather than a thumbprint + raw private key. At scan time Clearview builds an in-memory PFX from cert_private_key + cert_public_pem (random password), writes it to a tempdir, and removes it immediately after the pwsh process exits.
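
The temporary-file lifecycle described above fits naturally into a context manager. The sketch below covers only that lifecycle and the random password; building the PFX bytes themselves needs a cryptography library and is left out. The helper names, the file name scan.pfx, and the restrictive file mode are assumptions.

```python
import os
import secrets
import tempfile
from contextlib import contextmanager
from typing import Iterator


def random_pfx_password() -> str:
    """Generate a throwaway password for a single scan."""
    return secrets.token_urlsafe(32)


@contextmanager
def temporary_pfx(pfx_bytes: bytes) -> Iterator[str]:
    """Write the PFX to a private temp dir and remove it on exit."""
    with tempfile.TemporaryDirectory(prefix="clearview-pfx-") as tmpdir:
        path = os.path.join(tmpdir, "scan.pfx")
        # Create the file readable and writable by the owner only.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        with os.fdopen(fd, "wb") as handle:
            handle.write(pfx_bytes)
        # The caller runs pwsh while the context is open.
        yield path
```

Wrapping the pwsh invocation in the with block means the directory is removed even if the subprocess raises.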

Targets

Three ways to seed a mailbox scan job:

  1. Manual UPNs — paste one UPN per line.
  2. CSV import — column UserPrincipalName / Email / Mailbox / Primary SMTP Address (auto-detected, case-insensitive).
  3. All mailboxes in tenant — Clearview enumerates every mailbox via Get-EXOMailbox -ResultSize Unlimited and queues one target per mailbox. Requires the tenant's primary domain (e.g. contoso.onmicrosoft.com) so Connect-ExchangeOnline -Organization can authenticate. Capped at 50000 mailboxes per job.

Required Azure permissions

In addition to the SharePoint setup the scan app needs:

  • API permission: Office 365 Exchange Online → Application permissions → Exchange.ManageAsApp (admin-consented).
  • Entra role assigned to the app's service principal: Exchange Administrator (cannot be granted via Microsoft Graph; must be assigned in Azure Portal → Entra ID → Roles and administrators).

Runtime requirements

The container image installs:

  • PowerShell 7 (pwsh) from the official Microsoft package repo.
  • ExchangeOnlineManagement module from PSGallery (Install-Module -Scope AllUsers).

These add roughly 150 MB to the image. Without them, mailbox probes return the message "pwsh not available in runtime" and scans fail.

Probe

Mailbox preflight runs probe.ps1 which connects to Exchange Online and calls Get-EXOMailbox -Identity <upn> -PropertySets Minimum. Failure hints map common errors:

| Error fragment | Hint |
| --- | --- |
| Unauthorized / 401 / AADSTS* | Check Exchange.ManageAsApp permission, admin consent, and the Exchange Administrator role assignment |
| Couldn't find object / not found | Mailbox does not exist in this tenant |
| module not available | ExchangeOnlineManagement PS module missing in the container |
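
Matching on error fragments can be sketched as a case-insensitive substring search over the text returned by the PowerShell process. The function name, the case-insensitive comparison, and returning None when nothing matches are assumptions.

```python
from typing import Optional, Tuple

# (fragments, hint) pairs checked in order; fragments are lowercase.
_MAILBOX_HINTS: Tuple[Tuple[Tuple[str, ...], str], ...] = (
    (
        ("unauthorized", "401", "aadsts"),
        "Check Exchange.ManageAsApp permission, admin consent, "
        "and the Exchange Administrator role assignment",
    ),
    (("couldn't find object", "not found"), "Mailbox does not exist in this tenant"),
    (("module not available",), "ExchangeOnlineManagement PS module missing in the container"),
)


def mailbox_probe_hint(error_text: str) -> Optional[str]:
    """Return the first hint whose fragment occurs in the error text."""
    lowered = error_text.lower()
    for fragments, hint in _MAILBOX_HINTS:
        if any(fragment in lowered for fragment in fragments):
            return hint
    return None
```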

Build and Release

Run ./build-and-push.sh from the repo root. It is sourced from the shared script in /docker/develop/shared-integrations/tooling/docker-build-and-push/.

  • ./build-and-push.sh t — test build, push :dev tag only.
  • ./build-and-push.sh r — release build, parses the version from docs/changelog.md (first ## vX.Y.Z heading), pushes :<version>, :dev, and :latest.
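
The build script itself is a shell script; the version lookup it performs is shown here in Python only to make the rule concrete. The regular expression and function names are assumptions, as is the choice to drop the leading v from the tag, which this document does not specify.

```python
import re
from typing import List, Optional

# First level-2 heading of the form "## vX.Y.Z".
_VERSION_HEADING = re.compile(r"^##\s+v(\d+\.\d+\.\d+)\b", re.MULTILINE)


def release_version(changelog_text: str) -> Optional[str]:
    """Return X.Y.Z from the first matching heading, or None."""
    match = _VERSION_HEADING.search(changelog_text)
    return match.group(1) if match else None


def release_tags(version: str) -> List[str]:
    """Image tags pushed by a release build."""
    return [version, "dev", "latest"]
```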

The script performs no git operations. After a successful release, run the git commit / git tag / git push --tags commands the script prints in its summary.

Current Scan Mode

SHAREPOINT_SCAN_MODE=sharepoint_app_only is active by default.

Azure app-only credentials are resolved per scan job from the linked tenant profile, or from the raw credentials submitted with the job when no profile is used.

Entra App Registration — two modes

The UI automatically detects which mode is active via GET /api/onboarding/status. The onboarding flow is accessed from the Add Tenant form in the Tenants panel.

Mode A — Automated (platform app configured)

Requires a pre-registered Clearview platform app in Azure AD with permission to create apps in customer tenants (Application.ReadWrite.All on Microsoft Graph).

Set the following in stack/.env:

ONBOARDING_CLIENT_ID=<platform-app-client-id>
ONBOARDING_CLIENT_SECRET=<platform-app-client-secret>
ONBOARDING_REDIRECT_URI=https://<your-clearview-domain>/api/onboarding/microsoft/callback

Flow per customer tenant:

  1. Click Add Tenant in the UI and then Connect Microsoft.
  2. Approve admin consent in the customer's Microsoft tenant.
  3. UI receives tenant context from the OAuth callback and pre-fills the tenant ID.
  4. Click Create Scan App Automatically to create a tenant-local scan app via Graph API.
  5. Clearview assigns SharePoint Sites.FullControl.All and generates a client secret.
  6. Enter a name and click Save Tenant to store the profile.

Mode B — Manual (no platform app configured)

When ONBOARDING_* env vars are empty the UI shows step-by-step instructions to create the scan app manually per customer tenant:

  1. Azure Portal → Entra ID → App registrations → New registration (Single tenant).
  2. Copy Directory (tenant) ID and Application (client) ID from the Overview page.
  3. API permissions → Add → SharePoint → Application permissions → Sites.FullControl.All → Grant admin consent.
  4. Certificates & secrets → New client secret → copy the value (shown once).
  5. Enter the details in Add Tenant and click Save Tenant.
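
The mode detection described in this section could come down to a check on the ONBOARDING_* variables. The sketch below assumes all three variables must be non-empty for Mode A and uses hypothetical return values; the actual response shape of GET /api/onboarding/status is not documented here.

```python
from typing import Mapping

ONBOARDING_VARS = (
    "ONBOARDING_CLIENT_ID",
    "ONBOARDING_CLIENT_SECRET",
    "ONBOARDING_REDIRECT_URI",
)


def onboarding_mode(env: Mapping[str, str]) -> str:
    """Return "automated" (Mode A) or "manual" (Mode B)."""
    # Assumption: a partially configured platform app falls back to manual.
    if all(env.get(name, "").strip() for name in ONBOARDING_VARS):
        return "automated"
    return "manual"
```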