clearview/docs/TECHNICAL.md
Ivo Oskamp b8446c0665 Initial commit — Clearview v0.1.0
Full application including FastAPI backend, PostgreSQL data model,
background scan worker, multi-tenant support, certificate authentication,
SharePoint REST scanner with hierarchical deduplication, SharingLinks
classification and post-scan resolve, Excel export, site filter in job
details, role name normalisation, and updated documentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:50:41 +02:00

# TECHNICAL
## Scope
Clearview scans SharePoint sites for permissions that deviate from the site root permission baseline.
It is designed to monitor multiple customer tenants from a single instance.
## Runtime Architecture
- Single `clearview` application container (no separate API container).
- `postgres` service for persistent job and result storage.
- `adminer` service for direct database inspection.

All services are defined in `stack/docker-compose.yml` for Portainer deployment.
## Application Layout
- `containers/clearview/site/`
  - Frontend UI (tenant management, manual URL input, CSV import, jobs, deviations)
- `containers/clearview/src/clearview_app/`
  - FastAPI backend
  - SQLAlchemy models
  - CSV parser
  - Default-site filtering
  - Background worker for long-running scans
## Multi-Tenant Model
Clearview uses **Tenant Profiles** to manage multiple customer tenants from one instance.
### Tenant Profiles
A tenant profile stores the Azure app credentials for one customer tenant:
| Field | Description |
|---|---|
| `name` | Label for internal reference (e.g. "Contoso") |
| `tenant_id` | Azure Directory (tenant) ID |
| `client_id` | Azure App (client) ID |
| `client_secret` | Azure App client secret |

Profiles are managed via the **Tenants** panel in the UI or directly via the API.
When starting a scan, you select a profile from the dropdown — no manual credential entry needed.
Ad-hoc scans without a saved profile are still supported via **Manual credentials** in the scan form.
### API Endpoints — Tenants
```
GET    /api/tenants                            List all profiles (client_secret not returned)
POST   /api/tenants                            Create a new profile
DELETE /api/tenants/{id}                       Delete a profile (jobs are retained, tenant link is cleared)
POST   /api/tenants/{id}/generate-certificate  Generate a self-signed certificate for this tenant
```
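
As an illustration of the `POST /api/tenants` endpoint, the sketch below builds (but does not send) a request that creates a profile. The JSON body shape is an assumption derived from the Tenant Profiles field table, and the base URL and the helper name `build_create_tenant_request` are hypothetical; check the FastAPI schema for the actual request model.

```python
import json
from urllib.request import Request


def build_create_tenant_request(base_url, name, tenant_id, client_id, client_secret):
    """Build a POST /api/tenants request (assumed JSON body, not confirmed by the source)."""
    payload = {
        "name": name,
        "tenant_id": tenant_id,
        "client_id": client_id,
        "client_secret": client_secret,
    }
    return Request(
        url=f"{base_url.rstrip('/')}/api/tenants",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host; pass the request to urllib.request.urlopen() to send it.
request = build_create_tenant_request(
    "https://clearview.example.com", "Contoso", "<tenant-id>", "<client-id>", "<secret>"
)
```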
### Certificate Authentication
Clearview supports app-only authentication via a self-signed certificate (recommended) or a client secret.
**Generating a certificate:**
1. Click **Certificate** in the Tenants table.
2. Clearview generates a self-signed RSA-2048 certificate valid for 2 years.
3. Download the `.cer` file and upload it in Azure Portal → App registration → Certificates & secrets → Certificates.
4. The private key is stored internally; Clearview uses it automatically when starting a scan.

The scanner authenticates with the certificate when `cert_thumbprint` is set on the tenant profile; otherwise it falls back to the client secret.

`TenantProfile` authentication fields:
| Field | Description |
|---|---|
| `client_secret` | Azure client secret (optional when a certificate is available) |
| `cert_private_key` | PEM-encoded private key (internal, never exposed via API) |
| `cert_thumbprint` | SHA-1 thumbprint (used by MSAL) |
| `cert_expires_at` | Certificate expiry date |
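
The credential selection described in this section could look roughly like the sketch below. The `TenantProfile` dataclass here is a simplified stand-in for the real SQLAlchemy model, and `build_client_credential` is a hypothetical helper name; only the rule "certificate when `cert_thumbprint` is present, otherwise client secret" comes from this document.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class TenantProfile:
    """Simplified stand-in for the real model; only auth-related fields are shown."""
    tenant_id: str
    client_id: str
    client_secret: Optional[str] = None
    cert_private_key: Optional[str] = None
    cert_thumbprint: Optional[str] = None


def build_client_credential(profile: TenantProfile) -> Union[str, dict]:
    """Pick the certificate when a thumbprint is present, else the client secret."""
    if profile.cert_thumbprint:
        if not profile.cert_private_key:
            raise ValueError("cert_thumbprint is set but cert_private_key is missing")
        return {
            "private_key": profile.cert_private_key,
            "thumbprint": profile.cert_thumbprint,
        }
    if profile.client_secret:
        return profile.client_secret
    raise ValueError("profile has neither a certificate nor a client secret")
```

The return value has the two shapes that MSAL for Python accepts as the `client_credential` argument of `ConfidentialClientApplication`: a plain secret string, or a dict with `private_key` and `thumbprint`.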
## Scan Processing Model
Scans run asynchronously through a DB-backed job queue:
1. User selects a tenant profile (or enters manual credentials) and submits URLs or a CSV.
2. API validates and normalizes URLs.
3. Default sites are skipped by rule (tenant root and app catalog).
4. A scan job is queued in PostgreSQL, linked to the tenant profile when applicable.
5. Background worker processes targets with retries and per-target timeout.
6. API/UI expose progress and deviations per job.
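
Steps 2 and 3 above can be sketched as follows. The exact normalisation rules and the app catalog path are not specified in this document, so the choices here (lower-casing the host, dropping trailing slashes, treating an empty path as the tenant root and `/sites/appcatalog` as the app catalog) are assumptions, as are all function names.

```python
from urllib.parse import urlsplit


def normalize_site_url(raw: str) -> str:
    """Lower-case the host and drop trailing slashes, query, and fragment (assumed rules)."""
    parts = urlsplit(raw.strip())
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"not an https site URL: {raw!r}")
    return f"https://{parts.netloc.lower()}{parts.path.rstrip('/')}"


def is_default_site(url: str) -> bool:
    """Tenant root has an empty path; the app catalog path is an assumed default."""
    path = urlsplit(url).path.rstrip("/").lower()
    return path in ("", "/sites/appcatalog")


def select_targets(raw_urls):
    """Normalise every URL and keep only those that are not default sites."""
    normalized = (normalize_site_url(raw) for raw in raw_urls)
    return [url for url in normalized if not is_default_site(url)]
```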
### Timeout and Retry Controls
Configured through environment variables (defaults shown):
| Variable | Default | Description |
|---|---|---|
| `SCAN_TARGET_TIMEOUT_SEC` | `3600` | Max seconds per target before it is marked failed |
| `SCAN_TARGET_MAX_RETRIES` | `2` | Number of retries on transient failure |
| `SCAN_RETRY_BASE_DELAY_SEC` | `2` | Base delay for exponential back-off between retries |
| `SCAN_JOB_POLL_INTERVAL_SEC` | `3` | How often the worker polls for new queued jobs |
| `SCAN_HTTP_TIMEOUT_SEC` | `30` | Per-request HTTP timeout toward SharePoint |
| `SCAN_HTTP_MAX_RETRIES` | `3` | Retries on HTTP 429/503 or connection errors |
| `SCAN_LIST_PAGE_SIZE` | `200` | Items per page when listing library contents |
| `SCAN_MAX_ITEMS_PER_LIST` | `10000` | Cap on items with unique permissions per library |
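
A minimal sketch of how the per-target settings might be read and applied. The variable names and defaults are taken from the table above; the `ScanSettings` class, the helper names, and the doubling schedule (`base * 2^(n-1)`) are assumptions, since this document only states that the back-off is exponential.

```python
import os
from dataclasses import dataclass


def _int_env(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to the documented default."""
    value = os.environ.get(name, "").strip()
    return int(value) if value else default


@dataclass(frozen=True)
class ScanSettings:
    target_timeout_sec: int
    target_max_retries: int
    retry_base_delay_sec: int

    @classmethod
    def from_env(cls) -> "ScanSettings":
        return cls(
            target_timeout_sec=_int_env("SCAN_TARGET_TIMEOUT_SEC", 3600),
            target_max_retries=_int_env("SCAN_TARGET_MAX_RETRIES", 2),
            retry_base_delay_sec=_int_env("SCAN_RETRY_BASE_DELAY_SEC", 2),
        )


def retry_delay(settings: ScanSettings, retry_number: int) -> int:
    """Delay before the Nth retry (1-based): base, 2*base, 4*base, ... (assumed schedule)."""
    return settings.retry_base_delay_sec * 2 ** (retry_number - 1)
```

With the default base delay of 2 seconds, this schedule waits 2 seconds before the first retry and 4 seconds before the second.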
## Deviation Detection
The scanner retrieves SharePoint REST role assignments at four levels:
- Site root
- Document library
- Folder
- File

Only permissions **added** relative to the site root are stored as deviations (`delta_type=added`).
No filesystem/NTFS permission model is used.
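
The "added only" rule amounts to a set difference against the site root baseline. The sketch below illustrates the idea; `RoleAssignment` and `added_assignments` are hypothetical names, and the real scanner may carry more fields per assignment.

```python
from typing import Iterable, List, NamedTuple


class RoleAssignment(NamedTuple):
    principal: str
    role: str


def added_assignments(
    baseline: Iterable[RoleAssignment], current: Iterable[RoleAssignment]
) -> List[RoleAssignment]:
    """Return assignments present on the object but absent from the site root baseline."""
    baseline_set = set(baseline)
    return sorted(a for a in set(current) if a not in baseline_set)


root = [RoleAssignment("Owners", "Full Control"), RoleAssignment("Visitors", "Read")]
folder = root + [RoleAssignment("alice@contoso.com", "Edit")]
deviations = added_assignments(root, folder)
```

Assignments that exist on the root but are missing on the object are deliberately ignored, matching the statement that only added permissions are stored.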
### Hierarchical Deduplication
After all deviations for a target are collected, they are post-processed: if a `(principal, role)` deviation is already reported at a parent URL (library or folder), the same deviation on child items is suppressed. This prevents an explosion of results when a single folder grant propagates to thousands of files.
Deduplication is pure in-memory post-processing — no additional API calls are made.
### Role Name Normalisation
SharePoint returns role names in the language configured for the tenant. Clearview normalises common Dutch role names to their English equivalents before storing them:
| Dutch | English |
|---|---|
| Volledig beheer | Full Control |
| Bijdragen | Contribute |
| Lezen | Read |
| Bewerken | Edit |
| Ontwerpen | Design |
| Beperkte toegang | Limited Access |
| Goedkeuren | Approve |
| Hiërarchieën beheren | Manage Hierarchy |
| Weergeven alleen | View Only |
| Beperkt lezen | Restricted Read |

Unknown role names are stored as-is.
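
The mapping above translates directly into a lookup table. In this sketch the names `ROLE_NAME_MAP` and `normalize_role_name` are hypothetical, and exact (case-sensitive) matching after trimming whitespace is an assumption.

```python
ROLE_NAME_MAP = {
    "Volledig beheer": "Full Control",
    "Bijdragen": "Contribute",
    "Lezen": "Read",
    "Bewerken": "Edit",
    "Ontwerpen": "Design",
    "Beperkte toegang": "Limited Access",
    "Goedkeuren": "Approve",
    "Hiërarchieën beheren": "Manage Hierarchy",
    "Weergeven alleen": "View Only",
    "Beperkt lezen": "Restricted Read",
}


def normalize_role_name(name: str) -> str:
    """Map a known Dutch role name to English; return unknown names unchanged."""
    return ROLE_NAME_MAP.get(name.strip(), name)
```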
### SharingLinks
SharePoint creates internal groups named `SharingLinks.{guid}.{LinkType}.{guid}` whenever a user shares a file or folder via a sharing link. Clearview detects these and classifies them by risk:
| Link type | Risk | UI colour |
|---|---|---|
| `Anonymous*` | Critical | Red |
| `Flexible` | High | Orange |
| `Organization*` | Low | Blue |
| `Direct*` | Low | Green |

**Resolve Sharing Links**: after a scan completes, the Job Details panel shows a _Resolve Sharing Links_ section listing all SharingLinks types found in the job. The user selects which types to resolve and clicks **Resolve**. For each unique group, Clearview calls `/_api/web/sitegroups/getbyname('{name}')/users` using the job's stored credentials and writes the member list to `permission_deviations.resolved_members`. Anonymous links have no resolvable members; their `resolved_members` field is stored as an empty string and displayed as `(public link)` in the UI.
Anonymous and Flexible types are pre-selected by default. Organization and Direct types are available but unchecked by default.
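
The classification in the risk table can be sketched with a regular expression over the group name. The pattern assumes standard 36-character GUIDs and an alphabetic link type, and the function name and the `"unknown"` fallback are hypothetical; only the name format and the type-to-risk mapping come from this document.

```python
import re
from typing import Optional, Tuple

# SharingLinks.{guid}.{LinkType}.{guid}; GUID format is an assumption.
_SHARING_LINK_RE = re.compile(
    r"^SharingLinks\.[0-9a-fA-F-]{36}\.(?P<link_type>[A-Za-z]+)\.[0-9a-fA-F-]{36}$"
)

# Prefix match covers the wildcard rows; "Flexible" matches itself exactly.
_RISK_BY_PREFIX = (
    ("Anonymous", "critical"),
    ("Flexible", "high"),
    ("Organization", "low"),
    ("Direct", "low"),
)


def classify_sharing_link(principal: str) -> Optional[Tuple[str, str]]:
    """Return (link_type, risk) for a SharingLinks group, or None for other principals."""
    match = _SHARING_LINK_RE.match(principal)
    if not match:
        return None
    link_type = match.group("link_type")
    for prefix, risk in _RISK_BY_PREFIX:
        if link_type.startswith(prefix):
            return link_type, risk
    return link_type, "unknown"
```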
### API Endpoints — Scan Jobs
```
GET    /api/scan-jobs                             List jobs (optional ?tenant_profile_id=)
POST   /api/scan-jobs                             Create job from URLs
POST   /api/scan-jobs/import-csv                  Create job from CSV upload
GET    /api/scan-jobs/{id}                        Get job detail (targets + deviations)
POST   /api/scan-jobs/{id}/cancel                 Cancel a queued or running job
DELETE /api/scan-jobs/{id}                        Delete a completed job and all its data
POST   /api/scan-jobs/{id}/resolve-sharing-links  Resolve SharingLinks group members post-scan
GET    /api/scan-jobs/{id}/export                 Download deviations as .xlsx (optional ?site_url=)
```
## Job Details UI
The **Selected Job Details** panel provides:
- **Site filter** — dropdown populated from the job's targets; filters both the Targets and Deviations tables client-side without a new API call.
- **Export Excel** — downloads a `.xlsx` with two sheets:
  - _Targets_: URL, status, attempts, error, timestamps
  - _Deviations_: Site URL, Object URL (relative to site), Object Type, Principal, Link Risk (colour-coded), Resolved Members, Role, Delta; sorted by Site URL, then Object URL, then Principal
- **Resolve Sharing Links** — see SharingLinks section above.
## CSV Import
The expected input is a CSV file in the Microsoft sites export format.
- URL column is auto-detected (`URL` / `Site URL` / `SiteUrl`).
- UTF-8 BOM is supported.
- Duplicate URLs are de-duplicated.
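
The three rules above can be sketched with the standard `csv` module. The accepted column names come from this document; matching them case-insensitively, comparing URLs without trailing slash or case when de-duplicating, and the function name `extract_site_urls` are assumptions.

```python
import csv
import io
from typing import List

URL_COLUMNS = ("url", "site url", "siteurl")


def extract_site_urls(data: bytes) -> List[str]:
    """Return unique site URLs from a CSV export, in first-seen order."""
    text = data.decode("utf-8-sig")  # utf-8-sig strips a leading BOM if present
    reader = csv.DictReader(io.StringIO(text))
    column = next(
        (name for name in (reader.fieldnames or []) if name.strip().lower() in URL_COLUMNS),
        None,
    )
    if column is None:
        raise ValueError("no URL column found (expected URL, Site URL, or SiteUrl)")
    seen, urls = set(), []
    for row in reader:
        url = (row.get(column) or "").strip()
        key = url.rstrip("/").lower()
        if url and key not in seen:
            seen.add(key)
            urls.append(url)
    return urls
```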
## Data Model
Main tables:
| Table | Key columns |
|---|---|
| `tenant_profiles` | credentials, `cert_private_key`, `cert_thumbprint`, `cert_expires_at` |
| `scan_jobs` | `status`, `tenant_profile_id`, progress counters, auth credentials |
| `scan_targets` | `job_id`, `site_url`, `status`, `attempts`, `error_message` |
| `permission_deviations` | `job_id`, `site_url`, `object_url`, `object_type`, `principal`, `role_name`, `delta_type`, `resolved_members` |

Scan jobs, targets, and deviations are cascade-deleted when a job is removed via `DELETE /api/scan-jobs/{id}`. Jobs with status `queued` or `running` cannot be deleted.
Schema migrations for new columns are applied automatically on startup via `_ensure_schema_columns()` in `main.py`.
## Build and Release
Use `./build-and-push.sh` from repo root.
- `./build-and-push.sh t` for test build (`:dev` tag only)
- `./build-and-push.sh 1` patch release
- `./build-and-push.sh 2` minor release
- `./build-and-push.sh 3` major release
## Current Scan Mode
`SHAREPOINT_SCAN_MODE=sharepoint_app_only` is active by default.
Azure app-only credentials are resolved per scan job from the linked tenant profile,
or from the raw credentials submitted with the job when no profile is used.
### Entra App Registration — two modes
The UI automatically detects which mode is active via `GET /api/onboarding/status`.
The onboarding flow is accessed from the **Add Tenant** form in the Tenants panel.
#### Mode A — Automated (platform app configured)
Requires a pre-registered Clearview platform app in Microsoft Entra ID (formerly Azure AD) with permission to create
apps in customer tenants (`Application.ReadWrite.All` on Microsoft Graph).
Set the following in `stack/.env`:
```
ONBOARDING_CLIENT_ID=<platform-app-client-id>
ONBOARDING_CLIENT_SECRET=<platform-app-client-secret>
ONBOARDING_REDIRECT_URI=https://<your-clearview-domain>/api/onboarding/microsoft/callback
```
Flow per customer tenant:
1. Click **Add Tenant** in the UI and then **Connect Microsoft**.
2. Approve admin consent in the customer's Microsoft tenant.
3. UI receives tenant context from the OAuth callback and pre-fills the tenant ID.
4. Click **Create Scan App Automatically** to create a tenant-local scan app via Graph API.
5. Clearview assigns SharePoint `Sites.FullControl.All` and generates a client secret.
6. Enter a name and click **Save Tenant** to store the profile.
#### Mode B — Manual (no platform app configured)
When the `ONBOARDING_*` environment variables are empty, the UI shows step-by-step instructions for creating
the scan app manually in each customer tenant:
1. Azure Portal → Entra ID → App registrations → New registration (Single tenant).
2. Copy Directory (tenant) ID and Application (client) ID from the Overview page.
3. API permissions → Add → SharePoint → Application permissions → `Sites.FullControl.All` → Grant admin consent.
4. Certificates & secrets → New client secret → copy the value (shown once).
5. Enter the details in **Add Tenant** and click **Save Tenant**.
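
The choice between Mode A and Mode B could be derived from the environment roughly as follows. This is a sketch only: the response shape of `GET /api/onboarding/status` is not documented here, and requiring all three `ONBOARDING_*` variables for the automated mode is an assumption, as is the function name.

```python
import os
from typing import Mapping

ONBOARDING_VARS = (
    "ONBOARDING_CLIENT_ID",
    "ONBOARDING_CLIENT_SECRET",
    "ONBOARDING_REDIRECT_URI",
)


def onboarding_mode(env: Mapping[str, str] = os.environ) -> str:
    """Return "automated" when every ONBOARDING_* variable is set, else "manual"."""
    configured = all(env.get(name, "").strip() for name in ONBOARDING_VARS)
    return "automated" if configured else "manual"
```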