# TECHNICAL

## Scope

Clearview scans Microsoft 365 for permission deviations across two domains:

1. **SharePoint sites** — deviations relative to the site root permission baseline (libraries, folders, files).
2. **Exchange Online mailboxes** — non-default permissions: Full Access, Send As, Send on Behalf, and folder delegations (Calendar, Inbox).

Designed to monitor multiple customer tenants from a single instance.

## Runtime Architecture

- Single `clearview` application container (no separate API container).
- `postgres` service for persistent job and result storage.
- `adminer` service for direct database inspection.

All services are defined in `stack/docker-compose.yml` for Portainer deployment.

## Application Layout

- `containers/clearview/site/`
  - Frontend UI: vanilla HTML/JS/CSS with a fixed sidebar and hash-based routing.
  - Routes: `#/dashboard`, `#/jobs`, `#/scan/sharepoint`, `#/scan/mailbox`, `#/tenants`, `#/settings`.
- `containers/clearview/src/clearview_app/`
  - FastAPI backend
  - SQLAlchemy models
  - CSV parser (SharePoint URLs and mailbox UPNs)
  - Default-site filtering (SharePoint only)
  - Background worker for long-running scans
- `containers/clearview/src/clearview_app/scanners/`
  - `common.py` — `AuthConfig`, `DeviationRecord`, `ScanResult`, `ProbeResult`, shared helpers.
  - `sharepoint.py` — SharePoint REST scanner, MSAL token cache, hierarchical dedup, SharingLinks helpers.
  - `mailbox.py` — Exchange Online scanner; spawns `pwsh` with the EXO scripts.
  - `exo_scripts/` — PowerShell scripts (`probe.ps1`, `get-permissions.ps1`).
  - Dispatcher: `scanners.scan(scan_type, target, auth, progress)` and `scanners.probe(scan_type, target, auth)`.

## Multi-Tenant Model

Clearview uses **Tenant Profiles** to manage multiple customer tenants from one instance.

### Tenant Profiles

A tenant profile stores the Azure app credentials for one customer tenant:

| Field | Description |
|---|---|
| `name` | Label for internal reference (e.g. "Contoso") |
| `tenant_id` | Azure Directory (tenant) ID |
| `client_id` | Azure App (client) ID |
| `client_secret` | Azure App client secret |

Profiles are managed via the **Tenants** panel in the UI or directly via the API.
When starting a scan, you select a profile from the dropdown — no manual credential entry needed.

Ad-hoc scans without a saved profile are still supported via **Manual credentials** in the scan form.

### API Endpoints — Tenants

```
GET    /api/tenants                              List all profiles (client_secret not returned)
POST   /api/tenants                              Create a new profile
DELETE /api/tenants/{id}                         Delete a profile (jobs are retained, tenant link is cleared)
POST   /api/tenants/{id}/generate-certificate    Generate a self-signed certificate for this tenant
```

### Certificate Authentication

Clearview supports app-only authentication via a self-signed certificate (recommended) or a client secret.

**Generating a certificate:**

1. Click **Certificate** in the Tenants table.
2. Clearview generates a self-signed RSA-2048 certificate valid for 2 years.
3. Download the `.cer` file and upload it in Azure Portal → App registration → Certificates & secrets → Certificates.
4. The private key is stored internally; Clearview uses it automatically when starting a scan.

The scanner uses the certificate path when `cert_thumbprint` is present on the tenant profile; otherwise the client secret is used.

`TenantProfile` authentication fields:

| Field | Description |
|---|---|
| `client_secret` | Azure client secret (optional when a certificate is available) |
| `cert_private_key` | PEM-encoded private key (internal, never exposed via API) |
| `cert_public_pem` | PEM-encoded public certificate (used to build a PFX for Exchange Online PowerShell) |
| `cert_thumbprint` | SHA-1 thumbprint (used by MSAL) |
| `cert_expires_at` | Certificate expiry date |

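
The certificate-or-secret rule above can be sketched as follows. This is a minimal illustration, not Clearview's actual code: `TenantAuth` and `build_client_credential` are hypothetical names, and the sketch assumes the result is passed to MSAL's `ConfidentialClientApplication` as `client_credential`, which accepts either a secret string or a dict with `private_key` and `thumbprint`.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class TenantAuth:
    """Hypothetical stand-in for the auth fields of a tenant profile."""
    client_secret: Optional[str] = None
    cert_private_key: Optional[str] = None  # PEM-encoded private key
    cert_thumbprint: Optional[str] = None   # SHA-1 thumbprint


def build_client_credential(auth: TenantAuth) -> Union[str, dict]:
    """Pick the certificate when a thumbprint is present, else the secret."""
    if auth.cert_thumbprint:
        if not auth.cert_private_key:
            raise ValueError("cert_thumbprint is set but cert_private_key is missing")
        # Dict form understood by MSAL for certificate credentials.
        return {
            "private_key": auth.cert_private_key,
            "thumbprint": auth.cert_thumbprint,
        }
    if auth.client_secret:
        return auth.client_secret
    raise ValueError("tenant profile has neither a certificate nor a client secret")
```

In this sketch the certificate wins even when a client secret is also stored, which matches the "recommended" status of certificate auth described above.
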
## Scan Processing Model

Scans run asynchronously through a DB-backed job queue:

1. User selects a tenant profile (or enters manual credentials) and submits URLs or a CSV.
2. API validates and normalizes URLs.
3. Default sites are skipped by rule (tenant root and app catalog).
4. A scan job is queued in PostgreSQL, linked to the tenant profile when applicable.
5. Background worker processes targets with retries and per-target timeout.
6. API/UI expose progress and deviations per job.

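
Steps 2 and 3 could look roughly like the sketch below. The exact normalization rules and the app catalog path are not specified in this document, so `normalize_site_url`, `is_default_site`, `select_targets`, and the `/sites/appcatalog` path are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed app catalog path; a tenant may host its catalog elsewhere.
APP_CATALOG_PATHS = {"/sites/appcatalog"}


def normalize_site_url(raw: str) -> str:
    """Lower-case the host, drop query/fragment and trailing slashes."""
    parts = urlsplit(raw.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URL: {raw!r}")
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))


def is_default_site(url: str) -> bool:
    """True for the tenant root (empty path) and the assumed app catalog."""
    path = urlsplit(url).path.rstrip("/").lower()
    return path == "" or path in APP_CATALOG_PATHS


def select_targets(raw_urls):
    """Normalize every URL and keep only the non-default sites."""
    normalized = (normalize_site_url(raw) for raw in raw_urls)
    return [url for url in normalized if not is_default_site(url)]
```
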
### Connection Preflight

Before the full scan of a target runs, the worker performs a lightweight probe to verify that the configured credentials can actually reach the site and read role assignments. This catches the common setup errors (missing admin consent, certificate not yet uploaded to Azure, wrong tenant/client ID) early and with a clear message, instead of producing a silent 401 during the full scan.

The probe issues two calls:

1. `GET /_api/web?$select=Title` — validates token + tenant + site URL.
2. `GET /_api/web/roleassignments?$top=1&$select=PrincipalId` — validates that the app actually has permission to read role assignments (not only basic read).

The result is persisted per target in `last_probe_at`, `last_probe_ok`, and `last_probe_message`. If the probe fails, the target is marked `failed` with `error_message = "Preflight: <hint>"` and the full scan is skipped. Hints interpret common HTTP codes:

| Code | Hint |
|---|---|
| 401 on `/_api/web` | Certificate not uploaded in Azure, or wrong tenant/client ID |
| 401 on `/roleassignments` | Admin consent missing, or granted permission too low |
| 403 | App has no access to this site (e.g. `Sites.Selected` without a per-site grant) |
| 404 | Site not found |

The same probe is exposed as an on-demand **Test connection** action on each target in the Job Details UI (see API Endpoints below). The action is blocked while the job is still queued or running.

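
The hint table above maps onto a small lookup function. The sketch below is illustrative only: `preflight_hint` and `preflight_error_message` are hypothetical names, and the fallback text for unlisted status codes is an assumption.

```python
def preflight_hint(status_code: int, endpoint: str) -> str:
    """Translate a failed probe call into the hint from the table above."""
    if status_code == 401:
        # The second probe call targets /_api/web/roleassignments.
        if "/roleassignments" in endpoint.lower():
            return "Admin consent missing, or granted permission too low"
        return "Certificate not uploaded in Azure, or wrong tenant/client ID"
    if status_code == 403:
        return ("App has no access to this site "
                "(e.g. Sites.Selected without a per-site grant)")
    if status_code == 404:
        return "Site not found"
    # Assumed fallback for codes the table does not list.
    return f"Unexpected HTTP {status_code}"


def preflight_error_message(status_code: int, endpoint: str) -> str:
    """Build the value stored in error_message for a failed target."""
    return f"Preflight: {preflight_hint(status_code, endpoint)}"
```
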
### Timeout and Retry Controls

Configured through environment variables (defaults shown):

| Variable | Default | Description |
|---|---|---|
| `SCAN_TARGET_TIMEOUT_SEC` | `3600` | Max seconds per target before it is marked failed |
| `SCAN_TARGET_MAX_RETRIES` | `2` | Number of retries on transient failure |
| `SCAN_RETRY_BASE_DELAY_SEC` | `2` | Base delay for exponential back-off between retries |
| `SCAN_JOB_POLL_INTERVAL_SEC` | `3` | How often the worker polls for new queued jobs |
| `SCAN_HTTP_TIMEOUT_SEC` | `30` | Per-request HTTP timeout toward SharePoint |
| `SCAN_HTTP_MAX_RETRIES` | `3` | Retries on HTTP 429/503 or connection errors |
| `SCAN_LIST_PAGE_SIZE` | `200` | Items per page when listing library contents |
| `SCAN_MAX_ITEMS_PER_LIST` | `10000` | Cap on items with unique permissions per library |

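
To make the back-off settings concrete, the sketch below computes the wait before each retry. The doubling formula (`base * 2**n`) is an assumption; the document only states that the back-off is exponential. `retry_delays` is a hypothetical helper name.

```python
import os
from typing import List, Optional


def _env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, with a default."""
    return int(os.environ.get(name, default))


def retry_delays(max_retries: Optional[int] = None,
                 base_delay: Optional[int] = None) -> List[int]:
    """Return the delay in seconds before retry 1, 2, ... (assumed doubling)."""
    if max_retries is None:
        max_retries = _env_int("SCAN_TARGET_MAX_RETRIES", 2)
    if base_delay is None:
        base_delay = _env_int("SCAN_RETRY_BASE_DELAY_SEC", 2)
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]
```

Under this assumed formula, the default settings (2 retries, base delay 2) give waits of 2 and 4 seconds.
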
## Deviation Detection

The scanner retrieves SharePoint REST role assignments at four levels:

- Site root
- Document library
- Folder
- File

Only permissions **added** relative to the site root are stored as deviations (`delta_type=added`).
No filesystem/NTFS permission model is used.

### Hierarchical Deduplication

After all deviations for a target are collected they are post-processed: if a `(principal, role)` deviation is already reported at a parent URL (library or folder), the same deviation on child items is suppressed. This prevents an explosion of results when a single folder grant propagates to thousands of files.

Deduplication is pure in-memory post-processing — no additional API calls are made.

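
A minimal in-memory sketch of this suppression rule is shown below. It assumes that parent/child relationships can be derived from URL path prefixes; the `Deviation` tuple and `dedupe_hierarchical` are hypothetical names, not the types in `common.py`.

```python
from typing import Iterator, List, NamedTuple


class Deviation(NamedTuple):
    object_url: str
    principal: str
    role_name: str


def _ancestors(url: str) -> Iterator[str]:
    """Yield parent URLs, nearest first, by dropping trailing path segments."""
    url = url.rstrip("/")
    while "/" in url:
        url = url.rsplit("/", 1)[0]
        if url:
            yield url


def dedupe_hierarchical(deviations: List[Deviation]) -> List[Deviation]:
    """Drop deviations whose (principal, role) is already reported on a parent."""
    reported = {
        (d.object_url.rstrip("/"), d.principal, d.role_name) for d in deviations
    }
    kept = []
    for d in deviations:
        covered = any(
            (parent, d.principal, d.role_name) in reported
            for parent in _ancestors(d.object_url)
        )
        if not covered:
            kept.append(d)
    return kept
```

Checking against every reported ancestor (rather than only the kept ones) gives the same result, because a suppressed ancestor is itself covered by a higher one.
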
### Role Name Normalisation

SharePoint returns role names in the language configured for the tenant. Clearview normalises common Dutch role names to their English equivalents before storing them:

| Dutch | English |
|---|---|
| Volledig beheer | Full Control |
| Bijdragen | Contribute |
| Lezen | Read |
| Bewerken | Edit |
| Ontwerpen | Design |
| Beperkte toegang | Limited Access |
| Goedkeuren | Approve |
| Hiërarchieën beheren | Manage Hierarchy |
| Weergeven alleen | View Only |
| Beperkt lezen | Restricted Read |

Unknown role names are stored as-is.

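
The mapping above amounts to a dictionary lookup with a pass-through default. A minimal sketch, where `normalize_role_name` is a hypothetical function name and whitespace trimming before lookup is an assumption:

```python
DUTCH_TO_ENGLISH_ROLES = {
    "Volledig beheer": "Full Control",
    "Bijdragen": "Contribute",
    "Lezen": "Read",
    "Bewerken": "Edit",
    "Ontwerpen": "Design",
    "Beperkte toegang": "Limited Access",
    "Goedkeuren": "Approve",
    "Hiërarchieën beheren": "Manage Hierarchy",
    "Weergeven alleen": "View Only",
    "Beperkt lezen": "Restricted Read",
}


def normalize_role_name(role_name: str) -> str:
    """Map a known Dutch role name to English; return unknown names unchanged."""
    return DUTCH_TO_ENGLISH_ROLES.get(role_name.strip(), role_name)
```
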
### SharingLinks

SharePoint creates internal groups named `SharingLinks.{guid}.{LinkType}.{guid}` whenever a user shares a file or folder via a sharing link. Clearview detects these and classifies them by risk:

| Link type | Risk | UI colour |
|---|---|---|
| `Anonymous*` | Critical | Red |
| `Flexible` | High | Orange |
| `Organization*` | Low | Blue |
| `Direct*` | Low | Green |

**Resolve Sharing Links** — after a scan completes, the Job Details panel shows a _Resolve Sharing Links_ section listing all SharingLinks types found in the job. The user selects which types to resolve and clicks **Resolve**. Clearview calls `/_api/web/sitegroups/getbyname('{name}')/users` for each unique group using the job's stored credentials and writes the member list to `permission_deviations.resolved_members`. Anonymous links have no resolvable members; their `resolved_members` field is stored as an empty string, displayed as `(public link)` in the UI.

Anonymous and Flexible types are pre-selected by default. Organization and Direct types are available but unchecked by default.

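
The detection and risk classification can be sketched with a regular expression over the group name. This is an illustration under assumptions: `sharing_link_risk` is a hypothetical name, the pattern only relies on the `SharingLinks.{guid}.{LinkType}.{guid}` shape given above, and returning `"Unknown"` for unlisted link types is a choice made for this sketch.

```python
import re
from typing import Optional

# Matches SharingLinks.{guid}.{LinkType}.{guid}; the GUID parts are not validated.
_SHARING_LINK_RE = re.compile(
    r"^SharingLinks\.[^.]+\.(?P<link_type>[A-Za-z]+)\.[^.]+$"
)

# Prefix -> risk, following the table above (e.g. "Anonymous*" is a prefix match).
_RISK_BY_PREFIX = (
    ("Anonymous", "Critical"),
    ("Flexible", "High"),
    ("Organization", "Low"),
    ("Direct", "Low"),
)


def sharing_link_risk(principal: str) -> Optional[str]:
    """Return the risk level for a SharingLinks group, or None for other principals."""
    match = _SHARING_LINK_RE.match(principal)
    if match is None:
        return None
    link_type = match.group("link_type")
    for prefix, risk in _RISK_BY_PREFIX:
        if link_type.startswith(prefix):
            return risk
    return "Unknown"  # assumed fallback for link types not listed in the table
```
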
### API Endpoints — Scan Jobs

```
GET    /api/scan-jobs                                       List jobs (optional ?tenant_profile_id=)
POST   /api/scan-jobs                                       Create job from URLs
POST   /api/scan-jobs/import-csv                            Create job from CSV upload
GET    /api/scan-jobs/{id}                                  Get job detail (targets + deviations)
POST   /api/scan-jobs/{id}/cancel                           Cancel a queued or running job
DELETE /api/scan-jobs/{id}                                  Delete a completed job and all its data
POST   /api/scan-jobs/{id}/resolve-sharing-links            Resolve SharingLinks group members post-scan
POST   /api/scan-jobs/{id}/targets/{tid}/test-connection    Re-run the connection preflight for one target
GET    /api/scan-jobs/{id}/export                           Download deviations as .xlsx (optional ?site_url=)
```

## Job Details UI

The **Selected Job Details** panel provides:

- **Site filter** — dropdown populated from the job's targets; filters both the Targets and Deviations tables client-side without a new API call.
- **Export Excel** — downloads a `.xlsx` with two sheets:
  - _Targets_: URL, status, attempts, error, timestamps
  - _Deviations_: Site URL, Object URL (relative to site), Object Type, Principal, Link Risk (colour-coded), Resolved Members, Role, Delta — sorted by Site URL → Object URL → Principal
- **Resolve Sharing Links** — see SharingLinks section above.

## CSV Import

Expected input is Microsoft Sites export format.

- URL column is auto-detected (`URL` / `Site URL` / `SiteUrl`).
- UTF-8 BOM is supported.
- Duplicate URLs are de-duplicated.

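
The three rules above can be illustrated with the standard `csv` module. This is a sketch, not the code in `csv_import.py`: `extract_site_urls` is a hypothetical name, and both the case-insensitive header match and the exact-string duplicate check are assumptions.

```python
import csv
import io
from typing import List

# Accepted header names, compared case-insensitively (an assumption).
URL_COLUMNS = ("url", "site url", "siteurl")


def extract_site_urls(data: bytes) -> List[str]:
    """Return the unique URLs from a Sites export, in first-seen order."""
    # "utf-8-sig" strips a leading UTF-8 BOM if one is present.
    text = data.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text, newline=""))
    column = next(
        (name for name in (reader.fieldnames or [])
         if name.strip().lower() in URL_COLUMNS),
        None,
    )
    if column is None:
        raise ValueError("no URL column found (expected URL, Site URL, or SiteUrl)")
    seen, urls = set(), []
    for row in reader:
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```
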
## Data Model

Main tables:

| Table | Key columns |
|---|---|
| `tenant_profiles` | credentials, `cert_private_key`, `cert_public_pem`, `cert_thumbprint`, `cert_expires_at` |
| `scan_jobs` | `status`, `scan_type` (`sharepoint`/`mailbox`), `tenant_profile_id`, progress counters, auth credentials |
| `scan_targets` | `job_id`, `site_url` (holds UPN for mailbox jobs), `status`, `attempts`, `error_message`, `last_probe_at`, `last_probe_ok`, `last_probe_message` |
| `permission_deviations` | `job_id`, `site_url`, `object_url`, `object_type`, `principal`, `role_name`, `delta_type`, `permission_type`, `resolved_members` |

Scan jobs, targets, and deviations are cascade-deleted when a job is removed via `DELETE /api/scan-jobs/{id}`. Jobs with status `queued` or `running` cannot be deleted.

Schema migrations for new columns are applied automatically on startup via `_ensure_schema_columns()` in `main.py`.

## Mailbox Scanning

Mailbox scans use Exchange Online PowerShell with certificate-based app-only auth.

### What is collected

| Permission | PowerShell source | `permission_type` value |
|---|---|---|
| Full Access (and other mailbox-level rights) | `Get-MailboxPermission` | `FullAccess` |
| Send As | `Get-RecipientPermission` (`AccessControlType=Allow`) | `SendAs` |
| Send on Behalf | mailbox property `GrantSendOnBehalfTo` | `SendOnBehalf` |
| Folder delegation — Calendar | `Get-MailboxFolderPermission "<upn>:\Calendar"` | `Folder:Calendar` |
| Folder delegation — Inbox | `Get-MailboxFolderPermission "<upn>:\Inbox"` | `Folder:Inbox` |

The scanner filters out `NT AUTHORITY\SELF`, `S-1-5-*` SIDs, inherited mailbox permissions, and the default folder principals (`Default`, `Anonymous` with `None` rights). What remains is stored as deviations on the job — there is no SharePoint-style root baseline; every non-default principal counts.

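
The filtering rule above can be sketched as a predicate over collected entries. Everything here is illustrative: `MailboxEntry`, `is_default_entry`, and `non_default_entries` are hypothetical names, and the sketch reads the rule as applying the `None` rights check to both `Default` and `Anonymous`, which is one possible interpretation of the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class MailboxEntry:
    principal: str
    permission_type: str               # e.g. "FullAccess", "Folder:Calendar"
    access_rights: Tuple[str, ...] = ()
    is_inherited: bool = False


def is_default_entry(entry: MailboxEntry) -> bool:
    """True when the entry is a built-in or default permission to be skipped."""
    principal = entry.principal.strip()
    is_folder = entry.permission_type.startswith("Folder:")
    if principal.upper() == r"NT AUTHORITY\SELF":
        return True
    if principal.upper().startswith("S-1-5-"):
        return True
    if entry.is_inherited and not is_folder:
        return True
    if is_folder and principal.lower() in ("default", "anonymous"):
        # Assumed reading: only skip default principals that hold no rights.
        return all(right.lower() == "none" for right in entry.access_rights)
    return False


def non_default_entries(entries: Iterable[MailboxEntry]) -> List[MailboxEntry]:
    """Keep only the entries that would be stored as deviations."""
    return [entry for entry in entries if not is_default_entry(entry)]
```
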
### Authentication

Mailbox scanning uses the **same tenant certificate** as SharePoint, but Exchange Online requires a `.pfx` rather than a thumbprint + raw private key. At scan time Clearview builds an in-memory PFX from `cert_private_key` + `cert_public_pem` (random password), writes it to a tempdir, and removes it immediately after the `pwsh` process exits.

### Targets

Three ways to seed a mailbox scan job:

1. **Manual UPNs** — paste one UPN per line.
2. **CSV import** — column `UserPrincipalName` / `Email` / `Mailbox` / `Primary SMTP Address` (auto-detected, case-insensitive).
3. **All mailboxes in tenant** — Clearview enumerates every mailbox via `Get-EXOMailbox -ResultSize Unlimited` and queues one target per mailbox. Requires the tenant's primary domain (e.g. `contoso.onmicrosoft.com`) so `Connect-ExchangeOnline -Organization` can authenticate. Capped at 50000 mailboxes per job.

### Required Azure permissions

In addition to the SharePoint setup the scan app needs:

- API permission: **Office 365 Exchange Online → Application permissions → `Exchange.ManageAsApp`** (admin-consented).
- Entra role assigned to the app's service principal: **Exchange Administrator** (cannot be granted via Microsoft Graph; must be assigned in Azure Portal → Entra ID → Roles and administrators).

### Runtime requirements

The container image installs:

- **PowerShell 7 (`pwsh`)** from the official Microsoft package repo.
- **`ExchangeOnlineManagement`** module from PSGallery (`Install-Module -Scope AllUsers`).

Together these add roughly 150 MB to the image. Without them, mailbox probes return `pwsh not available in runtime` and scans fail.

### Probe

Mailbox preflight runs `probe.ps1` which connects to Exchange Online and calls `Get-EXOMailbox -Identity <upn> -PropertySets Minimum`. Failure hints map common errors:

| Error fragment | Hint |
|---|---|
| `Unauthorized` / `401` / `AADSTS*` | Check `Exchange.ManageAsApp` permission, admin consent, and the Exchange Administrator role assignment |
| `Couldn't find object` / `not found` | Mailbox does not exist in this tenant |
| `module not available` | `ExchangeOnlineManagement` PS module missing in the container |

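
The fragment-to-hint table above could be applied to the probe's error output roughly as follows. This is a sketch under assumptions: `mailbox_probe_hint` is a hypothetical name, matching is done as a case-insensitive substring search, and returning the raw error text when nothing matches is a choice made for this example.

```python
# Each entry: (fragments to look for, hint to report), checked in table order.
_MAILBOX_HINTS = (
    (
        ("unauthorized", "401", "aadsts"),
        "Check Exchange.ManageAsApp permission, admin consent, "
        "and the Exchange Administrator role assignment",
    ),
    (
        ("couldn't find object", "not found"),
        "Mailbox does not exist in this tenant",
    ),
    (
        ("module not available",),
        "ExchangeOnlineManagement PS module missing in the container",
    ),
)


def mailbox_probe_hint(error_text: str) -> str:
    """Return the first matching hint, or the raw error text as a fallback."""
    lowered = error_text.lower()
    for fragments, hint in _MAILBOX_HINTS:
        if any(fragment in lowered for fragment in fragments):
            return hint
    return error_text
```
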
## Build and Release

Run `./build-and-push.sh` from the repo root. The script is sourced from the shared script in `/docker/develop/shared-integrations/tooling/docker-build-and-push/`.

- `./build-and-push.sh t` — test build, push `:dev` tag only.
- `./build-and-push.sh r` — release build, parses the version from `docs/changelog.md` (first `## vX.Y.Z` heading), pushes `:<version>`, `:dev`, and `:latest`.

The script performs no git operations. After a successful release, run the `git commit` / `git tag` / `git push --tags` commands the script prints in its summary.

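
The version lookup used by the release build can be illustrated in Python, although the real logic lives in the shell script. `release_version` is a hypothetical helper, and returning the version without the leading `v` is an assumption about how the image tag is formed.

```python
import re

# First "## vX.Y.Z" heading in the changelog, anchored at the start of a line.
_VERSION_HEADING = re.compile(r"^## v(\d+\.\d+\.\d+)\b", re.MULTILINE)


def release_version(changelog_text: str) -> str:
    """Return X.Y.Z from the first '## vX.Y.Z' heading in the changelog."""
    match = _VERSION_HEADING.search(changelog_text)
    if match is None:
        raise ValueError("no '## vX.Y.Z' heading found in changelog")
    return match.group(1)
```
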
## Current Scan Mode

`SHAREPOINT_SCAN_MODE=sharepoint_app_only` is active by default.

Azure app-only credentials are resolved per scan job from the linked tenant profile, or from the raw credentials submitted with the job when no profile is used.

### Entra App Registration — two modes

The UI automatically detects which mode is active via `GET /api/onboarding/status`.
The onboarding flow is accessed from the **Add Tenant** form in the Tenants panel.

#### Mode A — Automated (platform app configured)

Requires a pre-registered Clearview platform app in Azure AD with permission to create apps in customer tenants (`Application.ReadWrite.All` on Microsoft Graph).

Set the following in `stack/.env`:

```
ONBOARDING_CLIENT_ID=<platform-app-client-id>
ONBOARDING_CLIENT_SECRET=<platform-app-client-secret>
ONBOARDING_REDIRECT_URI=https://<your-clearview-domain>/api/onboarding/microsoft/callback
```

Flow per customer tenant:

1. Click **Add Tenant** in the UI and then **Connect Microsoft**.
2. Approve admin consent in the customer's Microsoft tenant.
3. UI receives tenant context from the OAuth callback and pre-fills the tenant ID.
4. Click **Create Scan App Automatically** to create a tenant-local scan app via Graph API.
5. Clearview assigns SharePoint `Sites.FullControl.All` and generates a client secret.
6. Enter a name and click **Save Tenant** to store the profile.

#### Mode B — Manual (no platform app configured)

When `ONBOARDING_*` env vars are empty the UI shows step-by-step instructions to create the scan app manually per customer tenant:

1. Azure Portal → Entra ID → App registrations → New registration (Single tenant).
2. Copy Directory (tenant) ID and Application (client) ID from the Overview page.
3. API permissions → Add → SharePoint → Application permissions → `Sites.FullControl.All` → Grant admin consent.
4. Certificates & secrets → New client secret → copy the value (shown once).
5. Enter the details in **Add Tenant** and click **Save Tenant**.