# Images & visitor metrics — policy of record
Encodes the decisions from the 2026-06-30 Codex audit. Update this doc when the policy changes.
## Article images — source-level rights policy
We do **not** blanket-cache publisher images. Caching (re-hosting a copy on our origin) is
**opt-in per source**; the safe default is to display the publisher's own image without copying it.
`sources.image_policy` (one of):
| policy | meaning | when to use |
|---|---|---|
| `cache` | Re-host a downscaled WebP (≤800px) at `/api/img/<id>` | ONLY sources we've cleared: open license (CC etc.), explicit permission, public domain (including government public-domain material), or our own material |
| `remote` (**default**) | Hotlink the publisher's image URL (with the frontend's graceful retry) | Anything not explicitly cleared — display, never copy |
| `none` | No image (typographic topic cover) | Sources whose terms don't support display, or known-bad imagery |
**Default for new/unknown sources is `remote`.** Nothing is re-hosted until an admin sets a
source to `cache` (admin → Sources → expand a source → Images selector, or
`POST /api/admin/sources/{id}/image-policy {"policy": "cache"}`).
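A minimal sketch of building that admin request in Python, for scripting a policy change. The base URL, the integer source id, and the `build_policy_request` helper are assumptions for illustration; admin authentication is omitted because this doc does not describe it.

```python
import json
import urllib.request

# The three policy values from the table above.
VALID_POLICIES = {"cache", "remote", "none"}


def build_policy_request(base_url: str, source_id: int, policy: str) -> urllib.request.Request:
    """Build (but do not send) the POST that sets a source's image policy.

    Hypothetical helper: base_url and the shape of source_id are assumed.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown image policy: {policy!r}")
    body = json.dumps({"policy": policy}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/admin/sources/{source_id}/image-policy",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host; urllib.request.urlopen(req) would actually send it.
req = build_policy_request("https://example.invalid", 42, "cache")
```

Validating the policy client-side keeps a typo from ever reaching the endpoint, which matters here because `cache` is the one value with rights implications.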
Why conservative: fair use is case-specific and balances four factors; "reduced + attributed +
linked" is good practice but **not permission**, and the search-engine-thumbnail precedent
(Perfect 10 v. Amazon) turned on a specifically transformative image-search function, not a
universal thumbnail exemption. So re-hosting waits until a source has been cleared on its own rights basis.
### How resolution works
- `newsimg.display_url(article_id, image_policy, raw_url)` returns the display URL: `/api/img/<id>`
for `cache`, the publisher URL for `remote`, `None` for `none`/no-image.
- Applied server-side in `Article.from_row` (feed/brief/history) and `share.py` (the `/a/<id>`
page). `og:image`/`twitter:image` always reference the **publisher's own** image URL (a link,
not a copy) — never our cached path — so social crawlers don't hit our endpoint.
- The frontend uses the resolved `image_url` as-is; the hub probe-retry + ArticleCard `onerror`
cover slow/failed loads for both cached and hotlinked URLs.
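The resolution rule in the first bullet can be sketched as below. This is an illustration of the described behavior, not the actual `newsimg.display_url` body; treating an unrecognized policy like `remote` is an assumption, chosen to match the "display, never copy" default.

```python
from typing import Optional


def display_url(article_id: int, image_policy: str, raw_url: Optional[str]) -> Optional[str]:
    """Return the URL the frontend should render, or None for no image."""
    # No publisher image, or a source that must not show one.
    if not raw_url or image_policy == "none":
        return None
    # Cleared sources are served from our re-hosted copy.
    if image_policy == "cache":
        return f"/api/img/{article_id}"
    # "remote" (the default) hotlinks the publisher's image.
    # Assumption: unknown values also fall back to hotlinking, never copying.
    return raw_url
```

For example, `display_url(7, "cache", "https://cdn.example/a.jpg")` yields `/api/img/7`, while the same call with `"remote"` returns the publisher URL unchanged.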
### Cache mechanics (`goodnews/newsimg.py`)
- **The cycle owns all fetching** (`warm()`), gated on `image_policy='cache'`, under the cycle lock.
`GET /api/img/<id>` serves **cache hits only** (never fetches), restricted to accepted + canonical
articles whose source is `cache`. No SSRF/worker-exhaustion surface on the public endpoint.
- `_safe_fetch`: http(s) only; `_host_is_public` is checked on every redirect hop (HTTPError-based
redirects are followed); body size is capped. A 4xx other than 429 is permanent (negative-cached via
`<sha1>.fail`); 429, 5xx, and network errors are transient.
- `_encode`: decoded raster → WebP only; rejects SVG and undecodable input; the pixel ceiling
(reject when `w*h > _MAX_PIXELS`) is enforced **before** decode. Originals are never retained.
- Bounded: hard size cap (default 1 GB, `GOODNEWS_IMG_CACHE_CAP`) with LRU eviction; `.fail`
markers swept after `_FAIL_TTL_S`. `data/img_cache/` is gitignored (runtime data).
- **Revocation:** when a source leaves `cache` (set to `remote`/`none` in admin), the endpoint
calls `newsimg.purge_source()` to delete that source's re-hosted copies **immediately** — they
don't linger on disk. (Setting *to* `cache` just flips the flag; the cycle warms it.)
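Two of the fetch-safety rules above, sketched under stated assumptions: a public-host check in the spirit of `_host_is_public`, and the permanent-versus-transient split for HTTP error statuses. The function names and the use of `ipaddress.is_global` are illustrative choices, not the actual code in `goodnews/newsimg.py`.

```python
import ipaddress
import socket


def host_is_public(host: str) -> bool:
    """True only if every address the host resolves to is globally routable.

    Sketch only: a real fetcher should also connect to the address it
    checked, so DNS cannot change between the check and the request.
    """
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # unresolvable hosts are never fetched
    # sockaddr[0] is the address; strip any IPv6 zone id ("%eth0").
    addrs = {info[4][0].split("%")[0] for info in infos}
    return bool(addrs) and all(ipaddress.ip_address(a).is_global for a in addrs)


def is_permanent_failure(status: int) -> bool:
    """4xx other than 429 is permanent; 429 and 5xx are transient."""
    return 400 <= status < 500 and status != 429
```

Running the host check on every redirect hop matters because a public URL can redirect to a loopback or link-local address; checking only the first URL would miss that.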
## Visitor metrics — Recorded visits vs Engaged readers
A JS-capable bot can trip the visit beacon, so the admin shows two numbers:
- **Recorded visits**: raw count, one daily `visit` beacon per device. Known-bot User-Agents are
  filtered at `/api/events` (`queries.is_bot_ua`), but UA-spoofing bots still land here. Noisy.
- **Engaged readers**: distinct visitor-days with **deliberate** activity (the honest number):
  - the gesture-gated `engaged` beacon (`analytics.armEngaged`, mirrored on the share page), which
    fires once/day only after ~8s visible **and** a real scroll/pointer/key/touch; or
  - a deliberate action: `source_click`, `full_story`, `share_ub`/`copy_source`/`native_share`,
    `replace_used`, `paywall_replace`, `paywalled_source_open`, `not_today`/`less_like_this`/`hide_topic`,
    or a game `started`/`completed`/`shared`.
  - **Never** counts auto-fired `visit`/`summary_viewed`/`open`, `replace_none`, or game `*_arrival`.
  - Defined by `queries.ENGAGED_EVENT_KINDS`; surfaced as `visitors.engaged_today/d7/d30`.
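The difference between the two numbers can be sketched as a small aggregation. The `Event` shape and the in-memory counting are assumptions for illustration; the real definition lives in `queries.ENGAGED_EVENT_KINDS`, and the game event kinds are left out here because their exact names are not given in this doc.

```python
from datetime import date
from typing import List, NamedTuple


class Event(NamedTuple):
    # Hypothetical row shape: salted hash, calendar day, event kind.
    visitor_hash: str
    day: date
    kind: str


# Illustrative subset of the deliberate kinds listed above (game kinds omitted).
ENGAGED_EVENT_KINDS = frozenset({
    "engaged", "source_click", "full_story", "share_ub", "copy_source",
    "native_share", "replace_used", "paywall_replace", "paywalled_source_open",
    "not_today", "less_like_this", "hide_topic",
})


def recorded_visits(events: List[Event]) -> int:
    """Raw count of `visit` beacons (one per device per day)."""
    return sum(1 for e in events if e.kind == "visit")


def engaged_readers(events: List[Event]) -> int:
    """Distinct (visitor, day) pairs with at least one deliberate event."""
    return len({(e.visitor_hash, e.day) for e in events if e.kind in ENGAGED_EVENT_KINDS})


d = date(2026, 7, 1)
sample = [
    Event("a", d, "visit"), Event("a", d, "engaged"), Event("a", d, "source_click"),
    Event("b", d, "visit"), Event("b", d, "summary_viewed"),
    Event("c", d, "visit"), Event("c", d, "replace_none"),
]
```

In `sample`, three devices record a visit but only visitor `a` counts as engaged, and `a` counts once despite two deliberate events on the same day.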
**Warm-up caveat:** the `engaged` beacon began **2026-06-30**, so rolling windows fill over time —
a low `engaged_d7`/`engaged_d30` is partly warm-up, NOT proof the gap to recorded visits was all
bots. Compare `d7` after a full week, `d30` after thirty days. (Admin shows this note inline.)
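The warm-up arithmetic can be made explicit with a small check. This sketch assumes a rolling N-day window covers today plus the N-1 days before it, and that the start date counts as a full day; neither detail is confirmed by this doc.

```python
from datetime import date, timedelta

# Start date of the `engaged` beacon, from the caveat above.
ENGAGED_BEACON_START = date(2026, 6, 30)


def window_is_warm(today: date, window_days: int) -> bool:
    """True once every day in the rolling window falls on or after the beacon start."""
    first_day = today - timedelta(days=window_days - 1)
    return first_day >= ENGAGED_BEACON_START
```

Under these assumptions `engaged_d7` is first comparable on 2026-07-06 and `engaged_d30` on 2026-07-29; before those dates a low value partly reflects missing days.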
Privacy unchanged: only a salted `visitor_hash` is stored (no IP, no raw token, no fingerprint).
### Referrer suppression on remote images
Every on-site image request for a `remote` source sets `referrerpolicy="no-referrer"` so the
publisher CDN doesn't get the referring URL: article cards, the share page, AND the homepage hero
(converted from a CSS `background-image` to a real `<img>` — the retry probe sets
`probe.referrerPolicy='no-referrer'` too). This hides the *referrer*, **not** the visitor's IP —
any remote image necessarily exposes the IP to the CDN. For zero third-party image requests, the
source must be `none` or explicitly cleared for local caching (`cache`).
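A sketch of the markup rule, written as a server-side Python helper for consistency with the other examples (the site's actual card rendering is frontend code and is not shown in this doc). `render_img` and its parameters are hypothetical.

```python
from html import escape
from typing import Optional


def render_img(image_url: Optional[str], image_policy: str, alt: str = "") -> str:
    """Render an <img> tag; `remote` images suppress the referrer."""
    if not image_url or image_policy == "none":
        return ""  # typographic cover instead of an image
    attrs = [
        f'src="{escape(image_url, quote=True)}"',
        f'alt="{escape(alt, quote=True)}"',
    ]
    if image_policy == "remote":
        # Hides the referring page URL from the publisher CDN.
        # It cannot hide the visitor's IP address.
        attrs.append('referrerpolicy="no-referrer"')
    return f"<img {' '.join(attrs)}>"
```

Using a real `<img>` element is what makes this possible: the attribute belongs to the element, which is why the hero was converted away from a CSS `background-image`.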