images: harden the cache per Codex audit (SSRF-safe, cache-only endpoint, WebP-only)

Blocker fixes for the image cache:
- /api/img/{id} now serves cache HITS ONLY and is restricted to ACCEPTED, CANONICAL
  articles. It never fetches — the cycle (newsimg.warm) owns all fetching — so the
  public endpoint has no SSRF/worker-exhaustion surface. Dropped 1-year immutable
  caching (image_url can change) → public, max-age=86400.
- newsimg._safe_fetch: SSRF-safe (reuses enrich._host_is_public + _NoRedirect, http(s)
  only, every redirect hop re-validated, body capped). _FetchError distinguishes
  permanent refusals (negative-cached via a .fail marker) from transient errors (retry).
- _encode re-encodes only decoded RASTER images to WebP and REJECTS everything else
  (SVG, undecodable, decompression bombs via MAX_IMAGE_PIXELS, pathological dimensions);
  originals are never retained. prune() also sweeps stale .fail markers.
- Concurrency: fetching only runs inside the cycle lock; writes stay atomic.

Smaller fixes:
- share.py visible image has onerror→this.remove() (degrade to the text unfurl, no
  broken icon when an image isn't cached yet).
- share-page Back follows history only on a SAME-ORIGIN referrer (never bounce to an
  external site); menu now honors Escape + resets crossing back to desktop (HubBar parity).

Tests: private host, redirect-to-private, hostile SVG/non-image, transient-vs-permanent
failure, LRU prune, warm (accepted+canonical only, idempotent), cache-only endpoint
(404 on not-cached/unaccepted/duplicate, never fetches), share chrome parity. 441 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
jay
2026-06-30 12:19:57 -04:00
parent c350a2713b
commit a55ba185a8
5 changed files with 278 additions and 128 deletions
+15 -13
View File
@@ -2339,22 +2339,24 @@ def create_app() -> FastAPI:
@app.api_route("/api/img/{article_id}", methods=["GET", "HEAD"])
def article_image(article_id: int) -> FileResponse:
"""Serve a locally-cached, downscaled copy of an article's image instead of
hotlinking the source. Allowlisted by construction: the id resolves to a URL
already in our corpus (no open proxy). A miss fetches+caches once; a hard
failure 404s, which the frontend handles (retry, then the typo cover)."""
"""Serve the locally-cached, downscaled WebP for an accepted, canonical article.
Cache HITS ONLY — this never fetches (the cycle owns all fetching via newsimg.warm),
so the public endpoint has no SSRF/worker-exhaustion surface. A miss 404s; the
frontend handles that (retry, then the typographic cover)."""
with get_conn() as conn:
row = conn.execute("SELECT image_url FROM articles WHERE id = ?", (article_id,)).fetchone()
row = conn.execute(
"SELECT a.image_url FROM articles a JOIN article_scores s ON s.article_id = a.id "
"WHERE a.id = ? AND s.accepted = 1 AND a.duplicate_of IS NULL",
(article_id,),
).fetchone()
url = row["image_url"] if row else None
if not url:
raise HTTPException(status_code=404, detail="no image for this article")
path = newsimg.get_or_fetch(url)
path = newsimg.path_for(url) if url else None
if not path:
raise HTTPException(status_code=404, detail="image unavailable")
media = "image/webp" if path.suffix == ".webp" else None
# Stable per article id (image_url doesn't change) → cache hard at the edge/browser.
return FileResponse(str(path), media_type=media,
headers={"Cache-Control": "public, max-age=31536000, immutable"})
raise HTTPException(status_code=404, detail="image not cached")
# NOT immutable: a re-enriched article can change image_url, so the bytes behind a
# given id can change. A day's browser cache is plenty (we're direct-origin, no CDN).
return FileResponse(str(path), media_type="image/webp",
headers={"Cache-Control": "public, max-age=86400"})
@app.api_route("/api/art/image/{object_id}", methods=["GET", "HEAD"])
def art_image(object_id: int, size: str = Query("")) -> FileResponse:
+129 -65
View File
@@ -1,40 +1,53 @@
"""Local image cache + downscale for news article images.
The hub, feed, and article pages used to hotlink each article's image_url straight
from the source's server, so a slow / rate-limited / flaky third-party CDN left a
blank graphic until a refresh. Instead we cache a downscaled display copy on our own
origin (beside the DB, like art_cache) and serve that. The cache is bounded by a HARD
size ceiling with LRU eviction (prune), so it can't grow without limit no matter the
ingest rate. Network + Pillow calls are isolated so tests can monkeypatch them.
Article images used to be hotlinked from the source, so a slow/flaky third-party CDN
left a blank graphic until a refresh. Instead the CYCLE fetches a downscaled WebP copy
to data/img_cache/ (beside the DB, mounted into the API container, mirrors art_cache),
and the API serves only cache HITS — it never fetches, so the public endpoint has no
SSRF or worker-exhaustion surface. The cache is bounded by a hard size ceiling with LRU
eviction, so it can't grow without limit no matter the ingest rate.
Keyed by a hash of the source URL: a given image_url always maps to the same file.
The API resolves an article id -> its image_url (a tight allowlist — we only ever
fetch URLs already in our own corpus, so it is not an open proxy)."""
Security posture (the fetch runs only in the trusted cycle, but feed image URLs are
still externally supplied, so we treat them as untrusted):
* SSRF-safe fetch reuses enrich._host_is_public + bounded redirect re-validation
(same path as feeds.safe_fetch_feed) — no private/loopback/link-local targets,
http(s) only, every redirect hop re-checked.
* Only successfully-decoded RASTER images are re-encoded to WebP and stored; SVG and
anything undecodable is REJECTED (never retained as a same-origin file).
* Decompression-bomb + dimension guards.
* Definitive failures are negative-cached (a .fail marker) so a bad URL isn't refetched
every cycle; transient network errors are not, so they retry.
Concurrency: all fetching happens inside the cycle, which holds an exclusive lock, so no
two fetches race; writes are atomic (temp + rename) regardless.
"""
from __future__ import annotations
import hashlib
import io
import os
import time
import urllib.error
import urllib.request
from pathlib import Path
from urllib.parse import urljoin, urlsplit
from .enrich import MAX_REDIRECTS, _NoRedirect, _host_is_public
_UA = {"User-Agent": "upbeatBytes/1.0 (+https://upbeatbytes.com)"}
_MIN_IMAGE_BYTES = 500
_MAX_FETCH_BYTES = 20 * 1024 * 1024 # never pull an absurd original into memory
_MAX_PIXELS = 50_000_000 # decompression-bomb ceiling (≈50 MP)
_MAX_DIM = 12000 # reject pathological single-axis dimensions
DISPLAY_WIDTH = 800 # cards / feed never show wider than this
WEBP_QUALITY = 80
DEFAULT_CAP_BYTES = 1024 * 1024 * 1024 # 1 GB hard ceiling (override via env)
_FAIL_TTL_S = 3 * 24 * 3600 # don't refetch a definitively-bad URL for 3 days
def cache_dir() -> Path:
"""Where cached images live — beside the DB, so the host cycle writes and the API
container reads the same mounted volume (mirrors art.cache_dir)."""
override = os.environ.get("GOODNEWS_IMG_CACHE")
if override:
d = Path(override)
else:
db = Path(os.environ.get("GOODNEWS_DB", "data/goodnews.sqlite3"))
d = db.parent / "img_cache"
db = Path(os.environ.get("GOODNEWS_DB", "data/goodnews.sqlite3"))
d = Path(override) if override else db.parent / "img_cache"
d.mkdir(parents=True, exist_ok=True)
return d
@@ -50,19 +63,54 @@ def _key(url: str) -> str:
return hashlib.sha1(url.encode("utf-8")).hexdigest()
def _http_bytes(url: str, timeout: int = 12) -> tuple[bytes, str]:
req = urllib.request.Request(url, headers=_UA)
with urllib.request.urlopen(req, timeout=timeout) as r:
return r.read(_MAX_FETCH_BYTES + 1), (r.headers.get("Content-Type") or "")
class _FetchError(Exception):
"""permanent=True → negative-cache (won't retry soon); False → transient, retry."""
def __init__(self, msg: str, permanent: bool):
super().__init__(msg)
self.permanent = permanent
def _safe_fetch(url: str, timeout: int = 12) -> tuple[bytes, str]:
"""SSRF-safe fetch of an untrusted image URL: http(s) only, every redirect hop
re-validated against public IPs, bounded redirects, body capped. Raises _FetchError
(permanent for policy refusals, transient for network errors)."""
opener = urllib.request.build_opener(_NoRedirect)
current = url
for _ in range(MAX_REDIRECTS + 1):
parts = urlsplit(current)
if parts.scheme not in ("http", "https") or not _host_is_public(parts.hostname):
raise _FetchError(f"non-public or non-http(s): {current}", permanent=True)
req = urllib.request.Request(current, headers=_UA)
try:
resp = opener.open(req, timeout=timeout)
except (urllib.error.URLError, OSError, ValueError) as exc:
raise _FetchError(f"fetch failed: {exc}", permanent=False) from exc
status = getattr(resp, "status", 200) or 200
if status in (301, 302, 303, 307, 308):
loc = resp.headers.get("Location")
resp.close()
if not loc:
raise _FetchError("redirect without location", permanent=True)
current = urljoin(current, loc)
continue
try:
return resp.read(_MAX_FETCH_BYTES + 1), (resp.headers.get("Content-Type") or "")
finally:
resp.close()
raise _FetchError("too many redirects", permanent=True)
def _encode(data: bytes) -> bytes | None:
"""Downscale to DISPLAY_WIDTH and re-encode as WebP. None if it isn't a decodable
raster image (e.g. SVG) — the caller then stores the original bytes as-is."""
"""Downscale a decoded RASTER image to DISPLAY_WIDTH and re-encode as WebP. None if
it isn't a decodable raster (e.g. SVG), is a decompression bomb, or has pathological
dimensions — the caller then REJECTS it (never stores arbitrary bytes)."""
try:
from PIL import Image
Image.MAX_IMAGE_PIXELS = _MAX_PIXELS # raise DecompressionBombError past this
im = Image.open(io.BytesIO(data))
im.load()
im.load() # forces decode → catches truncated/bomb here
if im.width > _MAX_DIM or im.height > _MAX_DIM or im.width < 1 or im.height < 1:
return None
if im.mode not in ("RGB", "RGBA"):
im = im.convert("RGBA" if ("A" in im.mode or im.mode == "P") else "RGB")
if im.width > DISPLAY_WIDTH:
@@ -71,63 +119,69 @@ def _encode(data: bytes) -> bytes | None:
out = io.BytesIO()
im.save(out, format="WEBP", quality=WEBP_QUALITY, method=4)
return out.getvalue()
except Exception: # noqa: BLE001 — not a decodable raster image
except Exception: # noqa: BLE001 — UnidentifiedImageError, DecompressionBombError, SVG, truncated …
return None
def _ext_for(ctype: str) -> str:
c = ctype.lower()
if "png" in c:
return ".png"
if "gif" in c:
return ".gif"
if "svg" in c:
return ".svg"
if "webp" in c:
return ".webp"
return ".jpg"
def _fail_path(url: str) -> Path:
return cache_dir() / f"{_key(url)}.fail"
def _mark_failed(url: str) -> None:
try:
_fail_path(url).touch()
except OSError:
pass
def _failed_recently(url: str) -> bool:
try:
return (time.time() - _fail_path(url).stat().st_mtime) < _FAIL_TTL_S
except OSError:
return False
def path_for(url: str) -> Path | None:
"""The cached file for this URL if present (and bump its mtime, the LRU marker)."""
for p in cache_dir().glob(_key(url) + ".*"):
"""The cached WebP for this URL if present (and bump its mtime, the LRU marker).
A pure cache lookup — never fetches."""
if not url:
return None
p = cache_dir() / f"{_key(url)}.webp"
if p.exists():
try:
os.utime(p, None) # touch -> last-used time for LRU eviction
os.utime(p, None) # touch last-used time for LRU eviction
except OSError:
pass
return p
return None
def get_or_fetch(url: str | None) -> Path | None:
"""Cached display copy for a source image URL, fetching + caching on first miss.
Atomic write (temp then rename) so a reader never sees a half-file. None on any
failure — callers (endpoint 404 -> frontend retry/typo cover) degrade gracefully."""
def fetch_and_cache(url: str | None) -> Path | None:
"""Fetch (SSRF-safe), downscale to WebP, and cache atomically. CYCLE-ONLY — the API
endpoint never calls this. None on any failure; definitive failures are negative-cached
so they aren't retried every cycle."""
if not url or not url.startswith(("http://", "https://")):
return None
hit = path_for(url)
if hit:
return hit
try:
data, ctype = _http_bytes(url)
except Exception: # noqa: BLE001 — source down/slow/blocked
data, _ctype = _safe_fetch(url)
except _FetchError as exc:
if exc.permanent:
_mark_failed(url)
return None
if len(data) < _MIN_IMAGE_BYTES or len(data) > _MAX_FETCH_BYTES:
if not (_MIN_IMAGE_BYTES <= len(data) <= _MAX_FETCH_BYTES):
_mark_failed(url)
return None
encoded = _encode(data)
if encoded is not None:
blob, ext = encoded, ".webp"
elif ctype.startswith("image/"):
blob, ext = data, _ext_for(ctype) # couldn't re-encode (e.g. SVG): keep original
else:
blob = _encode(data)
if blob is None: # SVG / undecodable / bomb / bad dimensions
_mark_failed(url)
return None
key = _key(url)
cdir = cache_dir()
key = _key(url)
tmp = cdir / f".{key}.tmp"
dest = cdir / f"{key}{ext}"
dest = cdir / f"{key}.webp"
try:
tmp.write_bytes(blob)
os.replace(tmp, dest) # atomic
os.replace(tmp, dest) # atomic
except OSError:
try:
tmp.unlink()
@@ -138,9 +192,9 @@ def get_or_fetch(url: str | None) -> Path | None:
def warm(conn, limit: int = 200) -> int:
"""Pre-fetch display copies for the newest accepted articles that have an image, so
the FIRST page view is already a local hit (no first-view flakiness). Bounded; skips
already-cached. Returns how many it newly cached."""
"""Pre-fetch display copies for the newest ACCEPTED, CANONICAL articles that have an
image, so the API only ever serves cache hits. Bounded; skips already-cached and
recently-failed URLs. Returns how many it newly cached."""
rows = conn.execute(
"SELECT DISTINCT a.image_url FROM article_scores s JOIN articles a ON a.id = s.article_id "
"WHERE s.accepted=1 AND a.duplicate_of IS NULL AND a.image_url IS NOT NULL "
@@ -150,21 +204,31 @@ def warm(conn, limit: int = 200) -> int:
made = 0
for r in rows:
url = r[0]
if path_for(url):
if path_for(url) or _failed_recently(url):
continue
if get_or_fetch(url):
if fetch_and_cache(url):
made += 1
return made
def prune(cap: int | None = None) -> dict:
"""Enforce the size ceiling: delete least-recently-used files (oldest mtime first)
until the cache is under the cap. Returns {before, after, removed, cap}."""
"""Enforce the size ceiling: delete least-recently-used WebPs (oldest mtime first)
until under the cap; also sweep stale .fail markers. Returns {before, after, removed, cap}."""
if cap is None:
cap = cap_bytes()
now = time.time()
files, total = [], 0
for p in cache_dir().iterdir():
if not p.is_file() or p.name.startswith("."):
if p.name.startswith("."):
continue
if p.suffix == ".fail":
try:
if now - p.stat().st_mtime >= _FAIL_TTL_S:
p.unlink()
except OSError:
pass
continue
if p.suffix != ".webp" or not p.is_file():
continue
try:
st = p.stat()
+13 -4
View File
@@ -98,13 +98,17 @@ _TOP_BAR_CSS = """
"""
# Burger toggle + signed-in avatar (read from the SPA's localStorage cache, same as HubBar).
# Parity with HubBar: Escape closes the menu, and crossing back to desktop width resets it.
_TOP_BAR_JS = """<script>
(function(){
var b=document.querySelector('[data-burger]'), m=document.querySelector('[data-menu]');
function close(){ if(m) m.setAttribute('hidden',''); if(b){ b.classList.remove('open'); b.setAttribute('aria-expanded','false'); } }
if(b&&m){ b.addEventListener('click',function(){
if(m.hasAttribute('hidden')){ m.removeAttribute('hidden'); b.classList.add('open'); b.setAttribute('aria-expanded','true'); }
else { m.setAttribute('hidden',''); b.classList.remove('open'); b.setAttribute('aria-expanded','false'); }
else { close(); }
}); }
document.addEventListener('keydown',function(e){ if(e.key==='Escape') close(); });
if(window.matchMedia){ var mq=window.matchMedia('(min-width:721px)'); if(mq.addEventListener) mq.addEventListener('change',function(e){ if(e.matches) close(); }); }
try{
var u=JSON.parse(localStorage.getItem('goodnews:auth_user')||'null');
if(u&&u.avatar_url){
@@ -117,12 +121,14 @@ _TOP_BAR_JS = """<script>
})();
</script>"""
# Single-history Back: return to where you came from in-app, else home (mirrors HubShell).
# Single-history Back (mirrors HubShell): go back ONLY when we arrived from our own origin,
# else go home — never bounce the reader off to an external referrer.
_BACK_JS = """<script>
(function(){
var b=document.querySelector('[data-back]'); if(!b) return;
b.addEventListener('click',function(){
if(document.referrer && history.length>1){ history.back(); } else { location.href='/'; }
var same=false; try{ same=!!document.referrer && new URL(document.referrer).origin===location.origin; }catch(e){}
if(same && history.length>1){ history.back(); } else { location.href='/'; }
});
})();
</script>"""
@@ -165,8 +171,11 @@ def render_share_page(article: dict, base_url: str, summary: str | None = None,
# The visible image is served from our cached/downscaled copy (not a hotlink), so a
# flaky source CDN can't blank it. og:image/twitter:image above stay the source URL
# so social crawlers fetch the canonical image directly.
# Served from our cache (/api/img/<id>); if it isn't cached yet / fails, drop the
# element so the page degrades to the clean text unfurl rather than a broken icon.
media = (
f'<img class="media" src="/api/img/{aid}" alt="" referrerpolicy="no-referrer">'
f'<img class="media" src="/api/img/{aid}" alt="" referrerpolicy="no-referrer" '
f'onerror="this.remove()">'
if image else ""
)