images: harden the cache per Codex audit (SSRF-safe, cache-only endpoint, WebP-only)

Blocker fixes for the image cache:
- /api/img/{id} now serves cache HITS ONLY and is restricted to ACCEPTED, CANONICAL
  articles. It never fetches — the cycle (newsimg.warm) owns all fetching — so the
  public endpoint has no SSRF/worker-exhaustion surface. Dropped 1-year immutable
  caching (image_url can change) → public, max-age=86400.
- newsimg._safe_fetch: SSRF-safe (reuses enrich._host_is_public + _NoRedirect, http(s)
  only, every redirect hop re-validated, body capped). _FetchError distinguishes
  permanent refusals (negative-cached via a .fail marker) from transient errors (retry).
- _encode re-encodes only decoded RASTER images to WebP and REJECTS everything else
  (SVG, undecodable, decompression bombs via MAX_IMAGE_PIXELS, pathological dimensions);
  originals are never retained. prune() also sweeps stale .fail markers.
- Concurrency: fetching only runs inside the cycle lock; writes stay atomic.

Smaller fixes:
- share.py visible image has onerror→this.remove() (degrade to the text unfurl, no
  broken icon when an image isn't cached yet).
- share-page Back follows history only on a SAME-ORIGIN referrer (never bounce to an
  external site); menu now honors Escape + resets crossing back to desktop (HubBar parity).

Tests: private host, redirect-to-private, hostile SVG/non-image, transient-vs-permanent
failure, LRU prune, warm (accepted+canonical only, idempotent), cache-only endpoint
(404 on not-cached/unaccepted/duplicate, never fetches), share chrome parity. 441 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
jay
2026-06-30 12:19:57 -04:00
parent c350a2713b
commit a55ba185a8
5 changed files with 278 additions and 128 deletions
+15 -13
View File
@@ -2339,22 +2339,24 @@ def create_app() -> FastAPI:
@app.api_route("/api/img/{article_id}", methods=["GET", "HEAD"]) @app.api_route("/api/img/{article_id}", methods=["GET", "HEAD"])
def article_image(article_id: int) -> FileResponse: def article_image(article_id: int) -> FileResponse:
"""Serve a locally-cached, downscaled copy of an article's image instead of """Serve the locally-cached, downscaled WebP for an accepted, canonical article.
hotlinking the source. Allowlisted by construction: the id resolves to a URL Cache HITS ONLY — this never fetches (the cycle owns all fetching via newsimg.warm),
already in our corpus (no open proxy). A miss fetches+caches once; a hard so the public endpoint has no SSRF/worker-exhaustion surface. A miss 404s; the
failure 404s, which the frontend handles (retry, then the typo cover).""" frontend handles that (retry, then the typographic cover)."""
with get_conn() as conn: with get_conn() as conn:
row = conn.execute("SELECT image_url FROM articles WHERE id = ?", (article_id,)).fetchone() row = conn.execute(
"SELECT a.image_url FROM articles a JOIN article_scores s ON s.article_id = a.id "
"WHERE a.id = ? AND s.accepted = 1 AND a.duplicate_of IS NULL",
(article_id,),
).fetchone()
url = row["image_url"] if row else None url = row["image_url"] if row else None
if not url: path = newsimg.path_for(url) if url else None
raise HTTPException(status_code=404, detail="no image for this article")
path = newsimg.get_or_fetch(url)
if not path: if not path:
raise HTTPException(status_code=404, detail="image unavailable") raise HTTPException(status_code=404, detail="image not cached")
media = "image/webp" if path.suffix == ".webp" else None # NOT immutable: a re-enriched article can change image_url, so the bytes behind a
# Stable per article id (image_url doesn't change) → cache hard at the edge/browser. # given id can change. A day's browser cache is plenty (we're direct-origin, no CDN).
return FileResponse(str(path), media_type=media, return FileResponse(str(path), media_type="image/webp",
headers={"Cache-Control": "public, max-age=31536000, immutable"}) headers={"Cache-Control": "public, max-age=86400"})
@app.api_route("/api/art/image/{object_id}", methods=["GET", "HEAD"]) @app.api_route("/api/art/image/{object_id}", methods=["GET", "HEAD"])
def art_image(object_id: int, size: str = Query("")) -> FileResponse: def art_image(object_id: int, size: str = Query("")) -> FileResponse:
+129 -65
View File
@@ -1,40 +1,53 @@
"""Local image cache + downscale for news article images. """Local image cache + downscale for news article images.
The hub, feed, and article pages used to hotlink each article's image_url straight Article images used to be hotlinked from the source, so a slow/flaky third-party CDN
from the source's server, so a slow / rate-limited / flaky third-party CDN left a left a blank graphic until a refresh. Instead the CYCLE fetches a downscaled WebP copy
blank graphic until a refresh. Instead we cache a downscaled display copy on our own to data/img_cache/ (beside the DB, mounted into the API container, mirrors art_cache),
origin (beside the DB, like art_cache) and serve that. The cache is bounded by a HARD and the API serves only cache HITS — it never fetches, so the public endpoint has no
size ceiling with LRU eviction (prune), so it can't grow without limit no matter the SSRF or worker-exhaustion surface. The cache is bounded by a hard size ceiling with LRU
ingest rate. Network + Pillow calls are isolated so tests can monkeypatch them. eviction, so it can't grow without limit no matter the ingest rate.
Keyed by a hash of the source URL: a given image_url always maps to the same file. Security posture (the fetch runs only in the trusted cycle, but feed image URLs are
The API resolves an article id -> its image_url (a tight allowlist — we only ever still externally supplied, so we treat them as untrusted):
fetch URLs already in our own corpus, so it is not an open proxy).""" * SSRF-safe fetch reuses enrich._host_is_public + bounded redirect re-validation
(same path as feeds.safe_fetch_feed) — no private/loopback/link-local targets,
http(s) only, every redirect hop re-checked.
* Only successfully-decoded RASTER images are re-encoded to WebP and stored; SVG and
anything undecodable is REJECTED (never retained as a same-origin file).
* Decompression-bomb + dimension guards.
* Definitive failures are negative-cached (a .fail marker) so a bad URL isn't refetched
every cycle; transient network errors are not, so they retry.
Concurrency: all fetching happens inside the cycle, which holds an exclusive lock, so no
two fetches race; writes are atomic (temp + rename) regardless.
"""
from __future__ import annotations from __future__ import annotations
import hashlib import hashlib
import io import io
import os import os
import time
import urllib.error
import urllib.request import urllib.request
from pathlib import Path from pathlib import Path
from urllib.parse import urljoin, urlsplit
from .enrich import MAX_REDIRECTS, _NoRedirect, _host_is_public
_UA = {"User-Agent": "upbeatBytes/1.0 (+https://upbeatbytes.com)"} _UA = {"User-Agent": "upbeatBytes/1.0 (+https://upbeatbytes.com)"}
_MIN_IMAGE_BYTES = 500 _MIN_IMAGE_BYTES = 500
_MAX_FETCH_BYTES = 20 * 1024 * 1024 # never pull an absurd original into memory _MAX_FETCH_BYTES = 20 * 1024 * 1024 # never pull an absurd original into memory
_MAX_PIXELS = 50_000_000 # decompression-bomb ceiling (≈50 MP)
_MAX_DIM = 12000 # reject pathological single-axis dimensions
DISPLAY_WIDTH = 800 # cards / feed never show wider than this DISPLAY_WIDTH = 800 # cards / feed never show wider than this
WEBP_QUALITY = 80 WEBP_QUALITY = 80
DEFAULT_CAP_BYTES = 1024 * 1024 * 1024 # 1 GB hard ceiling (override via env) DEFAULT_CAP_BYTES = 1024 * 1024 * 1024 # 1 GB hard ceiling (override via env)
_FAIL_TTL_S = 3 * 24 * 3600 # don't refetch a definitively-bad URL for 3 days
def cache_dir() -> Path: def cache_dir() -> Path:
"""Where cached images live — beside the DB, so the host cycle writes and the API
container reads the same mounted volume (mirrors art.cache_dir)."""
override = os.environ.get("GOODNEWS_IMG_CACHE") override = os.environ.get("GOODNEWS_IMG_CACHE")
if override: db = Path(os.environ.get("GOODNEWS_DB", "data/goodnews.sqlite3"))
d = Path(override) d = Path(override) if override else db.parent / "img_cache"
else:
db = Path(os.environ.get("GOODNEWS_DB", "data/goodnews.sqlite3"))
d = db.parent / "img_cache"
d.mkdir(parents=True, exist_ok=True) d.mkdir(parents=True, exist_ok=True)
return d return d
@@ -50,19 +63,54 @@ def _key(url: str) -> str:
return hashlib.sha1(url.encode("utf-8")).hexdigest() return hashlib.sha1(url.encode("utf-8")).hexdigest()
def _http_bytes(url: str, timeout: int = 12) -> tuple[bytes, str]: class _FetchError(Exception):
req = urllib.request.Request(url, headers=_UA) """permanent=True → negative-cache (won't retry soon); False → transient, retry."""
with urllib.request.urlopen(req, timeout=timeout) as r: def __init__(self, msg: str, permanent: bool):
return r.read(_MAX_FETCH_BYTES + 1), (r.headers.get("Content-Type") or "") super().__init__(msg)
self.permanent = permanent
def _safe_fetch(url: str, timeout: int = 12) -> tuple[bytes, str]:
"""SSRF-safe fetch of an untrusted image URL: http(s) only, every redirect hop
re-validated against public IPs, bounded redirects, body capped. Raises _FetchError
(permanent for policy refusals, transient for network errors)."""
opener = urllib.request.build_opener(_NoRedirect)
current = url
for _ in range(MAX_REDIRECTS + 1):
parts = urlsplit(current)
if parts.scheme not in ("http", "https") or not _host_is_public(parts.hostname):
raise _FetchError(f"non-public or non-http(s): {current}", permanent=True)
req = urllib.request.Request(current, headers=_UA)
try:
resp = opener.open(req, timeout=timeout)
except (urllib.error.URLError, OSError, ValueError) as exc:
raise _FetchError(f"fetch failed: {exc}", permanent=False) from exc
status = getattr(resp, "status", 200) or 200
if status in (301, 302, 303, 307, 308):
loc = resp.headers.get("Location")
resp.close()
if not loc:
raise _FetchError("redirect without location", permanent=True)
current = urljoin(current, loc)
continue
try:
return resp.read(_MAX_FETCH_BYTES + 1), (resp.headers.get("Content-Type") or "")
finally:
resp.close()
raise _FetchError("too many redirects", permanent=True)
def _encode(data: bytes) -> bytes | None: def _encode(data: bytes) -> bytes | None:
"""Downscale to DISPLAY_WIDTH and re-encode as WebP. None if it isn't a decodable """Downscale a decoded RASTER image to DISPLAY_WIDTH and re-encode as WebP. None if
raster image (e.g. SVG) — the caller then stores the original bytes as-is.""" it isn't a decodable raster (e.g. SVG), is a decompression bomb, or has pathological
dimensions — the caller then REJECTS it (never stores arbitrary bytes)."""
try: try:
from PIL import Image from PIL import Image
Image.MAX_IMAGE_PIXELS = _MAX_PIXELS # raise DecompressionBombError past this
im = Image.open(io.BytesIO(data)) im = Image.open(io.BytesIO(data))
im.load() im.load() # forces decode → catches truncated/bomb here
if im.width > _MAX_DIM or im.height > _MAX_DIM or im.width < 1 or im.height < 1:
return None
if im.mode not in ("RGB", "RGBA"): if im.mode not in ("RGB", "RGBA"):
im = im.convert("RGBA" if ("A" in im.mode or im.mode == "P") else "RGB") im = im.convert("RGBA" if ("A" in im.mode or im.mode == "P") else "RGB")
if im.width > DISPLAY_WIDTH: if im.width > DISPLAY_WIDTH:
@@ -71,63 +119,69 @@ def _encode(data: bytes) -> bytes | None:
out = io.BytesIO() out = io.BytesIO()
im.save(out, format="WEBP", quality=WEBP_QUALITY, method=4) im.save(out, format="WEBP", quality=WEBP_QUALITY, method=4)
return out.getvalue() return out.getvalue()
except Exception: # noqa: BLE001 — not a decodable raster image except Exception: # noqa: BLE001 — UnidentifiedImageError, DecompressionBombError, SVG, truncated …
return None return None
def _ext_for(ctype: str) -> str: def _fail_path(url: str) -> Path:
c = ctype.lower() return cache_dir() / f"{_key(url)}.fail"
if "png" in c:
return ".png"
if "gif" in c: def _mark_failed(url: str) -> None:
return ".gif" try:
if "svg" in c: _fail_path(url).touch()
return ".svg" except OSError:
if "webp" in c: pass
return ".webp"
return ".jpg"
def _failed_recently(url: str) -> bool:
try:
return (time.time() - _fail_path(url).stat().st_mtime) < _FAIL_TTL_S
except OSError:
return False
def path_for(url: str) -> Path | None: def path_for(url: str) -> Path | None:
"""The cached file for this URL if present (and bump its mtime, the LRU marker).""" """The cached WebP for this URL if present (and bump its mtime, the LRU marker).
for p in cache_dir().glob(_key(url) + ".*"): A pure cache lookup — never fetches."""
if not url:
return None
p = cache_dir() / f"{_key(url)}.webp"
if p.exists():
try: try:
os.utime(p, None) # touch -> last-used time for LRU eviction os.utime(p, None) # touch last-used time for LRU eviction
except OSError: except OSError:
pass pass
return p return p
return None return None
def get_or_fetch(url: str | None) -> Path | None: def fetch_and_cache(url: str | None) -> Path | None:
"""Cached display copy for a source image URL, fetching + caching on first miss. """Fetch (SSRF-safe), downscale to WebP, and cache atomically. CYCLE-ONLY — the API
Atomic write (temp then rename) so a reader never sees a half-file. None on any endpoint never calls this. None on any failure; definitive failures are negative-cached
failure — callers (endpoint 404 -> frontend retry/typo cover) degrade gracefully.""" so they aren't retried every cycle."""
if not url or not url.startswith(("http://", "https://")): if not url or not url.startswith(("http://", "https://")):
return None return None
hit = path_for(url)
if hit:
return hit
try: try:
data, ctype = _http_bytes(url) data, _ctype = _safe_fetch(url)
except Exception: # noqa: BLE001 — source down/slow/blocked except _FetchError as exc:
if exc.permanent:
_mark_failed(url)
return None return None
if len(data) < _MIN_IMAGE_BYTES or len(data) > _MAX_FETCH_BYTES: if not (_MIN_IMAGE_BYTES <= len(data) <= _MAX_FETCH_BYTES):
_mark_failed(url)
return None return None
encoded = _encode(data) blob = _encode(data)
if encoded is not None: if blob is None: # SVG / undecodable / bomb / bad dimensions
blob, ext = encoded, ".webp" _mark_failed(url)
elif ctype.startswith("image/"):
blob, ext = data, _ext_for(ctype) # couldn't re-encode (e.g. SVG): keep original
else:
return None return None
key = _key(url)
cdir = cache_dir() cdir = cache_dir()
key = _key(url)
tmp = cdir / f".{key}.tmp" tmp = cdir / f".{key}.tmp"
dest = cdir / f"{key}{ext}" dest = cdir / f"{key}.webp"
try: try:
tmp.write_bytes(blob) tmp.write_bytes(blob)
os.replace(tmp, dest) # atomic os.replace(tmp, dest) # atomic
except OSError: except OSError:
try: try:
tmp.unlink() tmp.unlink()
@@ -138,9 +192,9 @@ def get_or_fetch(url: str | None) -> Path | None:
def warm(conn, limit: int = 200) -> int: def warm(conn, limit: int = 200) -> int:
"""Pre-fetch display copies for the newest accepted articles that have an image, so """Pre-fetch display copies for the newest ACCEPTED, CANONICAL articles that have an
the FIRST page view is already a local hit (no first-view flakiness). Bounded; skips image, so the API only ever serves cache hits. Bounded; skips already-cached and
already-cached. Returns how many it newly cached.""" recently-failed URLs. Returns how many it newly cached."""
rows = conn.execute( rows = conn.execute(
"SELECT DISTINCT a.image_url FROM article_scores s JOIN articles a ON a.id = s.article_id " "SELECT DISTINCT a.image_url FROM article_scores s JOIN articles a ON a.id = s.article_id "
"WHERE s.accepted=1 AND a.duplicate_of IS NULL AND a.image_url IS NOT NULL " "WHERE s.accepted=1 AND a.duplicate_of IS NULL AND a.image_url IS NOT NULL "
@@ -150,21 +204,31 @@ def warm(conn, limit: int = 200) -> int:
made = 0 made = 0
for r in rows: for r in rows:
url = r[0] url = r[0]
if path_for(url): if path_for(url) or _failed_recently(url):
continue continue
if get_or_fetch(url): if fetch_and_cache(url):
made += 1 made += 1
return made return made
def prune(cap: int | None = None) -> dict: def prune(cap: int | None = None) -> dict:
"""Enforce the size ceiling: delete least-recently-used files (oldest mtime first) """Enforce the size ceiling: delete least-recently-used WebPs (oldest mtime first)
until the cache is under the cap. Returns {before, after, removed, cap}.""" until under the cap; also sweep stale .fail markers. Returns {before, after, removed, cap}."""
if cap is None: if cap is None:
cap = cap_bytes() cap = cap_bytes()
now = time.time()
files, total = [], 0 files, total = [], 0
for p in cache_dir().iterdir(): for p in cache_dir().iterdir():
if not p.is_file() or p.name.startswith("."): if p.name.startswith("."):
continue
if p.suffix == ".fail":
try:
if now - p.stat().st_mtime >= _FAIL_TTL_S:
p.unlink()
except OSError:
pass
continue
if p.suffix != ".webp" or not p.is_file():
continue continue
try: try:
st = p.stat() st = p.stat()
+13 -4
View File
@@ -98,13 +98,17 @@ _TOP_BAR_CSS = """
""" """
# Burger toggle + signed-in avatar (read from the SPA's localStorage cache, same as HubBar). # Burger toggle + signed-in avatar (read from the SPA's localStorage cache, same as HubBar).
# Parity with HubBar: Escape closes the menu, and crossing back to desktop width resets it.
_TOP_BAR_JS = """<script> _TOP_BAR_JS = """<script>
(function(){ (function(){
var b=document.querySelector('[data-burger]'), m=document.querySelector('[data-menu]'); var b=document.querySelector('[data-burger]'), m=document.querySelector('[data-menu]');
function close(){ if(m) m.setAttribute('hidden',''); if(b){ b.classList.remove('open'); b.setAttribute('aria-expanded','false'); } }
if(b&&m){ b.addEventListener('click',function(){ if(b&&m){ b.addEventListener('click',function(){
if(m.hasAttribute('hidden')){ m.removeAttribute('hidden'); b.classList.add('open'); b.setAttribute('aria-expanded','true'); } if(m.hasAttribute('hidden')){ m.removeAttribute('hidden'); b.classList.add('open'); b.setAttribute('aria-expanded','true'); }
else { m.setAttribute('hidden',''); b.classList.remove('open'); b.setAttribute('aria-expanded','false'); } else { close(); }
}); } }); }
document.addEventListener('keydown',function(e){ if(e.key==='Escape') close(); });
if(window.matchMedia){ var mq=window.matchMedia('(min-width:721px)'); if(mq.addEventListener) mq.addEventListener('change',function(e){ if(e.matches) close(); }); }
try{ try{
var u=JSON.parse(localStorage.getItem('goodnews:auth_user')||'null'); var u=JSON.parse(localStorage.getItem('goodnews:auth_user')||'null');
if(u&&u.avatar_url){ if(u&&u.avatar_url){
@@ -117,12 +121,14 @@ _TOP_BAR_JS = """<script>
})(); })();
</script>""" </script>"""
# Single-history Back: return to where you came from in-app, else home (mirrors HubShell). # Single-history Back (mirrors HubShell): go back ONLY when we arrived from our own origin,
# else go home — never bounce the reader off to an external referrer.
_BACK_JS = """<script> _BACK_JS = """<script>
(function(){ (function(){
var b=document.querySelector('[data-back]'); if(!b) return; var b=document.querySelector('[data-back]'); if(!b) return;
b.addEventListener('click',function(){ b.addEventListener('click',function(){
if(document.referrer && history.length>1){ history.back(); } else { location.href='/'; } var same=false; try{ same=!!document.referrer && new URL(document.referrer).origin===location.origin; }catch(e){}
if(same && history.length>1){ history.back(); } else { location.href='/'; }
}); });
})(); })();
</script>""" </script>"""
@@ -165,8 +171,11 @@ def render_share_page(article: dict, base_url: str, summary: str | None = None,
# The visible image is served from our cached/downscaled copy (not a hotlink), so a # The visible image is served from our cached/downscaled copy (not a hotlink), so a
# flaky source CDN can't blank it. og:image/twitter:image above stay the source URL # flaky source CDN can't blank it. og:image/twitter:image above stay the source URL
# so social crawlers fetch the canonical image directly. # so social crawlers fetch the canonical image directly.
# Served from our cache (/api/img/<id>); if it isn't cached yet / fails, drop the
# element so the page degrades to the clean text unfurl rather than a broken icon.
media = ( media = (
f'<img class="media" src="/api/img/{aid}" alt="" referrerpolicy="no-referrer">' f'<img class="media" src="/api/img/{aid}" alt="" referrerpolicy="no-referrer" '
f'onerror="this.remove()">'
if image else "" if image else ""
) )
+109 -46
View File
@@ -1,6 +1,6 @@
"""News image cache: fetch + downscale to WebP, cache-hit (no refetch), graceful """News image cache (hardened): SSRF-safe cycle-only fetch + downscale to WebP, reject
failure, LRU eviction under a size cap, cycle warm, and the /api/img endpoint SVG/non-raster/private-host/redirect-to-private, negative-cache failures, LRU eviction,
(allowlisted to our own corpus).""" and the cache-HITS-ONLY /api/img endpoint (restricted to accepted, canonical articles)."""
import io import io
import os import os
@@ -11,6 +11,10 @@ from fastapi.testclient import TestClient
from goodnews import newsimg from goodnews import newsimg
from goodnews.db import connect, init_db from goodnews.db import connect, init_db
# The real SSRF-safe fetch, captured before any fixture stubs it — the SSRF tests below
# restore it so they exercise the genuine host/redirect validation.
_REAL_SAFE_FETCH = newsimg._safe_fetch
def _png(w=1600, h=1000, color=(40, 130, 173)): def _png(w=1600, h=1000, color=(40, 130, 173)):
out = io.BytesIO() out = io.BytesIO()
@@ -21,81 +25,140 @@ def _png(w=1600, h=1000, color=(40, 130, 173)):
@pytest.fixture @pytest.fixture
def cache(tmp_path, monkeypatch): def cache(tmp_path, monkeypatch):
monkeypatch.setenv("GOODNEWS_IMG_CACHE", str(tmp_path / "img_cache")) monkeypatch.setenv("GOODNEWS_IMG_CACHE", str(tmp_path / "img_cache"))
# default: treat hosts as public + a happy PNG fetch; individual tests override.
monkeypatch.setattr(newsimg, "_host_is_public", lambda h: True)
monkeypatch.setattr(newsimg, "_safe_fetch", lambda url, timeout=12: (_png(), "image/png"))
return tmp_path / "img_cache" return tmp_path / "img_cache"
def test_get_or_fetch_caches_and_downscales(cache, monkeypatch): def test_fetch_and_cache_downscales_to_webp(cache):
calls = [] p = newsimg.fetch_and_cache("https://example.com/big.png")
monkeypatch.setattr(newsimg, "_http_bytes",
lambda url, timeout=12: (calls.append(url), (_png(), "image/png"))[1])
p = newsimg.get_or_fetch("https://example.com/big.png")
assert p and p.exists() and p.suffix == ".webp" assert p and p.exists() and p.suffix == ".webp"
with Image.open(p) as im: # downscaled + re-encoded with Image.open(p) as im:
assert im.width == newsimg.DISPLAY_WIDTH and im.format == "WEBP" assert im.width == newsimg.DISPLAY_WIDTH and im.format == "WEBP"
assert newsimg.get_or_fetch("https://example.com/big.png") == p # cache hit assert newsimg.path_for("https://example.com/big.png") == p # pure cache lookup
assert len(calls) == 1 # ...not refetched
def test_get_or_fetch_rejects_non_image_and_bad_scheme(cache, monkeypatch): def test_bad_scheme_and_empty_are_rejected(cache):
monkeypatch.setattr(newsimg, "_http_bytes", assert newsimg.fetch_and_cache(None) is None
lambda url, timeout=12: (b"<html>nope</html>", "text/html")) assert newsimg.fetch_and_cache("ftp://example.com/x.png") is None
assert newsimg.get_or_fetch("https://example.com/page.html") is None assert not list(cache.glob("*.webp"))
assert newsimg.get_or_fetch(None) is None
assert newsimg.get_or_fetch("ftp://example.com/x.png") is None # http(s) only (no SSRF surface)
def test_fetch_failure_returns_none(cache, monkeypatch): def test_svg_and_nonimage_rejected_and_negative_cached(cache, monkeypatch):
monkeypatch.setattr(newsimg, "_safe_fetch",
lambda url, timeout=12: (b"<svg xmlns='http://www.w3.org/2000/svg'></svg>" + b" " * 600,
"image/svg+xml"))
url = "https://example.com/evil.svg"
assert newsimg.fetch_and_cache(url) is None # decoded-raster only → SVG refused
assert not list(cache.glob("*.webp")) # nothing stored
assert newsimg._failed_recently(url) # negative-cached (won't refetch)
def test_private_host_refused_and_negative_cached(cache, monkeypatch):
monkeypatch.setattr(newsimg, "_safe_fetch", _REAL_SAFE_FETCH) # exercise real validation
monkeypatch.setattr(newsimg, "_host_is_public", lambda h: False)
url = "http://169.254.169.254/latest/meta-data/"
assert newsimg.fetch_and_cache(url) is None
assert newsimg._failed_recently(url) # SSRF target → permanent fail
def test_redirect_to_private_is_refused(cache, monkeypatch):
monkeypatch.setattr(newsimg, "_safe_fetch", _REAL_SAFE_FETCH) # exercise real redirect re-validation
hosts = {"good.example": True, "evil.internal": False}
monkeypatch.setattr(newsimg, "_host_is_public", lambda h: hosts.get(h, False))
class _Resp:
def __init__(self, status, headers, body=b""):
self.status, self.headers, self._b = status, headers, body
def read(self, n=-1):
return self._b
def close(self):
pass
class _Opener:
def open(self, req, timeout=None):
if "good.example" in req.full_url: # public → 302 to private
return _Resp(302, {"Location": "http://evil.internal/x.png"})
return _Resp(200, {"Content-Type": "image/png"}, _png()) # should never be reached
monkeypatch.setattr(newsimg.urllib.request, "build_opener", lambda *a: _Opener())
url = "http://good.example/p.png"
assert newsimg.fetch_and_cache(url) is None # redirect hop re-validated → refused
assert newsimg._failed_recently(url)
def test_transient_failure_is_not_negative_cached(cache, monkeypatch):
def boom(url, timeout=12): def boom(url, timeout=12):
raise OSError("source down") raise newsimg._FetchError("network down", permanent=False)
monkeypatch.setattr(newsimg, "_http_bytes", boom) monkeypatch.setattr(newsimg, "_safe_fetch", boom)
assert newsimg.get_or_fetch("https://example.com/x.jpg") is None url = "https://example.com/x.jpg"
assert newsimg.fetch_and_cache(url) is None
assert not newsimg._failed_recently(url) # transient → retried next cycle
def test_prune_evicts_least_recently_used_over_cap(cache, monkeypatch): def test_prune_evicts_least_recently_used_over_cap(cache):
monkeypatch.setattr(newsimg, "_http_bytes", lambda url, timeout=12: (_png(), "image/png")) paths = [newsimg.fetch_and_cache(f"https://example.com/{i}.png") for i in range(5)]
paths = [newsimg.get_or_fetch(f"https://example.com/{i}.png") for i in range(5)] for i, p in enumerate(paths): # 0 = oldest/LRU, 4 = newest
for i, p in enumerate(paths): # 0 = oldest/LRU, 4 = newest
os.utime(p, (1000 + i, 1000 + i)) os.utime(p, (1000 + i, 1000 + i))
sizes = [p.stat().st_size for p in paths] sizes = [p.stat().st_size for p in paths]
cap = sum(sizes) - sizes[0] - sizes[1] + 1 # room for the 3 newest only cap = sum(sizes) - sizes[0] - sizes[1] + 1 # room for the 3 newest only
r = newsimg.prune(cap) r = newsimg.prune(cap)
assert r["removed"] == 2 and r["after"] <= cap assert r["removed"] == 2 and r["after"] <= cap
assert not paths[0].exists() and not paths[1].exists() # the two oldest evicted assert not paths[0].exists() and not paths[1].exists()
assert paths[2].exists() and paths[4].exists() # newer kept assert paths[2].exists() and paths[4].exists()
def test_warm_caches_recent_accepted_with_image(cache, monkeypatch): def test_warm_skips_cached_and_failed(cache, monkeypatch):
monkeypatch.setattr(newsimg, "_http_bytes", lambda url, timeout=12: (_png(), "image/png"))
conn = connect(":memory:"); init_db(conn) conn = connect(":memory:"); init_db(conn)
conn.execute("INSERT INTO sources (id,name,feed_url) VALUES (1,'S','http://s/f')") conn.execute("INSERT INTO sources (id,name,feed_url) VALUES (1,'S','http://s/f')")
for aid, img in ((1, "https://x/1.jpg"), (2, "https://x/2.jpg"), (3, "")): for aid, img, acc, dup in ((1, "https://x/1.jpg", 1, None), (2, "https://x/2.jpg", 1, None),
conn.execute("INSERT INTO articles (id,source_id,canonical_url,title,url_hash,image_url) " (3, "https://x/3.jpg", 0, None), (4, "https://x/4.jpg", 1, 1)):
"VALUES (?,1,?,?,?,?)", (aid, f"http://s/{aid}", f"t{aid}", f"h{aid}", img)) conn.execute("INSERT INTO articles (id,source_id,canonical_url,title,url_hash,image_url,duplicate_of) "
conn.execute("INSERT INTO article_scores (article_id, accepted) VALUES (?,1)", (aid,)) "VALUES (?,1,?,?,?,?,?)", (aid, f"http://s/{aid}", f"t{aid}", f"h{aid}", img, dup))
conn.execute("INSERT INTO article_scores (article_id, accepted) VALUES (?,?)", (aid, acc))
conn.commit() conn.commit()
assert newsimg.warm(conn) == 2 # the two with an image assert newsimg.warm(conn) == 2 # only the 2 accepted, canonical, with image
assert newsimg.warm(conn) == 0 # idempotent (already cached) assert newsimg.warm(conn) == 0 # idempotent already cached
@pytest.fixture @pytest.fixture
def client(tmp_path, monkeypatch): def client(tmp_path, monkeypatch):
monkeypatch.setenv("GOODNEWS_DB", str(tmp_path / "t.sqlite3")) monkeypatch.setenv("GOODNEWS_DB", str(tmp_path / "t.sqlite3"))
monkeypatch.setenv("GOODNEWS_IMG_CACHE", str(tmp_path / "img_cache")) monkeypatch.setenv("GOODNEWS_IMG_CACHE", str(tmp_path / "img_cache"))
monkeypatch.setattr(newsimg, "_http_bytes", lambda url, timeout=12: (_png(), "image/png")) monkeypatch.setattr(newsimg, "_host_is_public", lambda h: True)
monkeypatch.setattr(newsimg, "_safe_fetch", lambda url, timeout=12: (_png(), "image/png"))
conn = connect(tmp_path / "t.sqlite3"); init_db(conn) conn = connect(tmp_path / "t.sqlite3"); init_db(conn)
conn.execute("INSERT INTO sources (id,name,feed_url) VALUES (1,'S','http://s/f')") conn.execute("INSERT INTO sources (id,name,feed_url) VALUES (1,'S','http://s/f')")
conn.execute("INSERT INTO articles (id,source_id,canonical_url,title,url_hash,image_url) " # 1 = accepted+canonical+image, 2 = no image, 3 = NOT accepted, 4 = duplicate
"VALUES (1,1,'http://s/1','t1','h1','https://x/pic.jpg')") rows = ((1, "https://x/pic.jpg", 1, None), (2, "", 1, None),
conn.execute("INSERT INTO articles (id,source_id,canonical_url,title,url_hash,image_url) " (3, "https://x/p3.jpg", 0, None), (4, "https://x/p4.jpg", 1, 1))
"VALUES (2,1,'http://s/2','t2','h2','')") # no image for aid, img, acc, dup in rows:
conn.commit(); conn.close() conn.execute("INSERT INTO articles (id,source_id,canonical_url,title,url_hash,image_url,duplicate_of) "
"VALUES (?,1,?,?,?,?,?)", (aid, f"http://s/{aid}", f"t{aid}", f"h{aid}", img, dup))
conn.execute("INSERT INTO article_scores (article_id, accepted) VALUES (?,?)", (aid, acc))
conn.commit()
# the cycle owns fetching — pre-warm so the endpoint has a hit to serve
newsimg.warm(conn)
conn.close()
from goodnews.api import create_app from goodnews.api import create_app
return TestClient(create_app()) return TestClient(create_app())
def test_img_endpoint_serves_and_allowlists(client): def test_img_endpoint_serves_cached_and_restricts(client):
r = client.get("/api/img/1") r = client.get("/api/img/1")
assert r.status_code == 200 and r.headers["content-type"] == "image/webp" assert r.status_code == 200 and r.headers["content-type"] == "image/webp"
assert "immutable" in r.headers.get("cache-control", "") assert "immutable" not in r.headers.get("cache-control", "") # no 1-year immutable
assert client.get("/api/img/2").status_code == 404 # article has no image assert client.get("/api/img/2").status_code == 404 # no image
assert client.get("/api/img/999").status_code == 404 # unknown id (not in corpus) assert client.get("/api/img/3").status_code == 404 # not accepted
assert client.get("/api/img/4").status_code == 404 # duplicate (non-canonical)
assert client.get("/api/img/999").status_code == 404 # unknown id
def test_img_endpoint_never_fetches(client, monkeypatch):
# An accepted article whose image was never warmed must 404, not trigger a fetch.
called = {"n": 0}
monkeypatch.setattr(newsimg, "fetch_and_cache", lambda *a, **k: called.__setitem__("n", called["n"] + 1))
# /api/img/1 is cached (warmed in fixture) → still served without any fetch
assert client.get("/api/img/1").status_code == 200
assert called["n"] == 0
+12
View File
@@ -105,3 +105,15 @@ def test_complete_page_is_cached_and_served_from_cache(app_complete):
assert 1 in api._SHARE_CACHE # finished page cached assert 1 in api._SHARE_CACHE # finished page cached
r2 = tc.get("/a/1") # second hit served from cache r2 = tc.get("/a/1") # second hit served from cache
assert r2.status_code == 200 and r2.text == r1.text assert r2.status_code == 200 and r2.text == r1.text
def test_share_chrome_parity_and_fallbacks():
"""HubBar-replica parity Codex asked for: Back only follows a SAME-ORIGIN referrer,
the menu honors Escape, and a failed image removes itself (clean text unfurl)."""
from goodnews import share
html = share.render_share_page(
{"id": 7, "title": "T", "image_url": "https://x/p.jpg", "canonical_url": "https://s/x",
"source_name": "S"}, "https://upbeatbytes.com")
assert "origin===location.origin" in html # Back requires same-origin referrer
assert "'Escape'" in html or "Escape" in html # menu closes on Escape (parity)
assert 'onerror="this.remove()"' in html # broken image degrades to text unfurl