news: harden paywall exclusion at the candidate query + add the missing regressions

Codex's two non-blocking hardening items, folded in before cutover:
- _candidate_articles() now excludes paywalled sources IN-QUERY (before LIMIT 50),
  so flagged stories can't consume candidate slots and leave a full brief thin.
  Dropped the now-redundant post-fetch filter in build_daily_brief.
- Regressions: history retains a viewed paywalled article; sitemap omits a
  paywalled source AND restores it under override="free".
- Aligned test_brief_paywall to the source-level model (paywalled sources carry a
  paywalled homepage, as in production); it previously relied on article-URL detection.

425 backend tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jay
2026-06-28 18:54:53 -04:00
parent c600145ba5
commit 1c1ecefde8
4 changed files with 50 additions and 9 deletions
+9 -5
@@ -4,6 +4,7 @@ import sqlite3
from .localtime import local_today
from .paywall import is_paywalled, is_paywalled_for_source
+from .queries import paywalled_source_ids
def build_daily_brief(
@@ -17,9 +18,8 @@ def build_daily_brief(
# Compose the selection first so we can tell whether anything actually changed.
# A calm daily brief never hands the reader a locked door: paywalled-source
-# candidates are excluded outright (no unreadable news), not just demoted.
+# candidates are excluded in _candidate_articles (before LIMIT) — no unreadable news.
rows = _candidate_articles(conn, target_date, window_days)
-rows = [r for r in rows if not is_paywalled_for_source(r["canonical_url"], r["paywall_override"])]
selected = _select_diverse(rows, limit)
selected_ids = [row["id"] for row in selected]
@@ -107,10 +107,13 @@ def _candidate_articles(
`window_days` so the brief still fills on slow news days. Anything already
featured in a brief within the last 7 days (other than this same date, which
is being rebuilt) is excluded so backfilled stories cannot linger across
-consecutive days.
+consecutive days. Paywalled sources are excluded here (before LIMIT) so they
+can't consume candidate slots and leave an otherwise-full brief thin.
"""
+pwx = paywalled_source_ids(conn)
+pw_clause = f"AND a.source_id NOT IN ({','.join('?' * len(pwx))})" if pwx else ""
return conn.execute(
-"""
+f"""
SELECT
a.id,
a.title,
@@ -152,6 +155,7 @@ def _candidate_articles(
AND b.brief_date <= date(?)
AND b.brief_date > date(?, '-7 days')
)
+{pw_clause}
ORDER BY
is_today DESC,
(s.constructive_score + s.agency_score + s.human_benefit_score + src.trust_score
@@ -159,7 +163,7 @@ def _candidate_articles(
COALESCE(a.published_at, a.discovered_at) DESC
LIMIT 50
""",
-(target_date, target_date, target_date, window_days, target_date, target_date, target_date),
+(target_date, target_date, target_date, window_days, target_date, target_date, target_date, *pwx),
).fetchall()
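
Illustration (not part of the commit): a minimal, self-contained sketch of why the exclusion has to happen before LIMIT. The `articles` table, its columns, and the helpers `fetch_candidates` and `demo` are hypothetical stand-ins, not the project's real schema or code. Only the shape of the technique (one `?` per excluded source id, clause omitted when the list is empty) mirrors the diff above.

```python
import sqlite3


def fetch_candidates(conn, blocked_ids, limit):
    """Return up to `limit` article ids, skipping blocked sources inside the query."""
    # One "?" per blocked id. The clause is dropped when the list is empty,
    # since an empty "NOT IN ()" is not portable SQL.
    clause = ""
    if blocked_ids:
        marks = ",".join("?" * len(blocked_ids))
        clause = f"WHERE source_id NOT IN ({marks})"
    sql = f"SELECT id FROM articles {clause} ORDER BY score DESC LIMIT ?"
    return [row[0] for row in conn.execute(sql, (*blocked_ids, limit))]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, source_id INTEGER, score REAL)"
    )
    # Hypothetical data: source 1 is paywalled and holds the three top scores.
    rows = [(1, 1, 9.0), (2, 1, 8.0), (3, 1, 7.0),
            (4, 2, 6.0), (5, 2, 5.0), (6, 2, 4.0)]
    conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", rows)
    blocked = [1]

    # Old shape: LIMIT first, filter afterwards. Paywalled rows use up the slots.
    top = conn.execute(
        "SELECT id, source_id FROM articles ORDER BY score DESC LIMIT 3"
    ).fetchall()
    post_filtered = [aid for aid, sid in top if sid not in blocked]

    # New shape: filter in the query, so LIMIT only counts readable rows.
    in_query = fetch_candidates(conn, blocked, 3)
    conn.close()
    return post_filtered, in_query
```

With this data the post-fetch filter leaves nothing to select from, while the in-query filter still returns three readable candidates. Only the placeholder string is interpolated into the SQL text; the ids themselves stay bound parameters.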