news: harden paywall exclusion at the candidate query + add the missing regressions
Codex's two non-blocking hardening items, folded in before cutover:

- _candidate_articles() now excludes paywalled sources IN-QUERY (before
  LIMIT 50), so flagged stories can't consume candidate slots and leave a
  full brief thin. Dropped the now-redundant post-fetch filter in
  build_daily_brief.
- Regressions: history retains a viewed paywalled article; sitemap omits a
  paywalled source AND restores it under override="free".
- Aligned test_brief_paywall to the source-level model (paywalled sources
  carry a paywalled homepage, as in production); it had relied on
  article-URL detection.

425 backend tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
+9 -5
@@ -4,6 +4,7 @@ import sqlite3
 
 from .localtime import local_today
 from .paywall import is_paywalled, is_paywalled_for_source
+from .queries import paywalled_source_ids
 
 
 def build_daily_brief(
@@ -17,9 +18,8 @@ def build_daily_brief(
 
     # Compose the selection first so we can tell whether anything actually changed.
     # A calm daily brief never hands the reader a locked door: paywalled-source
-    # candidates are excluded outright (no unreadable news), not just demoted.
+    # candidates are excluded in _candidate_articles (before LIMIT) - no unreadable news.
     rows = _candidate_articles(conn, target_date, window_days)
-    rows = [r for r in rows if not is_paywalled_for_source(r["canonical_url"], r["paywall_override"])]
     selected = _select_diverse(rows, limit)
     selected_ids = [row["id"] for row in selected]
 
@@ -107,10 +107,13 @@ def _candidate_articles(
     `window_days` so the brief still fills on slow news days. Anything already
     featured in a brief within the last 7 days (other than this same date, which
     is being rebuilt) is excluded so backfilled stories cannot linger across
-    consecutive days.
+    consecutive days. Paywalled sources are excluded here (before LIMIT) so they
+    can't consume candidate slots and leave an otherwise-full brief thin.
     """
+    pwx = paywalled_source_ids(conn)
+    pw_clause = f"AND a.source_id NOT IN ({','.join('?' * len(pwx))})" if pwx else ""
     return conn.execute(
-        """
+        f"""
         SELECT
             a.id,
             a.title,
@@ -152,6 +155,7 @@ def _candidate_articles(
                 AND b.brief_date <= date(?)
                 AND b.brief_date > date(?, '-7 days')
           )
+          {pw_clause}
         ORDER BY
             is_today DESC,
             (s.constructive_score + s.agency_score + s.human_benefit_score + src.trust_score
@@ -159,7 +163,7 @@ def _candidate_articles(
             COALESCE(a.published_at, a.discovered_at) DESC
         LIMIT 50
         """,
-        (target_date, target_date, target_date, window_days, target_date, target_date, target_date),
+        (target_date, target_date, target_date, window_days, target_date, target_date, target_date, *pwx),
     ).fetchall()
 
 
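The difference between filtering after the fetch and filtering inside the query can be seen in a small standalone sketch. It is not the project's code: the `articles` table, the `not_in_clause` and `top_candidates` helpers, and the sample scores are all hypothetical, and the real query joins more tables. It only assumes, as the diff does, that the excluded IDs are bound as parameters and that only the `?` placeholders are interpolated into the SQL text.

```python
import sqlite3


def not_in_clause(column, ids):
    """Build an optional 'AND <column> NOT IN (?, ...)' fragment (hypothetical helper).

    Only the placeholders are interpolated; the IDs travel as bound parameters.
    An empty list yields an empty fragment, since 'NOT IN ()' is best avoided.
    """
    if not ids:
        return "", ()
    placeholders = ",".join("?" * len(ids))
    return f"AND {column} NOT IN ({placeholders})", tuple(ids)


def top_candidates(conn, excluded_source_ids, limit):
    """Return the highest-scoring rows, skipping excluded sources before LIMIT."""
    clause, params = not_in_clause("source_id", excluded_source_ids)
    sql = f"""
        SELECT id, source_id, score
        FROM articles
        WHERE 1 = 1
        {clause}
        ORDER BY score DESC
        LIMIT ?
    """
    return conn.execute(sql, (*params, limit)).fetchall()


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, source_id INTEGER, score REAL)"
)
# Source 1 plays the paywalled source and holds the three highest scores.
conn.executemany(
    "INSERT INTO articles (id, source_id, score) VALUES (?, ?, ?)",
    [(1, 1, 9.0), (2, 1, 8.0), (3, 1, 7.0), (4, 2, 6.0), (5, 3, 5.0), (6, 2, 4.0)],
)
paywalled = [1]

# Post-fetch filter: LIMIT runs first, so paywalled rows take every slot.
post = [r for r in top_candidates(conn, [], 3) if r["source_id"] not in paywalled]

# In-query filter: only readable rows compete for the LIMIT.
pre = top_candidates(conn, paywalled, 3)

print(len(post), [r["id"] for r in pre])  # prints: 0 [4, 5, 6]
```

With the post-fetch filter the three slots are spent on rows that are then discarded, which is the thin-brief case the commit message describes. One caveat for this pattern: SQLite caps the number of bound parameters per statement, so a very long exclusion list might need a subquery or temporary table instead.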