[00:02:19] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1231057
[00:02:22] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1231058
[00:02:25] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1231059
[00:06:15] !log setting batphone for SRE escalations
[00:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:34] (CR) Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1231058 (owner: Ncmonitor)
[00:35:55] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q2:install (1) SSD each into franio200[1-3] - https://phabricator.wikimedia.org/T405982#11550681 (Dwisehaupt) @Jhancock.wm Could we do this some time the week of 1/26? Any time that works for you would be ok, just let us know and we'll keep an eye. So...
[00:40:43] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1231071
[00:40:43] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1231071 (owner: TrainBranchBot)
[00:51:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87899 and previous config saved to /var/cache/conftool/dbconfig/20260124-005145-marostegui.json
[00:51:52] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[00:51:53] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[00:52:52] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1231071 (owner: TrainBranchBot)
[01:01:06] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:01:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P87900 and previous config saved to /var/cache/conftool/dbconfig/20260124-010153-marostegui.json
[01:05:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:11:07] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1231090
[01:11:07] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1231090 (owner: TrainBranchBot)
[01:12:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P87901 and previous config saved to /var/cache/conftool/dbconfig/20260124-011201-marostegui.json
[01:12:08] !log `!log` messages from the `#wikimedia-fundraising` IRC channel now log to https://wikitech.wikimedia.org/wiki/Fundraising/SAL and https://sal.toolforge.org/fundraising (T415389)
[01:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:13] T415389: Create a Fundraising specific SAL - https://phabricator.wikimedia.org/T415389
[01:14:08] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 01s)
[01:18:58] (CR) RLazarus: [C:-1] "That's https://gitlab.wikimedia.org/repos/sre/sophroid/-/merge_requests/9, which this Depends-On in spirit." [deployment-charts] - https://gerrit.wikimedia.org/r/1230544 (owner: RLazarus)
[01:22:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87902 and previous config saved to /var/cache/conftool/dbconfig/20260124-012210-marostegui.json
[01:22:17] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[01:22:18] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[01:22:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[01:22:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1178 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87903 and previous config saved to /var/cache/conftool/dbconfig/20260124-012234-marostegui.json
[01:29:40] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:35:24] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1231090 (owner: TrainBranchBot)
[02:32:04] (CR) Scott French: [C:+1] "Thanks for checking! Sophroid change looks good." [deployment-charts] - https://gerrit.wikimedia.org/r/1230544 (owner: RLazarus)
[03:12:17] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:05:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:09:13] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:29:40] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:13] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:09] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:31:53] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11550863 (Tacsipacsi)
[06:31:55] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11550865 (Tacsipacsi) Couldn’t the non-standard URLs redirect to the standard ones?
So if I write `lang=html
FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:12:32] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:34:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:52:11] SRE, SRE-swift-storage, Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#11551235 (TheDJ) Stalled→Resolved a:TheDJ @MatthewVernon can this be closed?
[15:52:33] SRE, SRE-swift-storage, Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#11551238 (TheDJ) Resolved→Open a:TheDJ→None
[16:12:19] SRE, Traffic: TLS 1.2 on Wikimedia DNS not working - https://phabricator.wikimedia.org/T415449 (Naruse_shiroha) NEW
[16:13:05] SRE, Traffic: TLS 1.2 on Wikimedia DNS not working - https://phabricator.wikimedia.org/T415449#11551260 (Naruse_shiroha)
[16:21:11] SRE, Traffic: TLS 1.2 on Wikimedia DNS not working - https://phabricator.wikimedia.org/T415449#11551263 (Naruse_shiroha)
[16:22:27] SRE, Traffic: TLS 1.2 on Wikimedia DNS DoH resolver not working - https://phabricator.wikimedia.org/T415449#11551266 (Naruse_shiroha)
[16:29:43] SRE, Traffic: TLS 1.2 on Wikimedia DNS DoH resolver not working - https://phabricator.wikimedia.org/T415449#11551267 (Naruse_shiroha) From this file, the incapability of TLS 1.2 on DoH resolver seems to be intended. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/...
[17:29:40] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:12:32] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:18:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[19:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:34:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:38:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[19:47:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87904 and previous config saved to /var/cache/conftool/dbconfig/20260124-200358-marostegui.json
[20:04:05] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[20:04:05] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[20:05:38] SRE, Traffic, Documentation: TLS 1.2 on Wikimedia DNS DoH resolver not working - https://phabricator.wikimedia.org/T415449#11551386 (Reedy) {197af23bffc3b5f6272d3885d830709d1a59af57} ` wikidough: set TLSv1.2 as the minimum version for DoT In the current version of dnsdist's TLS configuration, the m...
[20:14:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P87905 and previous config saved to /var/cache/conftool/dbconfig/20260124-201406-marostegui.json
[20:24:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P87906 and previous config saved to /var/cache/conftool/dbconfig/20260124-202414-marostegui.json
[20:34:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87907 and previous config saved to /var/cache/conftool/dbconfig/20260124-203423-marostegui.json
[20:34:30] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[20:34:30] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[20:34:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[20:34:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2166 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87908 and previous config saved to /var/cache/conftool/dbconfig/20260124-203447-marostegui.json
[21:08:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:13:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:29:40] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:39:22] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1230498 (https://phabricator.wikimedia.org/T414992) (owner: Seawolf35gerrit)
[23:12:32] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:32:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87909 and previous config saved to /var/cache/conftool/dbconfig/20260124-233255-marostegui.json
[23:33:02] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[23:33:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[23:34:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:43:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P87910 and previous config saved to /var/cache/conftool/dbconfig/20260124-234304-marostegui.json
[23:47:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:53:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P87911 and previous config saved to /var/cache/conftool/dbconfig/20260124-235312-marostegui.json