[00:04:10] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:26:57] PROBLEM - Host an-worker1224 is DOWN: PING CRITICAL - Packet loss = 80%, RTA = 6925.25 ms [00:27:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:27:49] RECOVERY - Host an-worker1224 is UP: PING WARNING - Packet loss = 33%, RTA = 1433.64 ms [00:32:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:40:00] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1225195 [00:40:00] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1225195 (owner: TrainBranchBot) [00:40:54] (PS1) DDesouza: miscweb(design-strategy): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1225197 (https://phabricator.wikimedia.org/T344471) [00:45:10] (CR) DDesouza: [C:+2] miscweb(design-strategy): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1225197 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza) [00:47:39] (Merged) jenkins-bot: miscweb(design-strategy): bump version [deployment-charts] - 
https://gerrit.wikimedia.org/r/1225197 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza) [00:49:26] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [00:49:44] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:49:46] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [00:50:08] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [00:50:10] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [00:50:30] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [00:53:24] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1225195 (owner: TrainBranchBot) [01:09:56] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1225205 [01:09:56] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1225205 (owner: TrainBranchBot) [01:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:35:32] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1225205 (owner: TrainBranchBot) [02:46:21] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:53:19] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:19:10] FIRING: [6x] ProbeDown: Service wdqs1024:443 has failed probes (http_wdqs_scholarly_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:21:05] PROBLEM - Host doh7004 is DOWN: CRITICAL - Time to live exceeded (195.200.68.101) [03:21:05] PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98) [03:21:29] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 138.10 ms [03:21:31] RECOVERY - Host doh7004 is UP: PING OK - Packet loss = 0%, RTA = 138.05 ms [03:44:37] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale-full only: 1 (gitlab1004), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [03:52:44] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:56:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87239 and previous config saved to /var/cache/conftool/dbconfig/20260112-035635-marostegui.json [03:56:41] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [03:56:41] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:04:10] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire 
[04:06:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P87240 and previous config saved to /var/cache/conftool/dbconfig/20260112-040643-marostegui.json [04:16:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P87241 and previous config saved to /var/cache/conftool/dbconfig/20260112-041652-marostegui.json [04:27:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87242 and previous config saved to /var/cache/conftool/dbconfig/20260112-042700-marostegui.json [04:27:06] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:27:07] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:27:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2239.codfw.wmnet with reason: Maintenance [04:36:21] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:37:17] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:46:21] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:57:15] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:57:15] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:08:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87243 and previous config saved to /var/cache/conftool/dbconfig/20260112-050821-marostegui.json [05:08:27] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:08:27] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:18:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P87244 and previous config saved to /var/cache/conftool/dbconfig/20260112-051829-marostegui.json [05:22:07] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55565 bytes in 2.314 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:22:07] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 2.484 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:22:15] RECOVERY - mailman list info ssl expiry on 
lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:26:21] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:28:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P87245 and previous config saved to /var/cache/conftool/dbconfig/20260112-052838-marostegui.json [05:29:15] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:34:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:21] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:35:19] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:38:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87246 and previous config saved to /var/cache/conftool/dbconfig/20260112-053846-marostegui.json [05:38:52] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:38:52] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:39:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [05:51:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1236.eqiad.wmnet with reason: Maintenance [05:51:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2218.codfw.wmnet with reason: Maintenance [05:55:42] !log Disable GTID on db1195 for testing T315642 [05:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:46] T315642: Monitor GTID status - https://phabricator.wikimedia.org/T315642 [05:57:51] (PS1) Marostegui: instances.yaml: Add db2249 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1225274 (https://phabricator.wikimedia.org/T407941) [05:58:38] (CR) Marostegui: [C:+2] instances.yaml: Add db2249 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1225274 (https://phabricator.wikimedia.org/T407941) (owner: Marostegui) [06:01:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2249 to dbctl T407941', diff saved to https://phabricator.wikimedia.org/P87247 and previous config saved to /var/cache/conftool/dbconfig/20260112-060128-marostegui.json [06:01:32] T407941: Productionize x1 expansion hosts - https://phabricator.wikimedia.org/T407941 [06:03:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2204.codfw.wmnet with reason: Maintenance [06:12:47] SRE, Traffic: Move contact info detection at the edge to a lua module - https://phabricator.wikimedia.org/T414300 (Joe) NEW [06:13:08] SRE, Traffic: Move contact info detection at the edge to a lua module - https://phabricator.wikimedia.org/T414300#11510940 (Joe) p:Triage→High a:Joe [06:14:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2205.codfw.wmnet with reason: Maintenance [06:14:57] (PS1) Marostegui: db2249: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1225281 (https://phabricator.wikimedia.org/T407941) [06:15:58] (CR) Marostegui: [C:+2] db2249: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1225281 (https://phabricator.wikimedia.org/T407941) (owner: Marostegui) [06:17:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: Maintenance [06:17:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:18:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1167 (T413525)', diff saved to https://phabricator.wikimedia.org/P87248 and previous config saved to /var/cache/conftool/dbconfig/20260112-061800-marostegui.json [06:18:04] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [06:19:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2249: Pooling for the first time in x1 T407941 [06:19:59] T407941: Productionize x1 expansion hosts - https://phabricator.wikimedia.org/T407941 [06:20:13] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.newpool (exit_code=97) pool db2249: Pooling for the first time in x1 T407941 
[06:23:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T413525)', diff saved to https://phabricator.wikimedia.org/P87249 and previous config saved to /var/cache/conftool/dbconfig/20260112-062300-marostegui.json [06:23:05] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [06:23:31] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2249: Pooling for the first time in x1 T407941 [06:23:41] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.newpool (exit_code=97) pool db2249: Pooling for the first time in x1 T407941 [06:24:29] (CR) Marostegui: "I was trying to pool a host using this cookbook but I got: https://phabricator.wikimedia.org/P87250" [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto) [06:24:42] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2249.codfw.wmnet: Pooling for the first time in x1 T407941 [06:24:43] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.newpool (exit_code=97) pool db2249.codfw.wmnet: Pooling for the first time in x1 T407941 [06:33:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P87251 and previous config saved to /var/cache/conftool/dbconfig/20260112-063309-marostegui.json [06:42:30] (PS1) ZhaoFJx: zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) [06:43:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P87252 and previous config saved to /var/cache/conftool/dbconfig/20260112-064317-marostegui.json [06:47:12] SRE, Infrastructure-Foundations, netops: rancid: message has lines too long for transport - 
https://phabricator.wikimedia.org/T410606#11510983 (ayounsi) Even more odd is that it flaps, even when no change is being done on the device. One mail it will remove it, another mail it will re-add it. Nothing... [06:53:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T413525)', diff saved to https://phabricator.wikimedia.org/P87253 and previous config saved to /var/cache/conftool/dbconfig/20260112-065325-marostegui.json [06:53:29] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [06:53:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:56:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1172.eqiad.wmnet with reason: Maintenance [06:56:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1172 (T413525)', diff saved to https://phabricator.wikimedia.org/P87254 and previous config saved to /var/cache/conftool/dbconfig/20260112-065646-marostegui.json [06:58:34] !log push pfw policy - T414116 [06:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T413525)', diff saved to https://phabricator.wikimedia.org/P87255 and previous config saved to /var/cache/conftool/dbconfig/20260112-070153-marostegui.json [07:01:57] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:05:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: ZhaoFJx) [07:07:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on db2152.codfw.wmnet with reason: Maintenance [07:07:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2152 (T413525)', diff saved to https://phabricator.wikimedia.org/P87256 and previous config saved to /var/cache/conftool/dbconfig/20260112-070724-marostegui.json [07:07:28] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:07:54] (CR) Arnaudb: [C:+2] mailman: add UpstreamTlsContext on tlsproxy::envoy [puppet] - https://gerrit.wikimedia.org/r/1219770 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb) [07:12:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P87257 and previous config saved to /var/cache/conftool/dbconfig/20260112-071201-marostegui.json [07:12:30] (CR) Muehlenhoff: [C:+2] Remove dead Hiera file [puppet] - https://gerrit.wikimedia.org/r/1224891 (owner: Muehlenhoff) [07:12:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T413525)', diff saved to https://phabricator.wikimedia.org/P87258 and previous config saved to /var/cache/conftool/dbconfig/20260112-071235-marostegui.json [07:12:39] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:12:47] (CR) Stang: [C:+1] zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: ZhaoFJx) [07:17:17] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219872 (owner: Muehlenhoff) [07:19:11] FIRING: [6x] ProbeDown: Service wdqs1024:443 has failed probes (http_wdqs_scholarly_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:20:33] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11511021 (ABran-WMF) [07:22:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P87259 and previous config saved to /var/cache/conftool/dbconfig/20260112-072209-marostegui.json [07:22:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P87260 and previous config saved to /var/cache/conftool/dbconfig/20260112-072243-marostegui.json [07:23:28] (PS3) Arnaudb: mailman: add lists to service catalog [puppet] - https://gerrit.wikimedia.org/r/1219151 (https://phabricator.wikimedia.org/T286066) [07:23:28] (CR) Arnaudb: "this change adds lists.wm.o to service catalog, no private IP have been used here. 
We'll keep using the existing public one to send emails" [puppet] - https://gerrit.wikimedia.org/r/1219151 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb) [07:23:53] (PS3) Arnaudb: mailman: update lists.wm.o backend mapping [puppet] - https://gerrit.wikimedia.org/r/1219062 (https://phabricator.wikimedia.org/T286066) [07:23:53] (CR) Arnaudb: "similar to 1219151" [puppet] - https://gerrit.wikimedia.org/r/1219062 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb) [07:24:55] (CR) Arnaudb: mailman: update lists.wm.o backend mapping (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1219062 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb) [07:31:21] SRE, Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11511027 (MoritzMuehlenhoff) [07:32:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T413525)', diff saved to https://phabricator.wikimedia.org/P87261 and previous config saved to /var/cache/conftool/dbconfig/20260112-073218-marostegui.json [07:32:22] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:32:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: Maintenance [07:32:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1177 (T413525)', diff saved to https://phabricator.wikimedia.org/P87262 and previous config saved to /var/cache/conftool/dbconfig/20260112-073242-marostegui.json [07:32:51] !log updated trixie installer image to 13.3 T414179 [07:32:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P87263 and previous config saved to /var/cache/conftool/dbconfig/20260112-073251-marostegui.json [07:32:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:54] T414179: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179 [07:37:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T413525)', diff saved to https://phabricator.wikimedia.org/P87264 and previous config saved to /var/cache/conftool/dbconfig/20260112-073748-marostegui.json [07:37:52] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:42:11] (CR) Muehlenhoff: [C:+2] KDC: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1219872 (owner: Muehlenhoff) [07:43:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T413525)', diff saved to https://phabricator.wikimedia.org/P87265 and previous config saved to /var/cache/conftool/dbconfig/20260112-074300-marostegui.json [07:43:04] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:43:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2154.codfw.wmnet with reason: Maintenance [07:43:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:43:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2154 (T413525)', diff saved to https://phabricator.wikimedia.org/P87266 and previous config saved to /var/cache/conftool/dbconfig/20260112-074325-marostegui.json [07:47:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P87267 and previous config saved to /var/cache/conftool/dbconfig/20260112-074756-marostegui.json [07:48:23] !log marostegui@cumin1003 dbctl 
commit (dc=all): 'Repooling after maintenance db2154 (T413525)', diff saved to https://phabricator.wikimedia.org/P87268 and previous config saved to /var/cache/conftool/dbconfig/20260112-074822-marostegui.json [07:48:26] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:55:48] SRE, Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11511082 (MoritzMuehlenhoff) [07:58:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P87269 and previous config saved to /var/cache/conftool/dbconfig/20260112-075805-marostegui.json [07:58:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P87270 and previous config saved to /var/cache/conftool/dbconfig/20260112-075830-marostegui.json [08:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T0800) [08:00:05] sfaci and Nvdtn19: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:25] o/ [08:04:10] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:08:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T413525)', diff saved to https://phabricator.wikimedia.org/P87271 and previous config saved to /var/cache/conftool/dbconfig/20260112-080813-marostegui.json [08:08:17] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:08:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: Maintenance [08:08:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1178 (T413525)', diff saved to https://phabricator.wikimedia.org/P87272 and previous config saved to /var/cache/conftool/dbconfig/20260112-080827-marostegui.json [08:08:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P87273 and previous config saved to /var/cache/conftool/dbconfig/20260112-080839-marostegui.json [08:09:18] SRE, Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11511100 (MoritzMuehlenhoff) [08:13:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T413525)', diff saved to https://phabricator.wikimedia.org/P87274 and previous config saved to /var/cache/conftool/dbconfig/20260112-081336-marostegui.json [08:13:40] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:18:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T413525)', diff saved to https://phabricator.wikimedia.org/P87275 and 
previous config saved to /var/cache/conftool/dbconfig/20260112-081847-marostegui.json [08:18:51] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:19:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2163.codfw.wmnet with reason: Maintenance [08:19:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2163 (T413525)', diff saved to https://phabricator.wikimedia.org/P87276 and previous config saved to /var/cache/conftool/dbconfig/20260112-081912-marostegui.json [08:21:10] sfaci: sorry I just woke up. Can you self-serve? [08:21:45] Amir1: No problem! No, I can't. I need someone who is able to deploy [08:22:19] the change is scary. Give me a bit [08:22:28] so I can get more context [08:23:29] ok! no rush [08:23:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P87277 and previous config saved to /var/cache/conftool/dbconfig/20260112-082344-marostegui.json [08:24:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T413525)', diff saved to https://phabricator.wikimedia.org/P87278 and previous config saved to /var/cache/conftool/dbconfig/20260112-082409-marostegui.json [08:24:13] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:24:34] SRE, LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11511112 (kimpham) Just an update that I received and signed the NDA from @KFrancis [08:27:38] sfaci: I'd say let's give it a couple hours before enabling it on test wiki since testwiki is actually in production [08:27:49] but otherwise let me deploy it [08:28:43] do you mean a couple of hours between enabling in beta and in testwiki? 
[08:33:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P87279 and previous config saved to /var/cache/conftool/dbconfig/20260112-083353-marostegui.json [08:34:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P87280 and previous config saved to /var/cache/conftool/dbconfig/20260112-083418-marostegui.json [08:42:44] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11511124 (ayounsi) > Is sretest2003 the only one that shows this behavior, or do we have others? I am particularly interested in if you were able to set the... [08:43:01] (PS1) Elukey: docker_registry: set backend redirects for the various storages [puppet] - https://gerrit.wikimedia.org/r/1225466 (https://phabricator.wikimedia.org/T394476) [08:43:50] (CR) DCausse: "For context, the code requiring this is:" [deployment-charts] - https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: DCausse) [08:44:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T413525)', diff saved to https://phabricator.wikimedia.org/P87281 and previous config saved to /var/cache/conftool/dbconfig/20260112-084401-marostegui.json [08:44:06] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:44:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: Maintenance [08:44:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1192 (T413525)', diff saved to https://phabricator.wikimedia.org/P87282 and previous config saved to /var/cache/conftool/dbconfig/20260112-084425-marostegui.json [08:44:26] !log marostegui@cumin1003 dbctl 
commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P87283 and previous config saved to /var/cache/conftool/dbconfig/20260112-084426-marostegui.json [08:44:55] (03PS2) 10Elukey: docker_registry: set backend redirects for the various storages [puppet] - 10https://gerrit.wikimedia.org/r/1225466 (https://phabricator.wikimedia.org/T394476) [08:45:22] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1225466 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [08:47:40] 06SRE, 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11511134 (10JMeybohm) >>! In T414187#11508155, @trueg wrote: > I am sorry, I do not know what this means: "Grafana access is granted by having an LDAP account." > Is the LDAP accou... [08:49:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T413525)', diff saved to https://phabricator.wikimedia.org/P87284 and previous config saved to /var/cache/conftool/dbconfig/20260112-084926-marostegui.json [08:49:30] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:53:39] (03CR) 10Elukey: docker_registry: set backend redirects for the various storages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1225466 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [08:54:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T413525)', diff saved to https://phabricator.wikimedia.org/P87285 and previous config saved to /var/cache/conftool/dbconfig/20260112-085434-marostegui.json [08:54:38] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:54:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2164.codfw.wmnet with reason: Maintenance [08:55:00] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2164 (T413525)', diff saved to https://phabricator.wikimedia.org/P87286 and previous config saved to /var/cache/conftool/dbconfig/20260112-085459-marostegui.json [08:56:21] (03PS1) 10Bartosz Wójtowicz: ml-services: Decrease minReplicas for revertrisk-multilingual to 3. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225474 (https://phabricator.wikimedia.org/T411786) [08:57:09] (03CR) 10Dpogorzelski: [C:03+1] ml-services: Decrease minReplicas for revertrisk-multilingual to 3. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225474 (https://phabricator.wikimedia.org/T411786) (owner: 10Bartosz Wójtowicz) [08:58:54] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Decrease minReplicas for revertrisk-multilingual to 3. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225474 (https://phabricator.wikimedia.org/T411786) (owner: 10Bartosz Wójtowicz) [08:58:56] (03CR) 10Muehlenhoff: [C:03+2] Rename stale_certs_exporter and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:59:23] !log eqiad pfw - remove old LVS BGP config (replaced by bird) - T414015 [08:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:27] T414015: Remove pfw configuration related to former pybal/LVS service - https://phabricator.wikimedia.org/T414015 [08:59:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P87287 and previous config saved to /var/cache/conftool/dbconfig/20260112-085934-marostegui.json [09:00:46] (03Merged) 10jenkins-bot: ml-services: Decrease minReplicas for revertrisk-multilingual to 3. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1225474 (https://phabricator.wikimedia.org/T411786) (owner: 10Bartosz Wójtowicz) [09:01:13] (03CR) 10Muehlenhoff: [C:03+2] Rename puppetmaster::gitsync and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224689 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:01:25] !log bwojtowicz@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:01:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T413525)', diff saved to https://phabricator.wikimedia.org/P87288 and previous config saved to /var/cache/conftool/dbconfig/20260112-090158-marostegui.json [09:02:02] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:04:22] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:06:35] (03CR) 10Ladsgroup: [C:03+1] Stop updating Deadendpages and Lonelypages on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225118 (https://phabricator.wikimedia.org/T371662) (owner: 10Zabe) [09:07:18] (03CR) 10Ladsgroup: [C:03+1] "Spicy idea: Can we just delete the code altogether? I'm not aware of any third party use of the extension." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225088 (https://phabricator.wikimedia.org/T414202) (owner: 10Zabe) [09:07:20] (03CR) 10JMeybohm: [C:03+1] "Nice find! 
Change lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1225466 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [09:07:47] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Karai-Karai | https://iso639-3.sil.org/code/kai" [dns] - 10https://gerrit.wikimedia.org/r/1225036 (https://phabricator.wikimedia.org/T414234) (owner: 10Gerrit maintenance bot) [09:09:20] (03CR) 10Dzahn: [V:03+1 C:03+2] Add kai to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1225036 (https://phabricator.wikimedia.org/T414234) (owner: 10Gerrit maintenance bot) [09:09:27] !log dzahn@dns1004 START - running authdns-update [09:09:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P87289 and previous config saved to /var/cache/conftool/dbconfig/20260112-090942-marostegui.json [09:10:11] !log DNS - adding new language code 'kai' - https://en.wikipedia.org/wiki/Karai-karai [09:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:55] !log dzahn@dns1004 END - running authdns-update [09:12:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P87290 and previous config saved to /var/cache/conftool/dbconfig/20260112-091206-marostegui.json [09:13:46] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [09:13:56] RECOVERY - Host titan1002 is UP: PING WARNING - Packet loss = 0%, RTA = 780.87 ms [09:16:20] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3136.60 ms [09:18:32] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [09:18:54] (03CR) 10Vgutierrez: "looks good overall, please check inline comments" [puppet] - 10https://gerrit.wikimedia.org/r/1224897 (https://phabricator.wikimedia.org/T414111) (owner: 10Slyngshede) [09:19:10] FIRING: [2x] ProbeDown: 
Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:19:10] FIRING: JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:19:23] (03CR) 10Dzahn: [C:03+1] mailman: update lists.wm.o backend mapping [puppet] - 10https://gerrit.wikimedia.org/r/1219062 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [09:19:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T413525)', diff saved to https://phabricator.wikimedia.org/P87291 and previous config saved to /var/cache/conftool/dbconfig/20260112-091950-marostegui.json [09:19:54] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:20:07] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:20:07] RESOLVED: JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:20:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1193.eqiad.wmnet with reason: Maintenance [09:20:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling 
db1193 (T413525)', diff saved to https://phabricator.wikimedia.org/P87292 and previous config saved to /var/cache/conftool/dbconfig/20260112-092016-marostegui.json [09:22:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P87293 and previous config saved to /var/cache/conftool/dbconfig/20260112-092215-marostegui.json [09:22:53] 06SRE, 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11511186 (10trueg) Then maybe I am doing it wrong, because when I try to login on grafana.mediawiki.org via my dev account I get "Service access denied due to missing privileges." [09:23:04] (03CR) 10Cathal Mooney: [C:03+1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [09:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:25:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T413525)', diff saved to https://phabricator.wikimedia.org/P87294 and previous config saved to /var/cache/conftool/dbconfig/20260112-092520-marostegui.json [09:25:25] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:25:55] 06SRE, 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11511202 (10Dzahn) @trueg Yes, dev account is LDAP account. What exact user name are you trying to use to login? How about "Trueg" vs. "trueg" ? (Do not use "STrug-WMF"). 
[09:26:05] (03PS5) 10Vgutierrez: cache::haproxy: Get rid of http-request after use_backend warning [puppet] - 10https://gerrit.wikimedia.org/r/1215119 [09:26:25] 06SRE, 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11511203 (10Dzahn) 05Resolved→03Open [09:27:08] (03PS2) 10Arnaudb: mailman: record update for lists.wm.o [dns] - 10https://gerrit.wikimedia.org/r/1219061 (https://phabricator.wikimedia.org/T286066) [09:27:08] (03CR) 10Arnaudb: "matching change for 1219062" [dns] - 10https://gerrit.wikimedia.org/r/1219061 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [09:27:51] (03CR) 10Ayounsi: [C:03+2] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [09:28:42] (03CR) 10Dzahn: [C:03+1] mailman: record update for lists.wm.o [dns] - 10https://gerrit.wikimedia.org/r/1219061 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [09:29:58] (03CR) 10D3r1ck01: [C:03+1] "I think this can be deployed. 
Wikis are on `wmf.10`" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214584 (https://phabricator.wikimedia.org/T410908) (owner: 10Cparle) [09:32:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T413525)', diff saved to https://phabricator.wikimedia.org/P87295 and previous config saved to /var/cache/conftool/dbconfig/20260112-093223-marostegui.json [09:32:27] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:32:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2165.codfw.wmnet with reason: Maintenance [09:32:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2165 (T413525)', diff saved to https://phabricator.wikimedia.org/P87296 and previous config saved to /var/cache/conftool/dbconfig/20260112-093237-marostegui.json [09:34:24] 06SRE, 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11511254 (10Dzahn) a:05JMeybohm→03None [09:35:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P87297 and previous config saved to /var/cache/conftool/dbconfig/20260112-093528-marostegui.json [09:37:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T413525)', diff saved to https://phabricator.wikimedia.org/P87298 and previous config saved to /var/cache/conftool/dbconfig/20260112-093731-marostegui.json [09:37:35] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:38:59] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11511277 (10Dzahn) a:03KOfori [09:40:04] 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - 
https://phabricator.wikimedia.org/T414192#11511301 (10Dzahn) a:03gmodena [09:42:24] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11511319 (10Dzahn) Hi @KReid-WMF Do you only need access to dashboards (without specifically private data) and that's it? Could you take a look at https://wiki... [09:42:37] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11511320 (10Dzahn) a:03KReid-WMF [09:43:09] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11511332 (10Dzahn) a:03SEgt-WMF [09:43:32] (03CR) 10Dreamy Jazz: [C:03+1] Stop setting $wgBlockTargetMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225165 (https://phabricator.wikimedia.org/T355034) (owner: 10Zabe) [09:43:42] (03Merged) 10jenkins-bot: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [09:43:44] jouncebot: nowandnext [09:43:44] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [09:43:44] In 1 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T1100) [09:44:26] 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11511337 (10gmodena) Hey @JMeybohm, >>! In T414192#11507408, @JMeybohm wrote: > @trueg could you please specify what access level you're requesting/what you need access to (see https://wikitec... 
[09:44:54] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Get rid of http-request after use_backend warning [puppet] - 10https://gerrit.wikimedia.org/r/1215119 (owner: 10Vgutierrez) [09:45:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P87299 and previous config saved to /var/cache/conftool/dbconfig/20260112-094537-marostegui.json [09:47:24] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2249 slowly with 10 steps - repooling [09:47:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P87301 and previous config saved to /var/cache/conftool/dbconfig/20260112-094740-marostegui.json [09:55:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T413525)', diff saved to https://phabricator.wikimedia.org/P87302 and previous config saved to /var/cache/conftool/dbconfig/20260112-095545-marostegui.json [09:55:49] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:56:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1203.eqiad.wmnet with reason: Maintenance [09:56:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1203 (T413525)', diff saved to https://phabricator.wikimedia.org/P87303 and previous config saved to /var/cache/conftool/dbconfig/20260112-095610-marostegui.json [09:57:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P87304 and previous config saved to /var/cache/conftool/dbconfig/20260112-095748-marostegui.json [10:00:09] (03CR) 10Muehlenhoff: [C:03+2] Remove Puppet 5 settings from late_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:01:13] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T413525)', diff saved to https://phabricator.wikimedia.org/P87305 and previous config saved to /var/cache/conftool/dbconfig/20260112-100112-marostegui.json [10:01:18] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:03:30] (03PS5) 10Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) [10:04:12] (03PS6) 10Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) [10:07:52] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: 10Ayounsi) [10:07:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T413525)', diff saved to https://phabricator.wikimedia.org/P87307 and previous config saved to /var/cache/conftool/dbconfig/20260112-100756-marostegui.json [10:08:00] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:08:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2166.codfw.wmnet with reason: Maintenance [10:08:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2166 (T413525)', diff saved to https://phabricator.wikimedia.org/P87308 and previous config saved to /var/cache/conftool/dbconfig/20260112-100821-marostegui.json [10:10:33] (03PS2) 10Muehlenhoff: puppet: Remove the force_puppet7 parameter [puppet] - 10https://gerrit.wikimedia.org/r/1224605 (https://phabricator.wikimedia.org/T365798) [10:11:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1203 gradually with 4 steps - repooling [10:12:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1224605 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:14:41] (03CR) 10Elukey: [C:03+1] "Looks really good, thanks for the explanation!" [puppet] - 10https://gerrit.wikimedia.org/r/1225021 (https://phabricator.wikimedia.org/T412451) (owner: 10JHathaway) [10:15:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T413525)', diff saved to https://phabricator.wikimedia.org/P87310 and previous config saved to /var/cache/conftool/dbconfig/20260112-101520-marostegui.json [10:15:25] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:17:28] (03PS7) 10Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) [10:17:34] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: 10Ayounsi) [10:18:02] (03PS8) 10Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) [10:18:07] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: 10Ayounsi) [10:18:25] !log fetch HAProxy 2.8.18 on thirdparty/haproxy28-bullseye (apt.wm.o) - T414318 [10:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:28] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [10:19:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2166 gradually with 4 steps - repooling [10:20:13] (03CR) 10Elukey: [C:03+2] docker_registry: set backend redirects for the various storages [puppet] - 10https://gerrit.wikimedia.org/r/1225466 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [10:20:15] !log marostegui@cumin1003 DONE 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [10:20:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [10:20:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1169 (T413525)', diff saved to https://phabricator.wikimedia.org/P87313 and previous config saved to /var/cache/conftool/dbconfig/20260112-102044-marostegui.json [10:20:48] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:22:08] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[7008,7016].*} and A:cp - test haproxy 2.8.18 upgrade (T414318) [10:23:21] (03PS9) 10Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) [10:23:30] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: 10Ayounsi) [10:23:37] 06SRE, 06Infrastructure-Foundations, 10netops: Offline script - adjust to work with fundraising - https://phabricator.wikimedia.org/T414321 (10cmooney) 03NEW p:05Triage→03Medium [10:23:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1225021 (https://phabricator.wikimedia.org/T412451) (owner: 10JHathaway) [10:25:45] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: changed REST sandbox rerouting to redirection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224838 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [10:26:20] (03CR) 10Muehlenhoff: [C:03+2] puppet: Remove the force_puppet7 parameter [puppet] - 10https://gerrit.wikimedia.org/r/1224605 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:26:44] (03PS1) 10Federico Ceratto: mariadb: Send IRC notifications for GTID issues [alerts] - 10https://gerrit.wikimedia.org/r/1225499 (https://phabricator.wikimedia.org/T315642) [10:27:41] (03CR) 10Marostegui: [C:03+1] mariadb: Send IRC notifications for GTID issues [alerts] - 10https://gerrit.wikimedia.org/r/1225499 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [10:28:13] (03CR) 10CI reject: [V:04-1] mariadb: Send IRC notifications for GTID issues [alerts] - 10https://gerrit.wikimedia.org/r/1225499 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [10:29:05] (03CR) 10Elukey: [C:03+2] debian installer: format EFI partions [puppet] - 10https://gerrit.wikimedia.org/r/1225021 (https://phabricator.wikimedia.org/T412451) (owner: 10JHathaway) [10:32:40] (03PS2) 10Federico Ceratto: mariadb: Send IRC notifications for GTID issues [alerts] - 10https://gerrit.wikimedia.org/r/1225499 (https://phabricator.wikimedia.org/T315642) [10:33:16] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from insetup roles [puppet] - 10https://gerrit.wikimedia.org/r/1225502 (https://phabricator.wikimedia.org/T365798) [10:34:25] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp[7008,7016].*} and A:cp - test haproxy 2.8.18 upgrade (T414318) [10:34:27] (03PS3) 10Slyngshede: P:cache::haproxy: check 
existance of mmdb files [puppet] - 10https://gerrit.wikimedia.org/r/1224897 (https://phabricator.wikimedia.org/T414111) [10:34:29] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [10:35:58] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Send IRC notifications for GTID issues [alerts] - 10https://gerrit.wikimedia.org/r/1225499 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [10:37:35] (03Merged) 10jenkins-bot: mariadb: Send IRC notifications for GTID issues [alerts] - 10https://gerrit.wikimedia.org/r/1225499 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [10:41:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1203 gradually with 4 steps - repooling [10:44:34] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:47:20] (03CR) 10Clément Goubert: "This is for access inside the network so that kubernetes ingress works correctly." 
[dns] - 10https://gerrit.wikimedia.org/r/1224652 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert) [10:48:28] (03CR) 10Clément Goubert: [C:03+2] aux-k8s: Add redioscope namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224644 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert) [10:48:31] (03CR) 10Clément Goubert: [C:03+2] deploy: Add redioscope kubeconfig to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1224643 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert) [10:49:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2145.codfw.wmnet with reason: Maintenance [10:49:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T413525)', diff saved to https://phabricator.wikimedia.org/P87321 and previous config saved to /var/cache/conftool/dbconfig/20260112-104941-marostegui.json [10:49:45] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:49:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2166 gradually with 4 steps - repooling [10:51:19] (03PS1) 10Blake: switchdc: Delete services cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/1225500 [10:51:50] (03PS1) 10Gehel: blazegraph: relax categories update lag alert [alerts] - 10https://gerrit.wikimedia.org/r/1225507 [10:52:44] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11511679 (10Xqt) 05Resolved→03Open This is not solved yet for Pywikibot tests. A significant number of tests are still failing, and I have not been able to fin... 
[10:53:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T413525)', diff saved to https://phabricator.wikimedia.org/P87323 and previous config saved to /var/cache/conftool/dbconfig/20260112-105316-marostegui.json [10:55:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [10:55:31] OMFG I completely forgot I was doing a deploy and left it on "testservers" for two hours [10:55:40] I'm sorry [10:56:17] the problem with tabs, they get lost, having a terminal open at least remind me something is happening. I need to figure out how not to make this mistake again [10:56:21] (03PS2) 10Muehlenhoff: beta::mediawiki_packages: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1223677 [10:56:27] (03PS1) 10Muehlenhoff: Remove obsolete site.pp entry for bast7001 [puppet] - 10https://gerrit.wikimedia.org/r/1225513 [10:56:27] (03Merged) 10jenkins-bot: aux-k8s: Add redioscope namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224644 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert) [10:57:23] (03Merged) 10jenkins-bot: extension-list: Add Test Kitchen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [10:57:23] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from insetup roles [puppet] - 10https://gerrit.wikimedia.org/r/1225502 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:57:33] !log urbanecm@deploy2002 mwscript-k8s job started: CentralAuth:emptyGlobalUserGroup.php --wiki=metawiki oathauth-tester # T411360 [10:57:36] T411360: cleanup - depopuplate global oathauth-tester group - https://phabricator.wikimedia.org/T411360 [10:58:06] !log ladsgroup@deploy2002 Started scap sync-world: Backport for 
[[gerrit:1216847|extension-list: Add Test Kitchen (T407806)]] [10:58:09] T407806: Rename Metrics Platform Extension to Test Kitchen - https://phabricator.wikimedia.org/T407806 [10:58:35] (03PS2) 10Blake: switchdc: Delete services cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/1225500 (https://phabricator.wikimedia.org/T412211) [10:59:45] Amir1: tsk tsk tsk :P [11:00:00] I did wonder why my security deployment just was not running [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T1100) [11:00:19] Without any provided logs [11:00:30] There's a couple potential options, one would be to have a similar mechanism as we have for the cookbooks waiting for input [11:00:42] (ping the username on irc) [11:01:18] !log Deployed security patch for T414011 [11:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P87325 and previous config saved to /var/cache/conftool/dbconfig/20260112-110324-marostegui.json [11:04:41] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11511748 (10Xqt) [11:06:51] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [11:07:34] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11511756 (10Xqt) >>! In T414173#11509876, @Benwing2 wrote: > i dunno why pywikibot is having issues with retry-after or why it's ending up as a float. My bot has... [11:07:48] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:08:12] !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. 
[11:08:38] !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [11:09:12] !log uploaded dnsmasq 2.92-rc3 to bookworm-wikimedia/main T396864 [11:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:15] T396864: Routed Ganeti: same node DHCP limitation - https://phabricator.wikimedia.org/T396864 [11:09:52] (03CR) 10Clément Goubert: [C:03+1] beta::mediawiki_packages: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1223677 (owner: 10Muehlenhoff) [11:11:09] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: 10Ayounsi) [11:11:22] (03CR) 10Muehlenhoff: [C:03+2] beta::mediawiki_packages: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1223677 (owner: 10Muehlenhoff) [11:11:51] Amir1: no problem! I'm still here [11:13:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P87326 and previous config saved to /var/cache/conftool/dbconfig/20260112-111332-marostegui.json [11:16:02] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225519 (https://phabricator.wikimedia.org/T365798) [11:19:11] FIRING: [6x] ProbeDown: Service wdqs1024:443 has failed probes (http_wdqs_scholarly_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:19:38] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from swift/ceph hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225522 (https://phabricator.wikimedia.org/T365798) [11:22:19] !log ladsgroup@deploy2002 cjming, ladsgroup: Backport for [[gerrit:1216847|extension-list: Add Test Kitchen (T407806)]] synced to the testservers 
(see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:22:22] T407806: Rename Metrics Platform Extension to Test Kitchen - https://phabricator.wikimedia.org/T407806 [11:23:34] !log ladsgroup@deploy2002 cjming, ladsgroup: Continuing with sync [11:23:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T413525)', diff saved to https://phabricator.wikimedia.org/P87328 and previous config saved to /var/cache/conftool/dbconfig/20260112-112341-marostegui.json [11:23:45] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [11:23:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [11:24:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1184 (T413525)', diff saved to https://phabricator.wikimedia.org/P87329 and previous config saved to /var/cache/conftool/dbconfig/20260112-112405-marostegui.json [11:33:57] !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-test-druid1001.eqiad.wmnet with reason: Testing druid upgrade [11:34:27] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete site.pp entry for bast7001 [puppet] - 10https://gerrit.wikimedia.org/r/1225513 (owner: 10Muehlenhoff) [11:35:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T413525)', diff saved to https://phabricator.wikimedia.org/P87331 and previous config saved to /var/cache/conftool/dbconfig/20260112-113541-marostegui.json [11:35:46] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [11:36:12] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216847|extension-list: Add Test Kitchen (T407806)]] (duration: 38m 06s) [11:36:16] T407806: Rename Metrics Platform Extension to Test Kitchen - 
https://phabricator.wikimedia.org/T407806 [11:36:23] (03CR) 10MVernon: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1225522 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [11:37:16] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11511867 (10elukey) Adding an IRC conversation between me and Matthew about DC replication: ` Emperor: o/ morning :) I have been testing the apus S3 api v... [11:38:49] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from traffic hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225524 (https://phabricator.wikimedia.org/T365798) [11:40:21] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225525 (https://phabricator.wikimedia.org/T365798) [11:43:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:59] (03PS1) 10Elukey: profile::docker_registry: turn off backend redirects for Swift [puppet] - 10https://gerrit.wikimedia.org/r/1225526 (https://phabricator.wikimedia.org/T390251) [11:44:25] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1225526 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [11:45:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P87332 and previous config saved to /var/cache/conftool/dbconfig/20260112-114549-marostegui.json [11:48:43] Amir1: are you able to deploy also the next one? The one that enables the extension for Beta Cluster. 
If I haven't understood wrongly, the testwiki one is the one you wanted to delay for a couple of hours, right? [11:52:02] (03CR) 10Hnowlan: [C:03+1] wmnet: Add redioscope CNAMES [dns] - 10https://gerrit.wikimedia.org/r/1224652 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert) [11:52:34] (03PS1) 10AikoChou: ml-services: Update image and add client timeout for revise-tone-task-generator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225529 (https://phabricator.wikimedia.org/T412210) [11:53:26] (03CR) 10Clément Goubert: [C:03+2] wmnet: Add redioscope CNAMES [dns] - 10https://gerrit.wikimedia.org/r/1224652 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert) [11:53:27] (03PS1) 10Muehlenhoff: Remove obsolete failoid entry from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1225530 [11:54:03] !log cgoubert@dns1004 START - running authdns-update [11:54:26] (03PS1) 10Kevin Bazira: ml-services: update rr-wikidata model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225531 (https://phabricator.wikimedia.org/T414060) [11:55:09] !log cgoubert@dns1004 END - running authdns-update [11:55:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P87334 and previous config saved to /var/cache/conftool/dbconfig/20260112-115557-marostegui.json [11:57:59] (03PS2) 10Kevin Bazira: ml-services: update rr-wikidata model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225531 (https://phabricator.wikimedia.org/T414060) [12:00:50] (03CR) 10JMeybohm: [C:03+1] profile::docker_registry: turn off backend redirects for Swift [puppet] - 10https://gerrit.wikimedia.org/r/1225526 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [12:02:44] 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11511958 (10JMeybohm) a:05gmodena→03DSantamaria [12:03:28] 
!log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2249 slowly with 10 steps - repooling [12:03:50] 06SRE, 10SRE-Access-Requests, 06Release-Engineering-Team: Add yubikey ssh key for dancy - https://phabricator.wikimedia.org/T414032#11511963 (10JMeybohm) a:03dancy [12:04:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:04:26] (03CR) 10Jcrespo: [C:03+1] Remove profile::puppet::agent::force_puppet7 from backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225519 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:04:36] (03CR) 10Gkyziridis: [C:03+1] "THNX for deploying." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225531 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [12:05:07] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update rr-wikidata model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225531 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [12:05:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T413525)', diff saved to https://phabricator.wikimedia.org/P87336 and previous config saved to /var/cache/conftool/dbconfig/20260112-120509-marostegui.json [12:05:13] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [12:06:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T413525)', diff saved to https://phabricator.wikimedia.org/P87337 and previous config saved to /var/cache/conftool/dbconfig/20260112-120606-marostegui.json [12:06:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance 
[12:06:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T413525)', diff saved to https://phabricator.wikimedia.org/P87338 and previous config saved to /var/cache/conftool/dbconfig/20260112-120631-marostegui.json [12:06:38] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete failoid entry from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1225530 (owner: 10Muehlenhoff) [12:07:19] (03Merged) 10jenkins-bot: ml-services: update rr-wikidata model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225531 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [12:08:31] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:09:45] (03PS1) 10Muehlenhoff: Add Sukhbir as approver for varnish-log-readers [puppet] - 10https://gerrit.wikimedia.org/r/1225534 (https://phabricator.wikimedia.org/T276465) [12:11:32] (03PS1) 10Giuseppe Lavagetto: cache:haproxy: Add Lua-based contact info extraction [puppet] - 10https://gerrit.wikimedia.org/r/1225536 (https://phabricator.wikimedia.org/T414300) [12:11:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11511996 (10cmooney) [12:11:51] 06SRE, 06Infrastructure-Foundations, 10netops: Offline script - adjust to work with fundraising - https://phabricator.wikimedia.org/T414321#11511995 (10cmooney) [12:15:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P87339 and previous config saved to /var/cache/conftool/dbconfig/20260112-121517-marostegui.json [12:17:30] !log revoked legacy default-staging-certificate certificate T365798 [12:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:34] T365798: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798 [12:19:17] !log revoked legacy 
ganeti02.svc.esams certificate T365798 [12:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:31] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1225538 (owner: 10L10n-bot) [12:24:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:25:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P87340 and previous config saved to /var/cache/conftool/dbconfig/20260112-122525-marostegui.json [12:26:38] !log revoked legacy linkrecommendation discovery certificate T365798 [12:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:41] T365798: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798 [12:28:45] (03CR) 10Bartosz Wójtowicz: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1225529 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou) [12:34:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:35:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T413525)', diff saved to https://phabricator.wikimedia.org/P87341 and previous config saved to /var/cache/conftool/dbconfig/20260112-123533-marostegui.json [12:35:38] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [12:35:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance [12:35:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1186 (T413525)', diff saved to https://phabricator.wikimedia.org/P87342 and previous config saved to /var/cache/conftool/dbconfig/20260112-123546-marostegui.json [12:39:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T413525)', diff saved to https://phabricator.wikimedia.org/P87343 and previous config saved to /var/cache/conftool/dbconfig/20260112-123918-marostegui.json [12:41:41] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:42:02] (03CR) 10AikoChou: [C:03+2] "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1225529 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou) [12:42:32] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Improve database master switchover script - https://phabricator.wikimedia.org/T200306#11512075 (10Marostegui) →14Duplicate dup:03T409926 [12:42:46] (03PS1) 10Clément Goubert: api-gateway: Add external services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225548 (https://phabricator.wikimedia.org/T414333) [12:43:30] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225519 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:43:50] (03Merged) 10jenkins-bot: ml-services: Update image and add client timeout for revise-tone-task-generator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225529 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou) [12:48:34] (03PS1) 10Gkyziridis: ml-services: Remove revertrisk-wikidata from revesion-models ns. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225553 (https://phabricator.wikimedia.org/T406179) [12:49:01] (03CR) 10Hnowlan: [C:03+1] "Makes sense to me, one nit, one question. I would like for the traffic team to see this before merging if possible." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [12:49:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P87344 and previous config saved to /var/cache/conftool/dbconfig/20260112-124926-marostegui.json [12:52:23] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Remove revertrisk-wikidata from revesion-models ns. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1225553 (https://phabricator.wikimedia.org/T406179) (owner: 10Gkyziridis) [12:54:34] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [12:54:52] (03CR) 10Gkyziridis: [C:03+2] ml-services: Remove revertrisk-wikidata from revesion-models ns. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225553 (https://phabricator.wikimedia.org/T406179) (owner: 10Gkyziridis) [12:56:27] 06SRE, 10Scap, 06serviceops, 07Datacenter-Switchover: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11512111 (10Clement_Goubert) a:03Blake [12:56:38] (03Merged) 10jenkins-bot: ml-services: Remove revertrisk-wikidata from revesion-models ns. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225553 (https://phabricator.wikimedia.org/T406179) (owner: 10Gkyziridis) [12:57:17] (03CR) 10Marostegui: [C:03+1] "This can be shipped" [puppet] - 10https://gerrit.wikimedia.org/r/1217492 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [12:58:18] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from swift/ceph hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225522 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:58:53] !log gkyziridis@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[12:59:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P87345 and previous config saved to /var/cache/conftool/dbconfig/20260112-125934-marostegui.json [13:00:41] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225556 [13:03:04] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2240.codfw.wmnet with reason: Maintenance [13:07:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2240 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87346 and previous config saved to /var/cache/conftool/dbconfig/20260112-130754-marostegui.json [13:07:59] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [13:08:00] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [13:08:47] (03CR) 10Federico Ceratto: "That's x1 - the cookbook is for:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [13:09:43] (03CR) 10Federico Ceratto: [C:03+2] prometheus-mariadb-replication-lag.py: mysql_heartbeat_lag_seconds metric [puppet] - 10https://gerrit.wikimedia.org/r/1217492 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [13:09:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T413525)', diff saved to 
https://phabricator.wikimedia.org/P87347 and previous config saved to /var/cache/conftool/dbconfig/20260112-130943-marostegui.json [13:09:47] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [13:09:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance [13:09:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T413525)', diff saved to https://phabricator.wikimedia.org/P87348 and previous config saved to /var/cache/conftool/dbconfig/20260112-130952-marostegui.json [13:10:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T413525)', diff saved to https://phabricator.wikimedia.org/P87349 and previous config saved to /var/cache/conftool/dbconfig/20260112-131003-marostegui.json [13:12:03] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[13:17:55] (03CR) 10Muehlenhoff: [C:03+2] durum: Enable Bird 2.18 for all servers [puppet] - 10https://gerrit.wikimedia.org/r/1224704 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [13:19:23] (03CR) 10Ayounsi: [C:03+2] Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: 10Ayounsi) [13:20:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P87350 and previous config saved to /var/cache/conftool/dbconfig/20260112-132000-marostegui.json [13:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:30:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P87351 and previous config saved to /var/cache/conftool/dbconfig/20260112-133008-marostegui.json [13:33:12] (03PS1) 10Ayounsi: Routed ganeti: ensure dnsmasq is installed before being used [puppet] - 10https://gerrit.wikimedia.org/r/1225561 (https://phabricator.wikimedia.org/T396864) [13:33:30] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1225561 (https://phabricator.wikimedia.org/T396864) (owner: 10Ayounsi) [13:34:56] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1225561 (https://phabricator.wikimedia.org/T396864) (owner: 10Ayounsi) [13:36:25] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:36:50] (03CR) 10Ayounsi: [C:03+2] Routed ganeti: ensure dnsmasq is installed before being used [puppet] - 10https://gerrit.wikimedia.org/r/1225561 (https://phabricator.wikimedia.org/T396864) (owner: 10Ayounsi) [13:38:19] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-test-druid1001.eqiad.wmnet with OS bookworm [13:39:12] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-druid1001.eqiad.wmnet with OS bookworm [13:39:57] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [13:40:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T413525)', diff saved to https://phabricator.wikimedia.org/P87352 and previous config saved to /var/cache/conftool/dbconfig/20260112-134016-marostegui.json [13:40:20] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [13:40:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1195.eqiad.wmnet with reason: Maintenance [13:40:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1195 (T413525)', diff saved to https://phabricator.wikimedia.org/P87353 and previous config saved to /var/cache/conftool/dbconfig/20260112-134040-marostegui.json [13:41:51] (03CR) 10Marostegui: "Thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [13:44:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T413525)', diff saved to https://phabricator.wikimedia.org/P87354 and previous config saved to /var/cache/conftool/dbconfig/20260112-134404-marostegui.json [13:46:25] RESOLVED: SystemdUnitFailed: dnsmasq.service on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:47:56] FIRING: MaxConntrack: Max conntrack at 100% on ganeti3005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [13:52:56] RESOLVED: MaxConntrack: Max conntrack at 100% on ganeti3005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [13:54:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P87355 and previous config saved to /var/cache/conftool/dbconfig/20260112-135413-marostegui.json [13:56:56] FIRING: MaxConntrack: Max conntrack at 99.99% on ganeti3005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [13:58:25] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [13:59:11] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2007.codfw.wmnet with OS bookworm [14:00:05] Lucas_WMDE, Urbanecm, and 
TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T1400). [14:00:05] cscott and AaronSchulz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] o/ [14:00:16] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2007.codfw.wmnet with OS bookworm [14:00:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:54] I can deploy if needed, though I guess both of you can also self-deploy :) [14:01:56] RESOLVED: MaxConntrack: Max conntrack at 92.03% on ganeti3005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [14:02:35] o/ [14:02:41] i can self-deploy, too. [14:03:01] shall i get started? 
[14:03:07] yeah, you’re first in line ^^ [14:03:16] i'm going to do both at once [14:03:21] they're both config changes [14:03:29] ack [14:04:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P87356 and previous config saved to /var/cache/conftool/dbconfig/20260112-140421-marostegui.json [14:04:35] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [14:05:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224719 (https://phabricator.wikimedia.org/T413108) (owner: 10C. Scott Ananian) [14:05:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224169 (https://phabricator.wikimedia.org/T414019) (owner: 10C. Scott Ananian) [14:05:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:17] (03Merged) 10jenkins-bot: Increase PRV percentage on fawiki/kowiki/azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224719 (https://phabricator.wikimedia.org/T413108) (owner: 10C. Scott Ananian) [14:06:20] (03Merged) 10jenkins-bot: Turn off magic ISBN/RFC/PMID links on iawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224169 (https://phabricator.wikimedia.org/T414019) (owner: 10C. 
Scott Ananian) [14:06:41] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1224719|Increase PRV percentage on fawiki/kowiki/azwiki (T413108)]], [[gerrit:1224169|Turn off magic ISBN/RFC/PMID links on iawiki (T414019)]] [14:06:46] T413108: Parsoid Read Views to deploy ~2026-01-01 - https://phabricator.wikimedia.org/T413108 [14:06:46] T414019: Turn off ISBN/RFC/PMID magic links on iawiki - https://phabricator.wikimedia.org/T414019 [14:10:30] !log cscott@deploy2002 cscott: Backport for [[gerrit:1224719|Increase PRV percentage on fawiki/kowiki/azwiki (T413108)]], [[gerrit:1224169|Turn off magic ISBN/RFC/PMID links on iawiki (T414019)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:11:00] testing [14:14:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T413525)', diff saved to https://phabricator.wikimedia.org/P87357 and previous config saved to /var/cache/conftool/dbconfig/20260112-141429-marostegui.json [14:14:34] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:14:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [14:14:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2170 (T413525)', diff saved to https://phabricator.wikimedia.org/P87358 and previous config saved to /var/cache/conftool/dbconfig/20260112-141443-marostegui.json [14:14:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T413525)', diff saved to https://phabricator.wikimedia.org/P87359 and previous config saved to /var/cache/conftool/dbconfig/20260112-141454-marostegui.json [14:17:10] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:14] !log cscott@deploy2002 cscott: Continuing with sync [14:18:20] ok, looks good [14:19:11] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [14:19:14] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219062 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [14:20:26] o_O what on earth is this ParseError in logspam-watch: (/srv/parsoid-testing/src/Parsoid.php:168) syntax error, unexpected token "=" [14:21:33] must be some custom parsoidtest1001 stuff I guess https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/bdda3794a4/wmf-config/CommonSettings.php#5033 [14:21:55] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:22:15] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224719|Increase PRV percentage on fawiki/kowiki/azwiki (T413108)]], [[gerrit:1224169|Turn off magic ISBN/RFC/PMID links on iawiki (T414019)]] (duration: 15m 34s) [14:22:21] T413108: Parsoid Read Views to deploy ~2026-01-01 - https://phabricator.wikimedia.org/T413108 [14:22:21] T414019: Turn off ISBN/RFC/PMID magic links on iawiki - https://phabricator.wikimedia.org/T414019 [14:22:51] over to AaronSchulz? 
[14:22:54] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1029.eqiad.wmnet with OS trixie [14:23:02] Lucas_WMDE: (yes, subbu was live-debugging a tricky bug on parsoidtest1001 over the weekend) [14:23:07] ok [14:23:07] over to AaronSchulz , yep [14:23:11] thanks ^^ [14:23:19] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube1335-59 servers - jclark@cumin1003" [14:23:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube1335-59 servers - jclark@cumin1003" [14:23:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:24:11] the logs from parsoid-test *should* be in a different channel from ordinary production logs, because it's also where our weekly pre-deploy testing occurs. [14:25:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P87360 and previous config saved to /var/cache/conftool/dbconfig/20260112-142502-marostegui.json [14:25:27] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [14:26:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance [14:26:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1247 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87361 and previous config saved to /var/cache/conftool/dbconfig/20260112-142642-marostegui.json [14:26:48] cscott: you’re right, I can’t see it in logstash. 
I guess logspam-watch reads slightly different sources [14:26:48] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:26:48] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [14:28:12] Lucas_WMDE: ok, just checking, because we've gotten those logs misconfigured before, so i wanted to make sure that didn't regress -- when our testing logs go to main or /dev/null we can potentially deploy a version of parsoid that logspams (although it presumably would have passed our other testing, so it wouldn't be *broken* necessarily) so it's always nice to check that logging is still working. [14:28:54] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube1335-59 servers - jclark@cumin1003" [14:28:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube1335-59 servers - jclark@cumin1003" [14:28:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:29:58] 06SRE, 06Infrastructure-Foundations: Avoid dhcpcd-base on trixie hosts - https://phabricator.wikimedia.org/T414341 (10MoritzMuehlenhoff) 03NEW [14:31:14] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1335 [14:31:33] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1335 [14:31:38] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1336 [14:31:51] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1336 [14:31:56] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host 
wikikube-worker1337 [14:32:09] (03PS1) 10Btullis: Add a second Yubikey SSH key to the account for btullis [puppet] - 10https://gerrit.wikimedia.org/r/1225573 (https://phabricator.wikimedia.org/T409279) [14:32:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1337 [14:32:19] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1338 [14:32:32] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1338 [14:32:36] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1339 [14:32:41] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1339 [14:32:53] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1340 [14:32:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1340 [14:33:06] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1341 [14:33:07] (03PS4) 10Arnaudb: mailman: update lists.wm.o backend mapping [puppet] - 10https://gerrit.wikimedia.org/r/1219062 (https://phabricator.wikimedia.org/T286066) [14:33:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1341 [14:33:14] (03PS4) 10Arnaudb: mailman: add lists to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1219151 (https://phabricator.wikimedia.org/T286066) [14:33:16] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1342 [14:33:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1342 [14:33:26] !log jclark@cumin1003 START - 
Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1343 [14:33:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1343 [14:33:50] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11512448 (10MoritzMuehlenhoff) [14:34:14] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie [14:34:24] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1343 [14:34:30] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1343 [14:34:34] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1344 [14:34:41] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1344 [14:34:46] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1345 [14:34:52] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1345 [14:35:01] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1346 [14:35:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P87362 and previous config saved to /var/cache/conftool/dbconfig/20260112-143510-marostegui.json [14:35:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1346 [14:35:32] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1350 [14:35:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
wikikube-worker1350 [14:35:49] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1351 [14:35:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1351 [14:36:02] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1352 [14:36:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1352 [14:36:13] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1353 [14:36:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1353 [14:36:25] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1354 [14:36:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1354 [14:36:38] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1355 [14:36:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1355 [14:36:50] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1356 [14:36:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1356 [14:37:02] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1357 [14:37:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1357 [14:37:15] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1358 [14:37:25] !log jclark@cumin1003 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1358 [14:37:30] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1358 [14:37:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1358 [14:37:42] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1359 [14:37:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1359 [14:40:04] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225576 (https://phabricator.wikimedia.org/T365798) [14:41:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11512481 (10Jclark-ctr) [14:42:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11512484 (10Jclark-ctr) All servers racked and netbox configured holding for update on 4ft power cables [14:43:42] !log zabe@deploy2002 mwscript-k8s job started: foreachwikiindblist all refreshImageMetadata.php --mediatype AUDIO --mime unknown/flac --force # T414259 [14:43:45] T414259: MP3 and flac files with wrong MIME type on Commons - https://phabricator.wikimedia.org/T414259 [14:45:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T413525)', diff saved to https://phabricator.wikimedia.org/P87363 and previous config saved to /var/cache/conftool/dbconfig/20260112-144518-marostegui.json [14:45:23] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:45:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1196.eqiad.wmnet with reason: 
Maintenance [14:45:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:46:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1196 (T413525)', diff saved to https://phabricator.wikimedia.org/P87364 and previous config saved to /var/cache/conftool/dbconfig/20260112-144602-marostegui.json [14:48:01] (03PS1) 10Ayounsi: CHANGELOG: add changelogs for release v0.11.1 [software/homer] - 10https://gerrit.wikimedia.org/r/1225579 [14:48:07] 10ops-eqiad, 06DC-Ops: Power Supply - PS Redundancy - issue on kafka-main1008:9290 - https://phabricator.wikimedia.org/T414344 (10phaultfinder) 03NEW [14:48:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T413525)', diff saved to https://phabricator.wikimedia.org/P87365 and previous config saved to /var/cache/conftool/dbconfig/20260112-144843-marostegui.json [14:48:49] (03Abandoned) 10Btullis: Revert "Failover the hive server2 and metastore services to the standby" [dns] - 10https://gerrit.wikimedia.org/r/1224967 (owner: 10Btullis) [14:49:16] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS trixie [14:49:34] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1225573 (https://phabricator.wikimedia.org/T409279) (owner: 10Btullis) [14:50:30] !log empty mediasearch-tester group on commons # T372004 [14:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:34] T372004: Remove non-existent user groups on Commons - https://phabricator.wikimedia.org/T372004 [14:50:38] !log empty machinevision-tester group on commons # T372004 [14:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:51] !log zabe@deploy2002 mwscript-k8s job started: foreachwikiindblist 
all refreshImageMetadata.php --mediatype AUDIO --mime unknown/flac --oldimages --force # T414259 [14:51:54] T414259: MP3 and flac files with wrong MIME type on Commons - https://phabricator.wikimedia.org/T414259 [14:51:59] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage [14:52:11] !log zabe@deploy2002 mwscript-k8s job started: foreachwikiindblist all refreshImageMetadata.php --mediatype AUDIO --mime unknown/flac --oldimage --force # T414259 [14:52:40] (03CR) 10Elukey: [C:03+1] CHANGELOG: add changelogs for release v0.11.1 [software/homer] - 10https://gerrit.wikimedia.org/r/1225579 (owner: 10Ayounsi) [14:53:16] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:54:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2004-dev (172.20.5.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:54:43] (03CR) 10Brouberol: "LGTM and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1225573 (https://phabricator.wikimedia.org/T409279) (owner: 10Btullis) [14:54:48] (03CR) 10Brouberol: [C:03+1] Add a second Yubikey SSH key to the account for btullis [puppet] - 10https://gerrit.wikimedia.org/r/1225573 (https://phabricator.wikimedia.org/T409279) (owner: 10Btullis) [14:55:20] (03PS13) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) [14:55:23] (03PS5) 10Elukey: sre.hosts.reimage: remove puppet 5 support and default to 7 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) [14:55:53] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [14:56:12] 10ops-eqiad, 06SRE, 06DC-Ops: Failed Power supply on kafka-main1008 - https://phabricator.wikimedia.org/T414101#11512569 (10Jclark-ctr) The replacement part arrived and has been installed. This has resolved the issue. [14:56:20] 10ops-eqiad, 06SRE, 06DC-Ops: Failed Power supply on kafka-main1008 - https://phabricator.wikimedia.org/T414101#11512572 (10Jclark-ctr) 05Open→03Resolved [14:56:31] (03CR) 10Ssingh: [C:03+1] "Self-approval +1 and TIL about this group :)" [puppet] - 10https://gerrit.wikimedia.org/r/1225534 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [14:57:15] (03CR) 10Muehlenhoff: [C:03+2] Add Sukhbir as approver for varnish-log-readers [puppet] - 10https://gerrit.wikimedia.org/r/1225534 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [14:57:59] (03CR) 10Zabe: [C:03+2] Stop setting $wgBlockTargetMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225165 (https://phabricator.wikimedia.org/T355034) (owner: 10Zabe) [14:58:06] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11512579 (10MoritzMuehlenhoff) [14:58:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P87366 and previous config saved to /var/cache/conftool/dbconfig/20260112-145852-marostegui.json [14:59:15] (03Merged) 10jenkins-bot: Stop setting $wgBlockTargetMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225165 (https://phabricator.wikimedia.org/T355034) (owner: 10Zabe) [14:59:38] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1225165|Stop setting $wgBlockTargetMigrationStage (T355034)]] [14:59:41] T355034: Deploy new block_target schema - 
https://phabricator.wikimedia.org/T355034 [14:59:46] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage [14:59:57] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11512594 (10ABran-WMF) [15:01:32] (03Merged) 10jenkins-bot: sre.hosts.reimage: remove puppet 5 support and default to 7 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [15:01:46] !log zabe@deploy2002 zabe: Backport for [[gerrit:1225165|Stop setting $wgBlockTargetMigrationStage (T355034)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:02:20] !log zabe@deploy2002 zabe: Continuing with sync [15:02:25] (03CR) 10Btullis: [C:03+2] Add a second Yubikey SSH key to the account for btullis [puppet] - 10https://gerrit.wikimedia.org/r/1225573 (https://phabricator.wikimedia.org/T409279) (owner: 10Btullis) [15:04:11] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:07] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:17] !log zabe@deploy2002 mwscript-k8s job started: foreachwikiindblist all refreshImageMetadata.php --mediatype AUDIO --mime unknown/mpeg --force # T414259 [15:05:20] T414259: MP3 and flac files with wrong MIME type on Commons - https://phabricator.wikimedia.org/T414259 [15:06:20] !log 
zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225165|Stop setting $wgBlockTargetMigrationStage (T355034)]] (duration: 06m 43s) [15:06:24] T355034: Deploy new block_target schema - https://phabricator.wikimedia.org/T355034 [15:07:09] (03CR) 10Zabe: [C:03+2] Stop updating Deadendpages and Lonelypages on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225118 (https://phabricator.wikimedia.org/T371662) (owner: 10Zabe) [15:07:10] (03CR) 10Zabe: [C:03+2] Disable updates for Special:GloballyUnusedFiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225088 (https://phabricator.wikimedia.org/T414202) (owner: 10Zabe) [15:08:01] (03Merged) 10jenkins-bot: Disable updates for Special:GloballyUnusedFiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225088 (https://phabricator.wikimedia.org/T414202) (owner: 10Zabe) [15:08:03] (03Merged) 10jenkins-bot: Stop updating Deadendpages and Lonelypages on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225118 (https://phabricator.wikimedia.org/T371662) (owner: 10Zabe) [15:08:29] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1225088|Disable updates for Special:GloballyUnusedFiles (T414202)]], [[gerrit:1225118|Stop updating Deadendpages and Lonelypages on commons (T371662)]] [15:08:34] T414202: Disable GloballyUnusedFiles special page on commons - https://phabricator.wikimedia.org/T414202 [15:08:35] T371662: Disable LonelyPages and Deadendpages on commons - https://phabricator.wikimedia.org/T371662 [15:08:52] (03PS2) 10Giuseppe Lavagetto: cache:haproxy: Add Lua-based contact info extraction [puppet] - 10https://gerrit.wikimedia.org/r/1225536 (https://phabricator.wikimedia.org/T414300) [15:09:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P87367 and previous config saved to /var/cache/conftool/dbconfig/20260112-150900-marostegui.json [15:09:07] !log 
andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2004-dev.codfw.wmnet with reason: host reimage [15:09:11] FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:21] !log zabe@deploy2002 zabe: Backport for [[gerrit:1225088|Disable updates for Special:GloballyUnusedFiles (T414202)]], [[gerrit:1225118|Stop updating Deadendpages and Lonelypages on commons (T371662)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:11:25] (03CR) 10Ayounsi: [C:03+2] CHANGELOG: add changelogs for release v0.11.1 [software/homer] - 10https://gerrit.wikimedia.org/r/1225579 (owner: 10Ayounsi) [15:12:52] !log zabe@deploy2002 mwscript-k8s job started: foreachwikiindblist all refreshImageMetadata.php --mediatype AUDIO --mime unknown/mpeg --oldimage --force # T414259 [15:12:56] T414259: MP3 and flac files with wrong MIME type on Commons - https://phabricator.wikimedia.org/T414259 [15:12:59] (03PS1) 10Vgutierrez: cache::haproxy: Fix JWT validation ACL logic [puppet] - 10https://gerrit.wikimedia.org/r/1225584 [15:13:01] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2004-dev.codfw.wmnet with reason: host reimage [15:14:11] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:14:54] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1030.eqiad.wmnet with OS trixie [15:15:20] (03CR) 10Scott French: [C:03+1] cache::haproxy: Fix JWT validation ACL logic 
[puppet] - 10https://gerrit.wikimedia.org/r/1225584 (owner: 10Vgutierrez) [15:16:09] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-test-druid1001.eqiad.wmnet with OS bookworm [15:16:23] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-druid1001.eqiad.wmnet with OS bookworm [15:17:08] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie [15:19:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T413525)', diff saved to https://phabricator.wikimedia.org/P87368 and previous config saved to /var/cache/conftool/dbconfig/20260112-151908-marostegui.json [15:19:11] FIRING: [6x] ProbeDown: Service wdqs1024:443 has failed probes (http_wdqs_scholarly_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:13] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [15:19:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance [15:19:29] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-test-druid1001.eqiad.wmnet with OS bookworm [15:19:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2173 (T413525)', diff saved to https://phabricator.wikimedia.org/P87369 and previous config saved to /var/cache/conftool/dbconfig/20260112-151934-marostegui.json [15:19:43] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-druid1001.eqiad.wmnet with OS bookworm [15:20:11] (03PS3) 10Tiziano Fogli: Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) [15:20:11] (03PS4) 10Tiziano Fogli: 
Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [15:20:16] !log zabe@deploy2002 Sync cancelled. [15:20:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T413525)', diff saved to https://phabricator.wikimedia.org/P87370 and previous config saved to /var/cache/conftool/dbconfig/20260112-152025-marostegui.json [15:20:41] (03CR) 10CI reject: [V:04-1] Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [15:21:40] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Fix JWT validation ACL logic [puppet] - 10https://gerrit.wikimedia.org/r/1225584 (owner: 10Vgutierrez) [15:23:17] (03PS1) 10Zabe: Correctly disable Special:Deadendpages on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225587 (https://phabricator.wikimedia.org/T371662) [15:24:56] (03CR) 10Zabe: [C:03+2] Correctly disable Special:Deadendpages on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225587 (https://phabricator.wikimedia.org/T371662) (owner: 10Zabe) [15:25:17] 10SRE-Access-Requests, 06Data-Platform-SRE: Requesting deployment access - https://phabricator.wikimedia.org/T414347#11512710 (10BTullis) We will need approval from @Ahoelzl as your manager and from @thcipriani as the approver for the `deployment` group.
[15:25:52] (03Merged) 10jenkins-bot: Correctly disable Special:Deadendpages on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225587 (https://phabricator.wikimedia.org/T371662) (owner: 10Zabe) [15:25:59] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.11.1 [software/homer] - 10https://gerrit.wikimedia.org/r/1225579 (owner: 10Ayounsi) [15:26:20] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1225088|Disable updates for Special:GloballyUnusedFiles (T414202)]], [[gerrit:1225118|Stop updating Deadendpages and Lonelypages on commons (T371662)]], [[gerrit:1225587|Correctly disable Special:Deadendpages on commons (T371662)]] [15:26:25] T414202: Disable GloballyUnusedFiles special page on commons - https://phabricator.wikimedia.org/T414202 [15:26:25] T371662: Disable LonelyPages and Deadendpages on commons - https://phabricator.wikimedia.org/T371662 [15:28:13] 10SRE-Access-Requests, 06Data-Platform-SRE: Requesting deployment access - https://phabricator.wikimedia.org/T414347#11512733 (10AKhatun_WMF) [15:28:15] !log zabe@deploy2002 zabe: Backport for [[gerrit:1225088|Disable updates for Special:GloballyUnusedFiles (T414202)]], [[gerrit:1225118|Stop updating Deadendpages and Lonelypages on commons (T371662)]], [[gerrit:1225587|Correctly disable Special:Deadendpages on commons (T371662)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:29:03] !log zabe@deploy2002 zabe: Continuing with sync [15:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T1530) [15:30:06] (03PS4) 10Tiziano Fogli: Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) [15:30:13] (03PS5) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [15:30:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P87371 and previous config saved to /var/cache/conftool/dbconfig/20260112-153033-marostegui.json [15:33:07] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225088|Disable updates for Special:GloballyUnusedFiles (T414202)]], [[gerrit:1225118|Stop updating Deadendpages and Lonelypages on commons (T371662)]], [[gerrit:1225587|Correctly disable Special:Deadendpages on commons (T371662)]] (duration: 06m 47s) [15:33:12] T414202: Disable GloballyUnusedFiles special page on commons - https://phabricator.wikimedia.org/T414202 [15:33:12] T371662: Disable LonelyPages and Deadendpages on commons - https://phabricator.wikimedia.org/T371662 [15:33:20] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:34:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2004-dev (172.20.5.5) - group cloud_host - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:35:40] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2004-dev.codfw.wmnet with OS trixie [15:35:53] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage [15:36:16] (03PS3) 10Hnowlan: nsca_frack.cfg.erb deprecate check_endpoints service and pay-lvs hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/1202827 (https://phabricator.wikimedia.org/T367370) (owner: 10Jgreen) [15:36:33] 06SRE, 06Infrastructure-Foundations: Avoid dhcpcd-base on trixie hosts - https://phabricator.wikimedia.org/T414341#11512785 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:39:11] RESOLVED: [6x] ProbeDown: Service wdqs1024:443 has failed probes (http_wdqs_scholarly_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:39:53] (03PS1) 10Elukey: pyrra: update the MWHC SLO [puppet] - 10https://gerrit.wikimedia.org/r/1225594 (https://phabricator.wikimedia.org/T401892) [15:39:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage [15:40:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P87373 and previous config saved to /var/cache/conftool/dbconfig/20260112-154041-marostegui.json [15:41:21] (03CR) 10A-pizzata: [C:03+1] pyrra: update the MWHC SLO [puppet] - 10https://gerrit.wikimedia.org/r/1225594 (https://phabricator.wikimedia.org/T401892) (owner: 10Elukey) [15:42:39] (03CR) 10Herron: [C:03+2] nsca_frack.cfg.erb deprecate check_endpoints service and pay-lvs hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/1202827 
(https://phabricator.wikimedia.org/T367370) (owner: 10Jgreen) [15:43:13] (03PS2) 10Elukey: pyrra: update the MWHC SLO [puppet] - 10https://gerrit.wikimedia.org/r/1225594 (https://phabricator.wikimedia.org/T401892) [15:43:46] (03CR) 10A-pizzata: [C:03+1] pyrra: update the MWHC SLO [puppet] - 10https://gerrit.wikimedia.org/r/1225594 (https://phabricator.wikimedia.org/T401892) (owner: 10Elukey) [15:45:59] (03CR) 10Elukey: [C:03+2] pyrra: update the MWHC SLO [puppet] - 10https://gerrit.wikimedia.org/r/1225594 (https://phabricator.wikimedia.org/T401892) (owner: 10Elukey) [15:46:37] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11512830 (10Dzahn) [15:50:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T413525)', diff saved to https://phabricator.wikimedia.org/P87374 and previous config saved to /var/cache/conftool/dbconfig/20260112-155049-marostegui.json [15:50:54] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [15:51:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1206.eqiad.wmnet with reason: Maintenance [15:51:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1206 (T413525)', diff saved to https://phabricator.wikimedia.org/P87375 and previous config saved to /var/cache/conftool/dbconfig/20260112-155113-marostegui.json [15:53:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T413525)', diff saved to https://phabricator.wikimedia.org/P87376 and previous config saved to /var/cache/conftool/dbconfig/20260112-155356-marostegui.json [15:54:31] (03PS1) 10Vgutierrez: cache::haproxy: Consider sessionJwt cookie for JWT validation purposes [puppet] - 10https://gerrit.wikimedia.org/r/1225598 (https://phabricator.wikimedia.org/T400238) [15:55:42] !log andrew@cumin2002 START - 
Cookbook sre.hosts.reimage for host cloudlb1002.eqiad.wmnet with OS trixie [15:56:39] (03CR) 10Brouberol: "@dcausse@wikimedia.org When applying this patch on airflow-search, a `wikimedia-enteprise.yaml` file will be created under `/opt/airflow/s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: 10DCausse) [15:57:10] (03CR) 10Brouberol: "we could define a `read_secret_file` *function*" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: 10DCausse) [15:58:04] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:58:35] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1031.eqiad.wmnet with OS trixie [15:59:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-d5-eqiad and cloudlb1002 (172.20.2.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:00:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:03:28] (03CR) 10Scott French: [C:03+1] "Thanks Valentin! This looks structurally good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1225598 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [16:04:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P87377 and previous config saved to /var/cache/conftool/dbconfig/20260112-160404-marostegui.json [16:08:01] (03PS2) 10Bking: elasticsearch: cleanup unused roles/profiles after migration to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1224691 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel) [16:10:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:11:05] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb1002.eqiad.wmnet with reason: host reimage [16:12:11] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224691 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel) [16:12:27] (03PS1) 10Btullis: Add three new dse-k8s-workers in eqiad to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1225601 (https://phabricator.wikimedia.org/T414216) [16:14:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P87378 and previous config saved to /var/cache/conftool/dbconfig/20260112-161412-marostegui.json [16:15:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb1002.eqiad.wmnet with reason: host reimage [16:16:12] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-test-druid1001.eqiad.wmnet with OS bookworm [16:16:17] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage 
(exit_code=97) for host an-test-druid1001.eqiad.wmnet with OS bookworm [16:17:59] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1225601 (https://phabricator.wikimedia.org/T414216) (owner: 10Btullis) [16:22:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T413525)', diff saved to https://phabricator.wikimedia.org/P87379 and previous config saved to /var/cache/conftool/dbconfig/20260112-162204-marostegui.json [16:22:08] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [16:23:44] (03CR) 10Brouberol: [C:03+1] Add three new dse-k8s-workers in eqiad to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1225601 (https://phabricator.wikimedia.org/T414216) (owner: 10Btullis) [16:24:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T413525)', diff saved to https://phabricator.wikimedia.org/P87380 and previous config saved to /var/cache/conftool/dbconfig/20260112-162421-marostegui.json [16:24:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [16:24:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2174 (T413525)', diff saved to https://phabricator.wikimedia.org/P87381 and previous config saved to /var/cache/conftool/dbconfig/20260112-162445-marostegui.json [16:24:55] (03PS2) 10Seawolf35gerrit: ukwiki: Various changes to user rights. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225596 (https://phabricator.wikimedia.org/T414277) [16:27:02] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:27:45] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:30:05] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T1630). Please do the needful. [16:30:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [16:32:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P87382 and previous config saved to /var/cache/conftool/dbconfig/20260112-163212-marostegui.json [16:33:09] !log urbanecm@deploy2002 mwscript-k8s job started: GrowthExperiments:revalidateLinkRecommendations.php --wiki=plwiki --all --verbose # revalidate-addlink [16:34:05] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:34:11] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:35:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [16:35:42] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb1002.eqiad.wmnet with OS trixie [16:39:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-d5-eqiad and cloudlb1002 (172.20.2.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:42:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P87383 and previous config saved to /var/cache/conftool/dbconfig/20260112-164220-marostegui.json [16:42:31] (03PS1) 10Ahmon Dancy: Yubikey-SSH-FIDO: add second new key for dancy [puppet] - 10https://gerrit.wikimedia.org/r/1225609 (https://phabricator.wikimedia.org/T414032) [16:43:26] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11513157 (10VRiley-WMF) Opened a ticket and ordered a part. Dell ticket is 464471567 [16:44:09] (03CR) 10Bking: [C:03+2] "The build failure actually proves that these roles/profiles are not used by any existing hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1224691 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel) [16:44:45] (03PS1) 10Elukey: sre.hosts.reimage: avoid checking self.identifier for VMs [cookbooks] - 10https://gerrit.wikimedia.org/r/1225611 [16:46:49] (03PS3) 10Gehel: chore(elasticsearch): cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607) [16:47:16] (03PS4) 10Bking: elasticsearch: cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel) [16:47:20] (03CR) 10CI reject: [V:04-1] elasticsearch: cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel) [16:47:46] (03CR) 10CI reject: [V:04-1] elasticsearch: cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel) [16:48:50] (03PS5) 10Bking: elasticsearch: cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607) (owner: 
10Gehel) [16:49:49] (03CR) 10Bking: [C:03+2] elasticsearch: cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel) [16:52:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T413525)', diff saved to https://phabricator.wikimedia.org/P87384 and previous config saved to /var/cache/conftool/dbconfig/20260112-165229-marostegui.json [16:52:34] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [16:52:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1218.eqiad.wmnet with reason: Maintenance [16:52:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1218 (T413525)', diff saved to https://phabricator.wikimedia.org/P87385 and previous config saved to /var/cache/conftool/dbconfig/20260112-165253-marostegui.json [16:54:07] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on kafka-main1008:9290 - https://phabricator.wikimedia.org/T414344#11513199 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Checked the power supply and everything seems to be nominal. This seems to be a false alarm. Closing this fo... [16:56:26] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11513205 (10elukey) 05Open→03Resolved a:03elukey The issue should be fixed now thanks to https://gerrit.wikimedia.org/r/c/o... [16:57:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T413525)', diff saved to https://phabricator.wikimedia.org/P87386 and previous config saved to /var/cache/conftool/dbconfig/20260112-165718-marostegui.json [16:57:35] (03CR) 10Scott French: "Nice! 
There will invariably be some things we need to tweak in the script, but this provides us with a solid foundation to do that with co" [puppet] - 10https://gerrit.wikimedia.org/r/1225536 (https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto) [16:58:17] (03PS2) 10Gehel: chore(elasticsearch): remove references to elasticsearch for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/1224712 (https://phabricator.wikimedia.org/T388607) [17:00:49] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1225609 (https://phabricator.wikimedia.org/T414032) (owner: 10Ahmon Dancy) [17:00:52] (03CR) 10Muehlenhoff: [C:03+2] Yubikey-SSH-FIDO: add second new key for dancy [puppet] - 10https://gerrit.wikimedia.org/r/1225609 (https://phabricator.wikimedia.org/T414032) (owner: 10Ahmon Dancy) [17:01:48] (03CR) 10JHathaway: [C:03+1] sre.hosts.reimage: avoid checking self.identifier for VMs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1225611 (owner: 10Elukey) [17:04:01] (03PS2) 10Elukey: sre.hosts.reimage: avoid checking self.identifier for VMs [cookbooks] - 10https://gerrit.wikimedia.org/r/1225611 [17:04:43] (03PS1) 10Jgiannelos: ProofreadPage: Disable flag to rennder using parsoid temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225613 [17:05:05] (03CR) 10Elukey: sre.hosts.reimage: avoid checking self.identifier for VMs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1225611 (owner: 10Elukey) [17:05:10] (03PS2) 10Jgiannelos: ProofreadPage: Disable flag to render using parsoid temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225613 (https://phabricator.wikimedia.org/T406088) [17:06:38] jouncebot nowandnext [17:06:38] No deployments scheduled for the next 0 hour(s) and 53 minute(s) [17:06:38] In 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T1800) 
[17:06:38] In 0 hour(s) and 53 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T1800) [17:07:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P87387 and previous config saved to /var/cache/conftool/dbconfig/20260112-170727-marostegui.json [17:09:00] (03PS3) 10Jgiannelos: ProofreadPage: Disable flag to render using parsoid temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225613 (https://phabricator.wikimedia.org/T408915) [17:09:04] (03CR) 10DCausse: "Thanks for taking a look!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: 10DCausse) [17:13:46] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11513264 (10Ahoelzl) Approved. [17:15:46] 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11513271 (10DSantamaria) Approved! 
[17:17:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P87388 and previous config saved to /var/cache/conftool/dbconfig/20260112-171735-marostegui.json [17:18:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219939 (https://phabricator.wikimedia.org/T412975) (owner: 10Thcipriani) [17:19:41] (03Merged) 10jenkins-bot: Beta: update mx host ip [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219939 (https://phabricator.wikimedia.org/T412975) (owner: 10Thcipriani) [17:23:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T413525)', diff saved to https://phabricator.wikimedia.org/P87389 and previous config saved to /var/cache/conftool/dbconfig/20260112-172259-marostegui.json [17:23:04] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [17:24:02] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: avoid checking self.identifier for VMs [cookbooks] - 10https://gerrit.wikimedia.org/r/1225611 (owner: 10Elukey) [17:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:24:24] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: avoid checking self.identifier for VMs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1225611 (owner: 10Elukey) [17:25:43] 06SRE, 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11513325 (10trueg) ok, I was not aware that there are two grafana instances. I was trying to log into the ro one.
I can perfectly log into the rw one. Sorry for the noise, [17:27:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T413525)', diff saved to https://phabricator.wikimedia.org/P87390 and previous config saved to /var/cache/conftool/dbconfig/20260112-172744-marostegui.json [17:28:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [17:28:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2176 (T413525)', diff saved to https://phabricator.wikimedia.org/P87391 and previous config saved to /var/cache/conftool/dbconfig/20260112-172808-marostegui.json [17:28:13] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [17:28:52] (03PS1) 10Ahmon Dancy: /home/dancy/src/wmf/operations/puppet/hieradata/cloud/eqiad1/deployment-prep/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1225620 (https://phabricator.wikimedia.org/T412975) [17:29:24] (03CR) 10CI reject: [V:04-1] /home/dancy/src/wmf/operations/puppet/hieradata/cloud/eqiad1/deployment-prep/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1225620 (https://phabricator.wikimedia.org/T412975) (owner: 10Ahmon Dancy) [17:30:36] (03PS2) 10Ahmon Dancy: deployment-prep common.yaml: Update mediawiki_smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/1225620 (https://phabricator.wikimedia.org/T412975) [17:33:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P87392 and previous config saved to /var/cache/conftool/dbconfig/20260112-173308-marostegui.json [17:40:55] 06SRE, 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11513413 (10Dzahn) 05Open→03Resolved a:03Dzahn ah! cool. 
glad it works [17:41:08] (03Abandoned) 10Dzahn: mx/spamassassin: allow overriding sa daemon package name in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: 10Dzahn) [17:43:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P87393 and previous config saved to /var/cache/conftool/dbconfig/20260112-174316-marostegui.json [17:53:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T413525)', diff saved to https://phabricator.wikimedia.org/P87394 and previous config saved to /var/cache/conftool/dbconfig/20260112-175324-marostegui.json [17:53:29] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [17:53:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance [17:53:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1219 (T413525)', diff saved to https://phabricator.wikimedia.org/P87395 and previous config saved to /var/cache/conftool/dbconfig/20260112-175349-marostegui.json [17:54:53] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:55:53] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Consider sessionJwt cookie for JWT validation purposes [puppet] - 10https://gerrit.wikimedia.org/r/1225598 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [17:59:44] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T1800) [18:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T1800). [18:02:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T413525)', diff saved to https://phabricator.wikimedia.org/P87396 and previous config saved to /var/cache/conftool/dbconfig/20260112-180247-marostegui.json [18:02:51] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:09:25] (03PS1) 10Ahmon Dancy: data.yaml: Drop old ssh key for adancy@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1225628 (https://phabricator.wikimedia.org/T414032) [18:10:34] dancy: want that deployed? [18:10:54] Sure! I should be ready to go. [18:11:06] (03CR) 10Dzahn: [C:03+2] data.yaml: Drop old ssh key for adancy@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1225628 (https://phabricator.wikimedia.org/T414032) (owner: 10Ahmon Dancy) [18:12:14] dancy: running puppet on bastion hosts. let us know if anything unexpected [18:12:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P87397 and previous config saved to /var/cache/conftool/dbconfig/20260112-181255-marostegui.json [18:15:34] Thanks mutante! 
[18:16:01] 06SRE, 10SRE-Access-Requests, 06Release-Engineering-Team, 13Patch-For-Review: Add yubikey ssh key for dancy - https://phabricator.wikimedia.org/T414032#11513565 (10Dzahn) [18:16:40] 06SRE, 10SRE-Access-Requests, 06Release-Engineering-Team, 13Patch-For-Review: Add yubikey ssh key for dancy - https://phabricator.wikimedia.org/T414032#11513566 (10dancy) 05Open→03Resolved [18:17:40] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb1001.eqiad.wmnet with OS trixie [18:19:05] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:20:25] (03PS1) 10DDesouza: Undeploy Safety survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225631 (https://phabricator.wikimedia.org/T413022) [18:20:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:23:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P87398 and previous config saved to /var/cache/conftool/dbconfig/20260112-182303-marostegui.json [18:24:27] (03PS1) 10Vgutierrez: cache::haproxy: Support sxp JWT field [puppet] - 10https://gerrit.wikimedia.org/r/1225633 (https://phabricator.wikimedia.org/T400238) [18:24:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T413525)', diff saved to https://phabricator.wikimedia.org/P87399 and previous config saved to /var/cache/conftool/dbconfig/20260112-182431-marostegui.json [18:24:36] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:26:13] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [18:26:49] (03CR) 10Scott French: [C:03+1] "Thanks, Valentin!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1225633 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [18:27:34] 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11513615 (10Ragesoss) @Joe checking my Sentry logs, I see we're still getting 429 for some types of queries, including Commons API queries and fetching page content (... [18:28:43] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Support sxp JWT field [puppet] - 10https://gerrit.wikimedia.org/r/1225633 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [18:28:58] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:29:05] (03CR) 10Scott French: [C:03+1] "Thanks for tracking this down!" [puppet] - 10https://gerrit.wikimedia.org/r/1225526 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [18:29:08] (03CR) 10Subramanya Sastry: [C:03+1] ProofreadPage: Disable flag to render using parsoid temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225613 (https://phabricator.wikimedia.org/T408915) (owner: 10Jgiannelos) [18:30:36] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1365 [18:30:37] !log vriley@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker1365 [18:30:49] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1365 [18:30:49] !log vriley@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker1365 [18:33:09] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb1001.eqiad.wmnet with reason: host reimage [18:33:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T413525)', diff saved to https://phabricator.wikimedia.org/P87400 and previous config saved to 
/var/cache/conftool/dbconfig/20260112-183311-marostegui.json [18:33:17] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:33:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2188.codfw.wmnet with reason: Maintenance [18:33:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T413525)', diff saved to https://phabricator.wikimedia.org/P87401 and previous config saved to /var/cache/conftool/dbconfig/20260112-183336-marostegui.json [18:34:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P87402 and previous config saved to /var/cache/conftool/dbconfig/20260112-183440-marostegui.json [18:35:25] (03CR) 10Scott French: [C:03+1] "Thanks, Blake! The best kind of patch :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1225500 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [18:38:31] (03PS1) 10Vgutierrez: cache::haproxy: Fix X-JWT-Sub value [puppet] - 10https://gerrit.wikimedia.org/r/1225636 (https://phabricator.wikimedia.org/T400238) [18:39:20] (03CR) 10CDanis: [C:03+1] cache::haproxy: Fix X-JWT-Sub value [puppet] - 10https://gerrit.wikimedia.org/r/1225636 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [18:39:31] (03CR) 10Scott French: [C:03+1] cache::haproxy: Fix X-JWT-Sub value [puppet] - 10https://gerrit.wikimedia.org/r/1225636 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [18:39:42] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb1001.eqiad.wmnet with reason: host reimage [18:40:05] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [18:40:33] (03CR) 10Bking: [C:03+2] chore(elasticsearch): remove references to elasticsearch for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/1224712 (https://phabricator.wikimedia.org/T388607) (owner: 
10Gehel) [18:40:39] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Fix X-JWT-Sub value [puppet] - 10https://gerrit.wikimedia.org/r/1225636 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [18:41:19] ok to merge Brian King: chore(elasticsearch): remove references to elasticsearch for cloudelastic (8c4203165e)? [18:41:30] vgutierrez yes, thanks! [18:41:33] thx [18:42:46] (03CR) 10Bking: [C:03+2] chore(elasticsearch): cloudelastic1001-1004 have been decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/1224713 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel) [18:43:42] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1366 - vriley@cumin1003" [18:43:46] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1366 - vriley@cumin1003" [18:43:46] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:44:00] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1366 [18:44:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1366 [18:44:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P87403 and previous config saved to /var/cache/conftool/dbconfig/20260112-184448-marostegui.json [18:44:49] (03CR) 10Bking: [C:03+2] chore(elasticsearch): remove references to elasticsearch for cluster config [puppet] - 10https://gerrit.wikimedia.org/r/1224714 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel) [18:45:00] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1366.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:49:48] (03PS1) 
10Bking: cirrussearch: remove defunct regexes [puppet] - 10https://gerrit.wikimedia.org/r/1225639 (https://phabricator.wikimedia.org/T388607) [18:50:05] got a couple of reports of loading issues & frontend errors from european users [18:50:49] (03PS2) 10Bking: cirrussearch: remove defunct regexes [puppet] - 10https://gerrit.wikimedia.org/r/1225639 (https://phabricator.wikimedia.org/T388607) [18:51:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1225639 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [18:52:29] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1366.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:52:42] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [18:53:31] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1366.eqiad.wmnet with OS trixie [18:53:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11513719 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1366.eqiad.wmnet with OS trixie [18:54:23] AntiComposite: thanks, do you have any more detail? 
[18:54:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T413525)', diff saved to https://phabricator.wikimedia.org/P87404 and previous config saved to /var/cache/conftool/dbconfig/20260112-185456-marostegui.json [18:55:01] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:55:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1232.eqiad.wmnet with reason: Maintenance [18:55:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1232 (T413525)', diff saved to https://phabricator.wikimedia.org/P87405 and previous config saved to /var/cache/conftool/dbconfig/20260112-185521-marostegui.json [18:55:41] cdanis, At least one was a ratelimit error, (3231bf1), trying to see if I can get more than that [18:56:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11513724 (10VRiley-WMF) [18:57:07] apparently all the errors are ratelimits, and all (now 3) reporting users are Dutch [18:58:04] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:00:07] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb1001.eqiad.wmnet with OS trixie [19:00:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:01:12] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [19:01:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T413525)', diff saved to https://phabricator.wikimedia.org/P87406 and previous config saved to 
/var/cache/conftool/dbconfig/20260112-190144-marostegui.json [19:01:48] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:04:18] AntiComposite: thanks, we are pretty sure we're rolling out a fix rn [19:04:27] (03PS3) 10Bking: cirrussearch: remove defunct regexes [puppet] - 10https://gerrit.wikimedia.org/r/1225639 (https://phabricator.wikimedia.org/T388607) [19:04:36] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1366.eqiad.wmnet with reason: host reimage [19:04:44] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1367 - vriley@cumin1003" [19:04:48] (03CR) 10BryanDavis: mailman: add UpstreamTlsContext on tlsproxy::envoy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219770 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [19:04:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1367 - vriley@cumin1003" [19:04:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:05:03] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1367 [19:05:20] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1367 [19:05:50] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1367.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:06:29] (03CR) 10BryanDavis: "Poke. The lack of this hiera expansion just caused T414304 via I8f5a1b9221d2f6b644f9f1956a349326b730873a. 
Can we look at this again as a w" [puppet] - 10https://gerrit.wikimedia.org/r/702326 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [19:07:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11513775 (10VRiley-WMF) [19:09:11] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1225639 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [19:09:30] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1366.eqiad.wmnet with reason: host reimage [19:10:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11513789 (10VRiley-WMF) [19:11:43] AntiComposite: should be resolved, please feel free to ping me if you hear about any more of those after now [19:11:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P87407 and previous config saved to /var/cache/conftool/dbconfig/20260112-191152-marostegui.json [19:11:55] (03PS1) 10Jforrester: [WIP] Defensively set Abstract Wikipedia feature flags to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225649 (https://phabricator.wikimedia.org/T411690) [19:12:30] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on kafka-main1008:9290 - https://phabricator.wikimedia.org/T414344#11513796 (10Jclark-ctr) This was open as duplicate when My RMA PSU was added to server without power. T414101 Timing matches when psu inserted. Verified Idrac nothing... 
[19:12:37] 10ops-eqiad, 06SRE, 06DC-Ops: Failed Power supply on kafka-main1008 - https://phabricator.wikimedia.org/T414101#11513799 (10Jclark-ctr) [19:12:40] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on kafka-main1008:9290 - https://phabricator.wikimedia.org/T414344#11513802 (10Jclark-ctr) →14Duplicate dup:03T414101 [19:12:56] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1367.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:14:27] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1367.eqiad.wmnet with OS trixie [19:14:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1191 - https://phabricator.wikimedia.org/T411209#11513804 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [19:14:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11513806 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1367.eqiad.wmnet with OS trixie [19:14:45] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [19:14:55] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11513808 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [19:15:25] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994#11513809 (10KFrancis) Hi all, the NDA is complete. Thanks! 
[19:18:45] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [19:22:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P87408 and previous config saved to /var/cache/conftool/dbconfig/20260112-192201-marostegui.json [19:22:15] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1368 - vriley@cumin1003" [19:22:20] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1368 - vriley@cumin1003" [19:22:20] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:24:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T413525)', diff saved to https://phabricator.wikimedia.org/P87409 and previous config saved to /var/cache/conftool/dbconfig/20260112-192401-marostegui.json [19:24:05] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:25:35] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [19:25:48] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1367.eqiad.wmnet with reason: host reimage [19:26:06] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [19:26:13] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1366.eqiad.wmnet with OS trixie [19:26:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - 
https://phabricator.wikimedia.org/T408760#11513866 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1366.eqiad.wmnet with OS trixie completed: - wikikub... [19:26:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11513867 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1366.eqiad.wmnet with OS trixie executed with errors... [19:27:10] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1368 [19:27:24] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1368 [19:29:23] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1368.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:29:43] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1367.eqiad.wmnet with reason: host reimage [19:32:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T413525)', diff saved to https://phabricator.wikimedia.org/P87411 and previous config saved to /var/cache/conftool/dbconfig/20260112-193209-marostegui.json [19:32:13] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:32:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2202.codfw.wmnet with reason: Maintenance [19:33:56] 10ops-codfw, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q2): Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229#11513890 (10Jhancock.wm) @herron two things - do you mind if i rack this in the new expansion cage at codfw? - can you add this one and the mwlog1003 in T412230 to... 
[19:34:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P87412 and previous config saved to /var/cache/conftool/dbconfig/20260112-193409-marostegui.json [19:34:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374 (10RobH) 03NEW [19:35:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11513905 (10Jhancock.wm) @Clement_Goubert the servers landed last week. Gonna start unpacking them tomorrow or wednesday. Are they able to be racked int he E/F? [19:35:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11513912 (10RobH) [19:36:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11513913 (10VRiley-WMF) [19:36:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1368.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:39:53] 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for hmonroy - https://phabricator.wikimedia.org/T414375#11513931 (10HMonroy) [19:40:43] 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for hmonroy - https://phabricator.wikimedia.org/T414375#11513933 (10HMonroy) [19:41:31] PROBLEM - MariaDB Replica Lag: m2 on db1217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 21665.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:43:32] 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for hmonroy - https://phabricator.wikimedia.org/T414375#11513941 (10HMonroy) [19:44:18] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P87413 and previous config saved to /var/cache/conftool/dbconfig/20260112-194417-marostegui.json [19:46:12] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11513947 (10Jclark-ctr) @Clement_Goubert Could you add these to site.pp file additionally? [19:46:12] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [19:49:17] vriley@cumin1003 reimage (PID 1099119) is awaiting input [19:49:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [19:49:37] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1367.eqiad.wmnet with OS trixie [19:49:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11513959 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1367.eqiad.wmnet with OS trixie completed: - wikikub... [19:49:55] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-22] - https://phabricator.wikimedia.org/T414216#11513961 (10Jclark-ctr) @BTullis I see these servers in preseed, but when I check site.pp, they haven’t been merged or were removed? 
[19:50:14] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1368.eqiad.wmnet with OS trixie [19:50:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11513980 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1368.eqiad.wmnet with OS trixie [19:52:31] RECOVERY - MariaDB Replica Lag: m2 on db1217 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:54:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T413525)', diff saved to https://phabricator.wikimedia.org/P87414 and previous config saved to /var/cache/conftool/dbconfig/20260112-195425-marostegui.json [19:54:30] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:54:32] (03PS1) 10DDesouza: Undeploy Safety survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225664 (https://phabricator.wikimedia.org/T413022) [19:54:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1234.eqiad.wmnet with reason: Maintenance [19:54:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1234 (T413525)', diff saved to https://phabricator.wikimedia.org/P87415 and previous config saved to /var/cache/conftool/dbconfig/20260112-195450-marostegui.json [19:56:00] (03Abandoned) 10DDesouza: Undeploy Safety survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225631 (https://phabricator.wikimedia.org/T413022) (owner: 10DDesouza) [19:56:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225664 
(https://phabricator.wikimedia.org/T413022) (owner: 10DDesouza) [19:57:02] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [19:57:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2203.codfw.wmnet with reason: Maintenance [19:57:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2203 (T413525)', diff saved to https://phabricator.wikimedia.org/P87416 and previous config saved to /var/cache/conftool/dbconfig/20260112-195731-marostegui.json [20:00:36] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1369 - vriley@cumin1003" [20:00:40] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1369 - vriley@cumin1003" [20:00:41] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:00:55] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1369 [20:01:11] PROBLEM - MariaDB Replica Lag: m2 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:01:21] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1368.eqiad.wmnet with reason: host reimage [20:01:22] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1369 [20:01:51] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1369.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:04:23] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1368.eqiad.wmnet with reason: host reimage [20:05:07] (03CR) 10JHathaway: sre.hosts.reimage: avoid 
checking self.identifier for VMs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1225611 (owner: 10Elukey) [20:08:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11514017 (10VRiley-WMF) [20:09:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1369.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:10:38] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1369.eqiad.wmnet with OS trixie [20:10:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11514029 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1369.eqiad.wmnet with OS trixie [20:15:22] (03CR) 10Andriy.v: [C:03+1] ukwiki: Various changes to user rights. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225596 (https://phabricator.wikimedia.org/T414277) (owner: 10Seawolf35gerrit) [20:16:11] RECOVERY - MariaDB Replica Lag: m2 on db2160 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:17:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11514052 (10Jhancock.wm) @ssingh do you need assistance getting these reimaged? 
[20:19:59] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:21:30] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1369.eqiad.wmnet with reason: host reimage [20:23:04] vriley@cumin1003 reimage (PID 1107184) is awaiting input [20:25:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T413525)', diff saved to https://phabricator.wikimedia.org/P87417 and previous config saved to /var/cache/conftool/dbconfig/20260112-202512-marostegui.json [20:25:17] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [20:25:25] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1369.eqiad.wmnet with reason: host reimage [20:25:29] (03PS1) 10Ayounsi: Release v0.11.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1225671 [20:28:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T413525)', diff saved to https://phabricator.wikimedia.org/P87418 and previous config saved to /var/cache/conftool/dbconfig/20260112-202823-marostegui.json [20:28:38] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [20:29:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11514105 (10VRiley-WMF) [20:30:56] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:30:58] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1368.eqiad.wmnet with OS trixie [20:31:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - 
https://phabricator.wikimedia.org/T408760#11514108 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1368.eqiad.wmnet with OS trixie completed: - wikikub... [20:33:09] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1370 - vriley@cumin1003" [20:33:13] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1370 - vriley@cumin1003" [20:33:13] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:34:11] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:35:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P87419 and previous config saved to /var/cache/conftool/dbconfig/20260112-203521-marostegui.json [20:36:56] FIRING: MaxConntrack: Max conntrack at 99.99% on ganeti3005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [20:37:37] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971#11514147 (10TheDJ) >>! In T412971#11498230, @AntiCompositeNumber wrote: > Special:NewFiles doesn't appear to be as bad as it was a few ye... 
[20:38:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P87420 and previous config saved to /var/cache/conftool/dbconfig/20260112-203831-marostegui.json [20:38:32] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [20:41:15] (03CR) 10Btullis: [C:03+1] cirrussearch: remove defunct regexes [puppet] - 10https://gerrit.wikimedia.org/r/1225639 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [20:41:56] RESOLVED: MaxConntrack: Max conntrack at 97.67% on ganeti3005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [20:41:56] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1371 - vriley@cumin1003" [20:42:01] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1371 - vriley@cumin1003" [20:42:01] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:42:32] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1370 [20:42:34] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:42:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1370 [20:42:56] FIRING: MaxConntrack: Max conntrack at 99.99% on ganeti3005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [20:43:13] 
!log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1370 [20:43:19] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1370 [20:44:01] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:44:45] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1371 [20:44:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1371 [20:45:10] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:45:11] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1369.eqiad.wmnet with OS trixie [20:45:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11514191 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1369.eqiad.wmnet with OS trixie completed: - wikikub... 
[20:45:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P87421 and previous config saved to /var/cache/conftool/dbconfig/20260112-204529-marostegui.json [20:47:33] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1371.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:47:36] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:47:56] RESOLVED: MaxConntrack: Max conntrack at 99.99% on ganeti3005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [20:48:29] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:48:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P87422 and previous config saved to /var/cache/conftool/dbconfig/20260112-204840-marostegui.json [20:49:15] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1371.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:50:34] (03PS1) 10Clare Ming: Revert to `product_metrics` schemas and use `default` as the coordinator value [extensions/TestKitchen] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225675 (https://phabricator.wikimedia.org/T407901) [20:50:36] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:51:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/TestKitchen] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225675 (https://phabricator.wikimedia.org/T407901) (owner: 10Clare Ming) [20:53:14] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:55:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T413525)', diff saved to https://phabricator.wikimedia.org/P87423 and previous config saved to /var/cache/conftool/dbconfig/20260112-205537-marostegui.json [20:55:42] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [20:55:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1235.eqiad.wmnet with reason: Maintenance [20:56:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1235 (T413525)', diff saved to https://phabricator.wikimedia.org/P87424 and previous config saved to /var/cache/conftool/dbconfig/20260112-205602-marostegui.json [20:57:02] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [20:57:24] (03CR) 10CI reject: [V:04-1] Revert to `product_metrics` schemas and use `default` as the coordinator value [extensions/TestKitchen] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225675 (https://phabricator.wikimedia.org/T407901) (owner: 10Clare Ming) [20:57:37] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:57:50] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:58:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T413525)', diff saved to https://phabricator.wikimedia.org/P87425 and previous config saved to 
/var/cache/conftool/dbconfig/20260112-205848-marostegui.json [20:59:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2216.codfw.wmnet with reason: Maintenance [20:59:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T413525)', diff saved to https://phabricator.wikimedia.org/P87426 and previous config saved to /var/cache/conftool/dbconfig/20260112-205912-marostegui.json [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T2100). [21:00:05] cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:19] o/ [21:00:40] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1372 - vriley@cumin1003" [21:00:45] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1372 - vriley@cumin1003" [21:00:45] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:01:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [21:01:41] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1372 [21:01:46] vriley@cumin1003 provision (PID 1120184) is awaiting input [21:01:51] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1372 [21:02:13] 
(03Merged) 10jenkins-bot: Deploy TestKitchen to Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [21:02:32] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1217360|Deploy TestKitchen to Beta Cluster (T407806 T407805)]] [21:02:38] T407806: Rename Metrics Platform Extension to Test Kitchen - https://phabricator.wikimedia.org/T407806 [21:02:38] T407805: Rename mpic.wikimedia.org - https://phabricator.wikimedia.org/T407805 [21:02:42] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1372.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:04:43] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1372.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:05:56] (03PS1) 10Clare Ming: Revert "Deploy TestKitchen to Beta Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225677 [21:06:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225677 (owner: 10Clare Ming) [21:06:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225677 (owner: 10Clare Ming) [21:07:53] (03Merged) 10jenkins-bot: Revert "Deploy TestKitchen to Beta Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225677 (owner: 10Clare Ming) [21:08:13] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1225677|Revert "Deploy TestKitchen to Beta Cluster"]] [21:10:04] !log cjming@deploy2002 cjming: Backport for [[gerrit:1225677|Revert "Deploy TestKitchen to Beta Cluster"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:11:08] (03Abandoned) 10Clare Ming: Revert to `product_metrics` schemas and use `default` as the coordinator value [extensions/TestKitchen] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225675 (https://phabricator.wikimedia.org/T407901) (owner: 10Clare Ming) [21:11:32] (03Restored) 10Clare Ming: Revert to `product_metrics` schemas and use `default` as the coordinator value [extensions/TestKitchen] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225675 (https://phabricator.wikimedia.org/T407901) (owner: 10Clare Ming) [21:11:50] (03CR) 10Clare Ming: "recheck" [extensions/TestKitchen] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225675 (https://phabricator.wikimedia.org/T407901) (owner: 10Clare Ming) [21:12:23] (03PS1) 10Clare Ming: Revert^2 "Deploy TestKitchen to Beta Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225678 [21:15:16] !log cjming@deploy2002 cjming: Continuing with sync [21:18:34] (03Abandoned) 10Ebernhardson: dumps: Repoint cirrus dumps to new location [puppet] - 10https://gerrit.wikimedia.org/r/1223722 (https://phabricator.wikimedia.org/T366248) (owner: 10Ebernhardson) [21:19:26] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225677|Revert "Deploy TestKitchen to Beta Cluster"]] (duration: 11m 13s) [21:21:29] (03CR) 10Bking: [C:03+2] cirrussearch: remove defunct regexes [puppet] - 10https://gerrit.wikimedia.org/r/1225639 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [21:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:25:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for kareid - 
https://phabricator.wikimedia.org/T413364#11514379 (10KReid-WMF) Hi @Dzahn - the experimentation platform dashboards use private data, and as such I'll need to be part of the group to work on the dashbo... [21:26:19] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:27:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T413525)', diff saved to https://phabricator.wikimedia.org/P87427 and previous config saved to /var/cache/conftool/dbconfig/20260112-212706-marostegui.json [21:27:10] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [21:30:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T413525)', diff saved to https://phabricator.wikimedia.org/P87428 and previous config saved to /var/cache/conftool/dbconfig/20260112-213041-marostegui.json [21:31:43] (03PS2) 10Clare Ming: Revert^2 "Deploy TestKitchen to Beta Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225678 [21:32:32] (03CR) 10CI reject: [V:04-1] Revert^2 "Deploy TestKitchen to Beta Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225678 (owner: 10Clare Ming) [21:35:47] (03PS3) 10Clare Ming: Revert^2 "Deploy TestKitchen to Beta Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225678 [21:37:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P87429 and previous config saved to /var/cache/conftool/dbconfig/20260112-213714-marostegui.json [21:40:45] PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [21:40:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to 
https://phabricator.wikimedia.org/P87430 and previous config saved to /var/cache/conftool/dbconfig/20260112-214049-marostegui.json [21:41:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225678 (owner: 10Clare Ming) [21:41:35] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sretest2009.codfw.wmnet with reason: reboot [21:42:42] (03CR) 10Santiago Faci: [C:03+1] Revert^2 "Deploy TestKitchen to Beta Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225678 (owner: 10Clare Ming) [21:42:45] RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.014 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [21:43:11] !log cwhite@deploy2002 Started deploy [statsv/statsv@b935e2d]: T389469 [21:43:15] T389469: No metrics from JS arriving in Prometheus/Graphite since around 11:48 UTC Wed. 
2025-03-19 - https://phabricator.wikimedia.org/T389469 [21:43:20] (03CR) 10Clare Ming: "recheck" [extensions/TestKitchen] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225675 (https://phabricator.wikimedia.org/T407901) (owner: 10Clare Ming) [21:43:20] !log cwhite@deploy2002 Finished deploy [statsv/statsv@b935e2d]: T389469 (duration: 00m 09s) [21:47:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P87431 and previous config saved to /var/cache/conftool/dbconfig/20260112-214723-marostegui.json [21:50:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P87432 and previous config saved to /var/cache/conftool/dbconfig/20260112-215056-marostegui.json [21:57:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T413525)', diff saved to https://phabricator.wikimedia.org/P87433 and previous config saved to /var/cache/conftool/dbconfig/20260112-215731-marostegui.json [21:57:35] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [21:57:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [22:00:04] Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T2200) [22:01:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T413525)', diff saved to https://phabricator.wikimedia.org/P87434 and previous config saved to /var/cache/conftool/dbconfig/20260112-220104-marostegui.json [22:02:25] (03CR) 10Kosta Harlan: [C:03+1] Undeploy Safety survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225664 (https://phabricator.wikimedia.org/T413022) (owner: 10DDesouza) [22:15:22] !log apt1002# reprepro --noskipold --restrict vopsbot update bookworm-wikimedia [22:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:05] so the backport window was just me running around in circles -- after reverting/aborting a config patch to beta, I realized we need to backport some changes first - https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TestKitchen/+/1225675 keeps failing with a seemingly unrelated test -- does anyone know why and/or how to fix? 
[22:18:16] looks like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OATHAuth/+/1224970 might need backporting [22:19:10] ugh - thank you - i think that's it [22:19:11] (03PS1) 10Zabe: tests: skip test when WebAuthn is not loaded [extensions/OATHAuth] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225687 (https://phabricator.wikimedia.org/T407797) [22:19:24] (03PS2) 10Zabe: Revert to `product_metrics` schemas and use `default` as the coordinator value [extensions/TestKitchen] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225675 (https://phabricator.wikimedia.org/T407901) (owner: 10Clare Ming) [22:22:41] (03CR) 10Clare Ming: [C:03+1] tests: skip test when WebAuthn is not loaded [extensions/OATHAuth] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225687 (https://phabricator.wikimedia.org/T407797) (owner: 10Zabe) [22:23:42] !incidents [22:23:42] No incidents occurred in the past 24 hours for team SRE [22:24:02] is the security team using the window? i'd love to finish some backports/config patches if not [22:24:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [22:24:43] ChrisDobbins901_, swfrench-wmf: about to fire off the test page I mentioned to you both about, no action needed [22:24:47] s/about// [22:24:57] ack [22:25:18] does anyone have an issue if i run a few more backports/config changes? [22:25:50] !ack [22:25:50] 7325 (ACKED) Manual (paged) by RLazarus (rlazarus@wikimedia.org): vopsbot test page, please ignore [22:26:07] !resolve [22:26:07] \o/ [22:26:08] 7325 (RESOLVED) Manual (paged) by RLazarus (rlazarus@wikimedia.org): vopsbot test page, please ignore [22:26:15] cool, that should be all better [22:26:16] nice [22:26:25] thanks for fixing that! [22:26:33] sorry for breaking it! 
😅 [22:28:47] for the record, i don't consider silence === consent but I will backport some changes now if no one objects [22:30:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/OATHAuth] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225687 (https://phabricator.wikimedia.org/T407797) (owner: 10Zabe) [22:34:01] (03Merged) 10jenkins-bot: tests: skip test when WebAuthn is not loaded [extensions/OATHAuth] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225687 (https://phabricator.wikimedia.org/T407797) (owner: 10Zabe) [22:34:23] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1225687|tests: skip test when WebAuthn is not loaded (T407797)]] [22:34:27] T407797: Create a CI job to enforce tests to pass with solely required extensions - https://phabricator.wikimedia.org/T407797 [22:34:39] 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 13Patch-For-Review: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11514584 (10RKemper) Pushed out a patch to make the rebooting of hadoop workers smarter (namely, we can pass a cumin overr... [22:37:25] today is not my day [22:37:33] should i revert https://spiderpig.wikimedia.org/jobs/1170 ? 
[22:38:15] or rather https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OATHAuth/+/1225687 [22:39:42] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [22:39:43] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [22:41:06] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [22:41:06] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [22:47:31] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [22:47:32] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop analytics cluster [22:50:01] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [22:50:03] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1225687|tests: skip test when WebAuthn is not loaded (T407797)]] [22:50:07] T407797: Create a CI job to enforce tests to pass with solely required extensions - https://phabricator.wikimedia.org/T407797 [22:50:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1251.eqiad.wmnet with reason: Maintenance [22:50:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1251 (T413525)', diff saved to https://phabricator.wikimedia.org/P87435 and previous config saved to /var/cache/conftool/dbconfig/20260112-225015-marostegui.json [22:50:20] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [22:51:58] !log cjming@deploy2002 cjming, zabe: Backport for [[gerrit:1225687|tests: skip test when WebAuthn is not loaded (T407797)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:52:23] !log cjming@deploy2002 cjming, zabe: Continuing with sync [22:56:14] (03PS7) 10Ryan Kemper: hadoop.reboot-workers: make host override smarter [cookbooks] - 10https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568) [22:56:28] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225687|tests: skip test when WebAuthn is not loaded (T407797)]] (duration: 06m 25s) [22:56:31] T407797: Create a CI job to enforce tests to pass with solely required extensions - https://phabricator.wikimedia.org/T407797 [22:56:56] 06SRE, 06Data-Engineering (Q3 FY25/26 January 1st - March 31th): Move Druid realtime configuration out of Refinery into standalone repo on GitLab - https://phabricator.wikimedia.org/T407994#11514655 (10Ahoelzl) a:05amastilovic→03None [22:58:43] Web Team: if you're not using your window, I would like to do one more backport and one more config change [23:01:00] (03CR) 10CI reject: [V:04-1] hadoop.reboot-workers: make host override smarter [cookbooks] - 10https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568) (owner: 10Ryan Kemper) [23:01:09] i'll give it 5 minutes before diving into more backport/config changes [23:01:37] 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 13Patch-For-Review: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11514664 (10RKemper) Dry run logic looked good, rebooting the remaining an-worker hosts for real now [23:05:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225675 (https://phabricator.wikimedia.org/T407901) (owner: 10Clare Ming) [23:08:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87436 and previous config saved to 
/var/cache/conftool/dbconfig/20260112-230801-marostegui.json [23:08:07] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [23:08:08] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [23:14:16] (03Merged) 10jenkins-bot: Revert to `product_metrics` schemas and use `default` as the coordinator value [extensions/TestKitchen] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1225675 (https://phabricator.wikimedia.org/T407901) (owner: 10Clare Ming) [23:14:36] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1225675|Revert to `product_metrics` schemas and use `default` as the coordinator value (T407901)]] [23:14:40] T407901: Update Schema Events - https://phabricator.wikimedia.org/T407901 [23:16:26] !log cjming@deploy2002 cjming: Backport for [[gerrit:1225675|Revert to `product_metrics` schemas and use `default` as the coordinator value (T407901)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:16:49] !log cjming@deploy2002 cjming: Continuing with sync [23:18:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P87437 and previous config saved to /var/cache/conftool/dbconfig/20260112-231809-marostegui.json [23:20:49] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225675|Revert to `product_metrics` schemas and use `default` as the coordinator value (T407901)]] (duration: 06m 13s) [23:20:53] T407901: Update Schema Events - https://phabricator.wikimedia.org/T407901 [23:21:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T413525)', diff saved to https://phabricator.wikimedia.org/P87438 and previous config saved to /var/cache/conftool/dbconfig/20260112-232144-marostegui.json [23:21:49] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [23:22:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225678 (owner: 10Clare Ming) [23:23:56] (03Merged) 10jenkins-bot: Revert^2 "Deploy TestKitchen to Beta Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225678 (owner: 10Clare Ming) [23:24:15] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1225678|Revert^2 "Deploy TestKitchen to Beta Cluster"]] [23:26:08] !log cjming@deploy2002 cjming: Backport for [[gerrit:1225678|Revert^2 "Deploy TestKitchen to Beta Cluster"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:26:31] !log cjming@deploy2002 cjming: Continuing with sync [23:26:36] (03CR) 10Eevans: [C:03+1] Remove profile::puppet::agent::force_puppet7 from Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225525 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [23:28:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P87439 and previous config saved to /var/cache/conftool/dbconfig/20260112-232817-marostegui.json [23:30:29] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225678|Revert^2 "Deploy TestKitchen to Beta Cluster"]] (duration: 06m 14s) [23:31:41] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker2251.codfw.wmnet, wikikube-worker2035.codfw.wmnet, wikikube-worker2038.codfw.wmnet, wikikube-worker2088.codfw.wmnet, wikikube-worker2018.codfw.wmnet, wikikube-worker2328.codfw.wmnet, wikikube-worker2141.codfw.wmnet, wikikube-worker2292.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2124.codfw. 
[23:31:41] ikikube-worker2313.codfw.wmnet, wikikube-worker2144.codfw.wmnet, wikikube-worker2298.codfw.wmnet, wikikube-worker2202.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2127.codfw.wmnet, wikikube-worker2112.codfw.wmnet, wikikube-worker2317.codfw.wmnet, wikikube-worker2172.codfw.wmnet, wikikube-worker2108.codfw.wmnet, wikikube-worker2109.codfw.wmnet, wikikube-worker2189.codfw.wmnet, wikikube-worker2037.codfw.wmnet, wikikube-worke [23:31:41] dfw.wmnet, wikikube-worker2212.codfw.wmnet, wikikube-worker2113.codfw.wmnet, wikikube-worker2275.codfw.wmnet, wikikube-worker2185.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [23:31:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P87440 and previous config saved to /var/cache/conftool/dbconfig/20260112-233152-marostegui.json [23:31:57] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:32:14] !incidents [23:32:15] 7326 (UNACKED) ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw) [23:32:15] 7325 (RESOLVED) Manual (paged) by RLazarus (rlazarus@wikimedia.org): vopsbot test page, please ignore [23:32:28] !ack [23:32:29] 7326 (ACKED) ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw) [23:32:41] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker2262.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker2202.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2036.codfw.wmnet, 
wikikube-worker2280.codfw.wmnet, wikikube-worker2328.codfw.wmnet, wikikube-worker2150.codfw.wmnet, wikikube-worker2113.codfw. [23:32:41] ikikube-worker2136.codfw.wmnet, wikikube-worker2185.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2274.codfw.wmnet, wikikube-worker2132.codfw.wmnet, wikikube-worker2165.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2190.codfw.wmnet, wikikube-worker2177.codfw.wmnet, wikikube-worker2311.codfw.wmnet, wikikube-worker2287.codfw.wmnet, wikikube-worker2161.codfw.wmnet, wikikube-worker2198.codfw.wmnet, wikikube-worke [23:32:41] dfw.wmnet, wikikube-worker2297.codfw.wmnet, wikikube-worker2315.codfw.wmnet, wikikube-worker2322.codfw.wmnet, wikikube-worker2058.codfw.wmnet, wikikube-worker2314.codfw.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [23:32:45] I'm guessing we just ran a lot of deployments back to back? [23:33:52] I'm here, btw [23:33:56] yup, shellbox-video is fully unavailable: https://grafana.wikimedia.org/goto/VGZ4kj4DR?orgId=1 [23:34:11] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:34:12] i'm finally done - hope i didn't cause issues [23:34:34] ChrisDobbins901_: so, what's likely happening is that the large number of back-to-back deployments has increased the demand on shellbox-video for video transcodes [23:34:41] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:34:42] we should probably just scale it up [23:34:57] cjming: not your fault, just an unusual architectural shortcoming :) [23:35:07] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:35:08] Thanks, swfrench-wmf [23:35:10] gtk [23:35:41] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:35:41] ChrisDobbins901_: I'm going to try increasing the number of shellbox-video replicas in codfw. hopefully that should unblock things. [23:36:57] ack. is that a Puppet config change? [23:36:57] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:38:21] ChrisDobbins901_: so, it's a deployment-charts change to the helmfile values for the service. what I would likely do is edit the files manually on the deployment host and apply that to the live service, then once that looked good, I would post a proper patch to "lock in" that edit. 
[23:38:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87441 and previous config saved to /var/cache/conftool/dbconfig/20260112-233825-marostegui.json [23:38:31] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [23:38:32] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [23:38:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2245.codfw.wmnet with reason: Maintenance [23:38:46] however, it seems things are recovering without taking any action on my part, so I might hold and see how things are evolving :) [23:38:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2245 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87442 and previous config saved to /var/cache/conftool/dbconfig/20260112-233850-marostegui.json [23:39:01] gotcha and thank you for the explanation [23:40:09] ChrisDobbins901_: for reference, that would be this [0], which exists on the deployment hosts at `/srv/deployment-charts/helmfile.d/services/shellbox-video/values.yaml`. 
[23:40:10] [0] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/shellbox-video/values.yaml#60 [23:41:59] * swfrench-wmf hopes that some day we'll have HPA enabled on shellbox-video and not have to think about this [23:42:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P87443 and previous config saved to /var/cache/conftool/dbconfig/20260112-234201-marostegui.json [23:42:23] that sounds lovely [23:42:53] (HPA, or, videoscalerscaler) [23:44:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [23:46:51] but who scales the videoscalerscaler [23:47:36] :) [23:48:03] : [23:48:08] sorry, wrong window [23:50:47] (03PS1) 10DDesouza: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225709 (https://phabricator.wikimedia.org/T344471) [23:52:00] (03CR) 10DDesouza: [C:03+1] Undeploy Safety survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225664 (https://phabricator.wikimedia.org/T413022) (owner: 10DDesouza) [23:52:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T413525)', diff saved to https://phabricator.wikimedia.org/P87444 and previous config saved to /var/cache/conftool/dbconfig/20260112-235209-marostegui.json [23:52:14] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [23:52:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [23:53:58] jouncebot: nowandnext [23:53:58] For the next 0 hour(s) and 6 minute(s): Weekly 
Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260112T2200) [23:53:58] In 0 hour(s) and 6 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T0000) [23:54:29] zabe: if you could please hold on deploying (if that's what you have in mind) for a few minutes, that would be greatly appreciated. [23:54:36] ofc