[00:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:13:59] (PS1) Dzahn: gerrit: set java_home and migration user in repo Hiera [puppet] - https://gerrit.wikimedia.org/r/1026195 (https://phabricator.wikimedia.org/T363196)
[00:16:51] (PS1) Dzahn: devtools: update gerrit and phab instance names in default Hiera [puppet] - https://gerrit.wikimedia.org/r/1026197 (https://phabricator.wikimedia.org/T363196)
[00:17:28] (PS2) Dzahn: devtools: update gerrit and phab instance names in default Hiera [puppet] - https://gerrit.wikimedia.org/r/1026197 (https://phabricator.wikimedia.org/T363196)
[00:17:53] (PS3) Dzahn: mediawiki/geoip: make loading geoip data from puppetserver optional [puppet] - https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415)
[00:18:13] (CR) CI reject: [V:-1] mediawiki/geoip: make loading geoip data from puppetserver optional [puppet] - https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: Dzahn)
[00:19:08] (PS4) Dzahn: mediawiki/geoip: make loading geoip data from puppetserver optional [puppet] - https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415)
[00:24:00] SRE, serviceops, Patch-For-Review: upgrade deployment servers to bullseye / add bullseye support to puppet role - https://phabricator.wikimedia.org/T363415#9762658 (Dzahn) The gervert issue (can't find gerrit dsh group) appears to come from https://gerrit.wikimedia.org/r/c/operations/software/gerrit/...
[00:27:48] (03PS1) 10Dzahn: devtools: remove gervert from deployed repos in cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1026198 (https://phabricator.wikimedia.org/T363415) [00:30:14] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 148 probes of 732 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:32:30] FIRING: [2x] Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:35:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 38 probes of 732 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:52:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 97 probes of 732 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:52:30] RESOLVED: [2x] Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:56:18] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 101 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:00:27] FIRING: ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:02:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:03:53] RESOLVED: ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:12:14] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 37 probes of 732 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:16:16] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 42 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:23:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:42:30] FIRING: [2x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b3-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [01:45:56] 06SRE, 10SRE-swift-storage, 10Thumbor, 06Traffic: Cache thumbs in our caching infrastructure 
(e.g. ATS) - https://phabricator.wikimedia.org/T345334#9762680 (10tstarling) >>! In T345334#9752173, @Ladsgroup wrote: > That'd work on overall hits, as you said "sort images by popularity". That's not the case her... [02:15:41] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [02:38:54] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:58:54] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:02:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:45:27] FIRING: ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:48:53] RESOLVED: ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[04:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:27:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1181.eqiad.wmnet with reason: Maintenance [04:28:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1181.eqiad.wmnet with reason: Maintenance [04:29:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2099.codfw.wmnet with reason: Maintenance [04:29:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2099.codfw.wmnet with reason: Maintenance [04:30:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s7 T363892 [04:30:14] T363892: Switchover s7 master (db1181 -> db1236) - https://phabricator.wikimedia.org/T363892 [04:30:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1236 with weight 0 T363892', diff saved to https://phabricator.wikimedia.org/P61645 and previous config saved to /var/cache/conftool/dbconfig/20240502-043019-marostegui.json [04:30:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T363892 [04:31:23] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1236 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1025906 (https://phabricator.wikimedia.org/T363892) (owner: 10Gerrit maintenance bot) [04:34:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2162', diff saved to https://phabricator.wikimedia.org/P61646 and previous config saved to /var/cache/conftool/dbconfig/20240502-043403-root.json [04:34:05] (03PS1) 10Marostegui: db2162: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/1026250 [04:34:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2162.codfw.wmnet with reason: Reimage [04:34:35] (03CR) 10Marostegui: [C:03+2] db2162: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026250 (owner: 10Marostegui) [04:34:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2162.codfw.wmnet with reason: Reimage [04:35:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2162.codfw.wmnet with OS bookworm [04:39:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2136.codfw.wmnet with reason: Maintenance [04:40:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2136.codfw.wmnet with reason: Maintenance [04:40:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T361627)', diff saved to https://phabricator.wikimedia.org/P61647 and previous config saved to /var/cache/conftool/dbconfig/20240502-044020-marostegui.json [04:40:24] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [04:40:36] (03PS1) 10Marostegui: db1181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026251 [04:41:00] (03CR) 10Marostegui: [C:03+2] db1181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026251 (owner: 10Marostegui) [04:43:04] (03PS1) 10Marostegui: db1234: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1026252 [04:43:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [04:45:27] (03CR) 10Marostegui: [C:03+2] db1234: Remove 
package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1026252 (owner: 10Marostegui) [04:48:00] !log Starting s7 eqiad failover from db1181 to db1236 - T363892 [04:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:04] T363892: Switchover s7 master (db1181 -> db1236) - https://phabricator.wikimedia.org/T363892 [04:48:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T363892', diff saved to https://phabricator.wikimedia.org/P61648 and previous config saved to /var/cache/conftool/dbconfig/20240502-044819-marostegui.json [04:48:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1236 to s7 primary and set section read-write T363892', diff saved to https://phabricator.wikimedia.org/P61649 and previous config saved to /var/cache/conftool/dbconfig/20240502-044848-marostegui.json [04:49:14] (03CR) 10Marostegui: [C:03+2] wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1025907 (https://phabricator.wikimedia.org/T363892) (owner: 10Gerrit maintenance bot) [04:49:17] (03PS2) 10Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1025907 (https://phabricator.wikimedia.org/T363892) [04:49:49] (03CR) 10Marostegui: [V:03+2 C:03+2] wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1025907 (https://phabricator.wikimedia.org/T363892) (owner: 10Gerrit maintenance bot) [04:50:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1181 T363892', diff saved to https://phabricator.wikimedia.org/P61650 and previous config saved to /var/cache/conftool/dbconfig/20240502-045017-root.json [04:51:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T361627)', diff saved to https://phabricator.wikimedia.org/P61651 and previous config saved to /var/cache/conftool/dbconfig/20240502-045131-marostegui.json [04:51:35] T361627: Create cuc_agent_id, cule_agent_id and 
cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [04:52:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1181.eqiad.wmnet with OS bookworm [04:53:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2162.codfw.wmnet with reason: host reimage [04:55:20] (03CR) 10Marostegui: [C:03+1] "es6 and es7 are now set up although still not getting writes." [puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [04:55:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2162.codfw.wmnet with reason: host reimage [05:03:38] (03PS1) 10Marostegui: Revert "db2162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026138 [05:04:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: host reimage [05:04:24] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:04:44] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:06:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P61652 and previous config saved to /var/cache/conftool/dbconfig/20240502-050639-marostegui.json [05:06:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: host reimage [05:11:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 1%: Repooling', 
diff saved to https://phabricator.wikimedia.org/P61653 and previous config saved to /var/cache/conftool/dbconfig/20240502-051155-root.json [05:11:57] (03CR) 10Marostegui: [C:03+2] Revert "db2162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026138 (owner: 10Marostegui) [05:13:21] (03PS1) 10Marostegui: Revert "db1181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026139 [05:17:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2162.codfw.wmnet with OS bookworm [05:21:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [05:21:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [05:21:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P61654 and previous config saved to /var/cache/conftool/dbconfig/20240502-052146-marostegui.json [05:21:55] (03CR) 10Marostegui: [C:03+2] Revert "db1181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026139 (owner: 10Marostegui) [05:27:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61655 and previous config saved to /var/cache/conftool/dbconfig/20240502-052700-root.json [05:27:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:27:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1181.eqiad.wmnet with OS bookworm [05:29:07] (03PS1) 10Marostegui: db1246: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/1026259 (https://phabricator.wikimedia.org/T363119) [05:29:43] (03CR) 10Marostegui: [C:03+2] db1246: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026259 (https://phabricator.wikimedia.org/T363119) (owner: 10Marostegui) [05:36:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T361627)', diff saved to https://phabricator.wikimedia.org/P61656 and previous config saved to /var/cache/conftool/dbconfig/20240502-053654-marostegui.json [05:36:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2137.codfw.wmnet with reason: Maintenance [05:36:57] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [05:37:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2137.codfw.wmnet with reason: Maintenance [05:37:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T361627)', diff saved to https://phabricator.wikimedia.org/P61657 and previous config saved to /var/cache/conftool/dbconfig/20240502-053717-marostegui.json [05:42:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61658 and previous config saved to /var/cache/conftool/dbconfig/20240502-054206-root.json [05:42:30] FIRING: [2x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b3-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [05:44:01] (03PS1) 10Marostegui: db2152: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026261 [05:48:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet 
with reason: Maintenance [05:48:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [05:48:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T361627)', diff saved to https://phabricator.wikimedia.org/P61659 and previous config saved to /var/cache/conftool/dbconfig/20240502-054821-marostegui.json [05:48:24] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [05:57:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61660 and previous config saved to /var/cache/conftool/dbconfig/20240502-055712-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T0600) [06:00:05] kormat, marostegui, Amir1, and arnaudb: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T0600). 
[06:03:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:03:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P61661 and previous config saved to /var/cache/conftool/dbconfig/20240502-060328-marostegui.json
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:12:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61662 and previous config saved to /var/cache/conftool/dbconfig/20240502-061218-root.json
[06:15:41] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[06:18:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P61663 and previous config saved to /var/cache/conftool/dbconfig/20240502-061836-marostegui.json
[06:27:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61664 and previous config saved to /var/cache/conftool/dbconfig/20240502-062725-root.json [06:33:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T361627)', diff saved to https://phabricator.wikimedia.org/P61665 and previous config saved to /var/cache/conftool/dbconfig/20240502-063343-marostegui.json [06:33:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2139.codfw.wmnet with reason: Maintenance [06:33:47] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [06:33:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2139.codfw.wmnet with reason: Maintenance [06:42:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61666 and previous config saved to /var/cache/conftool/dbconfig/20240502-064230-root.json [06:45:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2147.codfw.wmnet with reason: Maintenance [06:45:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2147.codfw.wmnet with reason: Maintenance [06:45:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T361627)', diff saved to https://phabricator.wikimedia.org/P61667 and previous config saved to /var/cache/conftool/dbconfig/20240502-064533-marostegui.json [06:45:36] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [06:50:03] (03CR) 10Muehlenhoff: [C:04-1] "The UID for 
shell must be identical to the username used for LDAP, i.e. it should also use jsn." [puppet] - 10https://gerrit.wikimedia.org/r/1025847 (https://phabricator.wikimedia.org/T363377) (owner: 10Eevans) [06:52:19] (03CR) 10Muehlenhoff: [C:03+2] Harmonise analytics Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1024625 (owner: 10Muehlenhoff) [06:52:50] (03CR) 10Muehlenhoff: [C:03+2] Adapt cookbooks to new Cumin aliases for analytics hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1025679 (owner: 10Muehlenhoff) [06:53:20] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1025777 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [06:53:49] (03CR) 10Muehlenhoff: [C:03+2] sre.wdqs.restart-nginx: Also restart Envoy alongside [cookbooks] - 10https://gerrit.wikimedia.org/r/1023863 (owner: 10Muehlenhoff) [06:57:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T361627)', diff saved to https://phabricator.wikimedia.org/P61668 and previous config saved to /var/cache/conftool/dbconfig/20240502-065758-marostegui.json [06:58:02] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [06:58:54] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:05] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:13:05] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-test [07:13:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P61669 and previous config saved to /var/cache/conftool/dbconfig/20240502-071305-marostegui.json [07:13:09] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:13:15] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:13:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-test [07:16:43] (03CR) 10LSobanski: [C:03+1] lists: Add collaboration services as owner [puppet] - 10https://gerrit.wikimedia.org/r/1026157 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [07:21:38] (03PS1) 10Muehlenhoff: sre.wdqs.restart-nginx-envoy: Also change name in usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/1026437 [07:23:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:25:21] (03PS2) 10Muehlenhoff: sre.wdqs.restart-nginx-envoy: Also change name in usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/1026437 [07:28:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P61670 and previous config saved to /var/cache/conftool/dbconfig/20240502-072813-marostegui.json [07:29:37] (03CR) 10Muehlenhoff: 
[C:03+2] sre.wdqs.restart-nginx-envoy: Also change name in usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/1026437 (owner: 10Muehlenhoff) [07:30:36] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete certificate [puppet] - 10https://gerrit.wikimedia.org/r/1025775 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [07:36:12] (03PS1) 10Muehlenhoff: Remove obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1026438 (https://phabricator.wikimedia.org/T360439) [07:37:49] (03PS1) 10Muehlenhoff: Remove obsolete dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1026439 (https://phabricator.wikimedia.org/T360439) [07:38:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026438 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [07:38:41] !log volans@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: Update Netbox dependencies for netbox - volans@cumin1002 [07:38:58] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-internal [07:40:53] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2123.codfw.wmnet [07:42:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-internal [07:43:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T361627)', diff saved to https://phabricator.wikimedia.org/P61671 and previous config saved to /var/cache/conftool/dbconfig/20240502-074320-marostegui.json [07:43:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [07:43:26] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [07:43:38] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [07:43:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [07:43:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [07:44:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T361627)', diff saved to https://phabricator.wikimedia.org/P61672 and previous config saved to /var/cache/conftool/dbconfig/20240502-074400-marostegui.json [07:44:07] !log volans@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: Update Netbox dependencies for netbox - volans@cumin1002 [07:46:55] (03PS1) 10JMeybohm: wikifunctions: Allow prometheus to scrape metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026441 (https://phabricator.wikimedia.org/T350034) [07:47:30] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-public [07:47:53] !log installing Java 8 security updates [07:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:17] !log brouberol@cumin1002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch [07:48:37] (03PS1) 10Muehlenhoff: Switch db2123 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026443 (https://phabricator.wikimedia.org/T349619) [07:54:07] (03CR) 10JMeybohm: [C:03+2] cfssl::cert: Add before_services parameter [puppet] - 10https://gerrit.wikimedia.org/r/1025690 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [07:54:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T361627)', diff saved to https://phabricator.wikimedia.org/P61673 and previous config saved to 
/var/cache/conftool/dbconfig/20240502-075455-marostegui.json [07:54:58] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [07:55:06] (03CR) 10Muehlenhoff: [C:03+2] Switch db2123 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026443 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:56:38] (03CR) 10DCausse: "lgtm, couple nits and small issues in the dashboard URL" [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [07:56:39] !log brouberol@cumin1002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch [07:57:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-public [08:00:05] jnuche and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T0800). 
[08:00:37] morning, similarly to yesterday there are lots of people out today, so I'm pushing the train rollout to later in the day
[08:00:46] I'm aiming for 14:00 UTC/07:00 PDT, right after the UTC afternoon backport window
[08:02:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2123.codfw.wmnet
[08:08:30] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2157.codfw.wmnet
[08:10:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P61674 and previous config saved to /var/cache/conftool/dbconfig/20240502-081002-marostegui.json
[08:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:14:08] (PS1) JMeybohm: Remove mw2382 as kubernetes node to prevent scap failures [puppet] - https://gerrit.wikimedia.org/r/1026446 (https://phabricator.wikimedia.org/T362938)
[08:14:53] ops-codfw, SRE, serviceops, Patch-For-Review: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9763009 (JMeybohm) >>! In T362938#9761317, @jnuche wrote: > Scap failed to connect to this host today during the MediaWiki train while trying to preload the MW image: > `15:08:17 /usr...
[08:14:57] (CR) Volans: "Please don't forget to follow https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Renaming/Deleting_a_cookbook when renaming cookbooks" [cookbooks] - https://gerrit.wikimedia.org/r/1023863 (owner: Muehlenhoff)
[08:15:21] (CR) JMeybohm: [C:+2] Remove mw2382 as kubernetes node to prevent scap failures [puppet] - https://gerrit.wikimedia.org/r/1026446 (https://phabricator.wikimedia.org/T362938) (owner: JMeybohm)
[08:25:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P61675 and previous config saved to /var/cache/conftool/dbconfig/20240502-082510-marostegui.json
[08:25:13] ops-codfw, SRE, serviceops, Patch-For-Review: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9763050 (jnuche) >> Would it be possibly to remove it temporarily from the list of K8s workers while work is done on it? > > Will do...but I think the right thing to do here is to f...
[08:25:35] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:25:45] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:32:25] (PS1) Muehlenhoff: Switch db2157 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1026450 (https://phabricator.wikimedia.org/T349619)
[08:36:31] (CR) Muehlenhoff: [C:+2] Switch db2157 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1026450 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:38:43] (CR) Jbond: [C:-1] "I don't think this is the way to go as it will slow down deployments.
you should instead add `oob_` to the list of ignored_branch prefixe" [puppet] - 10https://gerrit.wikimedia.org/r/1025818 (owner: 10Andrew Bogott) [08:39:53] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:40:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T361627)', diff saved to https://phabricator.wikimedia.org/P61676 and previous config saved to /var/cache/conftool/dbconfig/20240502-084018-marostegui.json [08:40:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [08:40:21] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:40:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [08:40:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T361627)', diff saved to https://phabricator.wikimedia.org/P61677 and previous config saved to /var/cache/conftool/dbconfig/20240502-084041-marostegui.json [08:46:54] (03CR) 10JMeybohm: [C:03+2] Use a blocksize of /28 for staging-eqiad ipv4 pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025804 (https://phabricator.wikimedia.org/T345823) (owner: 10JMeybohm) [08:48:06] (03CR) 10Marostegui: [C:03+2] db2152: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026261 (owner: 10Marostegui) [08:49:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2152.codfw.wmnet with OS bookworm [08:50:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2157.codfw.wmnet [08:52:01] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: 
PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:52:13] that's me
[08:52:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T361627)', diff saved to https://phabricator.wikimedia.org/P61679 and previous config saved to /var/cache/conftool/dbconfig/20240502-085241-marostegui.json
[08:52:44] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[08:53:01] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:53:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[08:54:26] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for docker/containerd [puppet] - https://gerrit.wikimedia.org/r/1026451 (https://phabricator.wikimedia.org/T135991)
[08:57:01] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:58:01] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:58:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:58:54] (PS1) Muehlenhoff: cloudweb: Enable profile::auto_restarts::service for FPM [puppet] - https://gerrit.wikimedia.org/r/1026453
(https://phabricator.wikimedia.org/T135991) [08:58:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [08:59:01] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2171.codfw.wmnet [08:59:56] (03PS1) 10Muehlenhoff: Switch db2171 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026454 (https://phabricator.wikimedia.org/T349619) [09:00:40] 06SRE, 06Commons, 10MediaWiki-File-management, 06serviceops, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155#9763167 (10IagoQnsi) >>! In T266155#9485678, @Bawolff wrote: > Just trying to think up solutions - if th... [09:02:09] !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: sync on main [09:02:48] (03CR) 10Muehlenhoff: [C:03+2] Switch db2171 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026454 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:02:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:03:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:03:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:03:46] (03PS1) 10Slyngshede: API Tokens: Allow authorized users to manage their API tokens. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1026458 [09:03:53] (03PS1) 10Muehlenhoff: cloudweb: Enable profile::auto_restarts::service for nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/1026459 (https://phabricator.wikimedia.org/T135991) [09:04:03] (03PS2) 10Hnowlan: trafficserver: move 85% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1026159 (https://phabricator.wikimedia.org/T362323) [09:06:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2152.codfw.wmnet with reason: host reimage [09:07:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P61680 and previous config saved to /var/cache/conftool/dbconfig/20240502-090748-marostegui.json [09:08:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:08:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2171.codfw.wmnet [09:08:46] (03CR) 10Muehlenhoff: [C:03+2] "Ack. There was nothing to cleanup (unless you did it already)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1023863 (owner: 10Muehlenhoff) [09:09:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2152.codfw.wmnet with reason: host reimage [09:10:26] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cephosd1005.eqiad.wmnet with OS bookworm [09:11:26] (03CR) 10Volans: "No, I didn't. 
And I can see both old and new files still there:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1023863 (owner: 10Muehlenhoff) [09:11:56] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1025913 (https://phabricator.wikimedia.org/T363977) [09:12:07] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [09:13:13] !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [09:13:54] !log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main [09:16:05] (03CR) 10Muehlenhoff: [C:03+2] "Ah, I looked in the wrong directory. Now cleaned up, sorry." [cookbooks] - 10https://gerrit.wikimedia.org/r/1023863 (owner: 10Muehlenhoff) [09:16:10] PROBLEM - Router interfaces on cr2-magru is CRITICAL: CRITICAL: host 195.200.68.129, interfaces up: 45, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:16:12] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:16:12] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:16:32] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:16:34] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:16:45] (03PS1) 10Marostegui: Revert "db2152: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026397 [09:17:43] (03CR) 10Volans: "I'm worried it might make running cookbooks break. 
But yes we could automate this in some way." [cookbooks] - 10https://gerrit.wikimedia.org/r/1023863 (owner: 10Muehlenhoff) [09:18:00] !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [09:18:21] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Set up Ganeti clusters in magru - https://phabricator.wikimedia.org/T363978 (10MoritzMuehlenhoff) 03NEW [09:18:34] !log depooling 5 appservers in advance of migrating them to k8s workers [09:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Set up Ganeti clusters in magru - https://phabricator.wikimedia.org/T363978#9763233 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [09:22:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P61681 and previous config saved to /var/cache/conftool/dbconfig/20240502-092256-marostegui.json [09:23:32] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:23:34] RECOVERY - BFD status on cr2-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:24:10] RECOVERY - Router interfaces on cr2-magru is OK: OK: host 195.200.68.129, interfaces up: 46, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:24:14] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:24:14] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:24:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 1%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P61682 and previous config saved to /var/cache/conftool/dbconfig/20240502-092454-root.json [09:25:32] (03CR) 10Marostegui: [C:03+2] Revert "db2152: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026397 (owner: 10Marostegui) [09:26:41] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2178.codfw.wmnet [09:27:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:27:40] (03PS1) 10Muehlenhoff: Switch db2178 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026464 (https://phabricator.wikimedia.org/T349619) [09:28:30] PROBLEM - Host mw2382 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2152.codfw.wmnet with OS bookworm [09:29:50] (03PS4) 10Btullis: Add the commons impact metrics dumps fetcher and readme [puppet] - 10https://gerrit.wikimedia.org/r/1026162 (https://phabricator.wikimedia.org/T358701) [09:30:05] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026162 (https://phabricator.wikimedia.org/T358701) (owner: 10Btullis) [09:31:33] FIRING: KubernetesCalicoDown: mw2382.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2382.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:32:16] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage [09:34:08] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2227/co" [puppet] - 10https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [09:35:22] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage [09:38:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T361627)', diff saved to https://phabricator.wikimedia.org/P61683 and previous config saved to /var/cache/conftool/dbconfig/20240502-093803-marostegui.json [09:38:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2179.codfw.wmnet with reason: Maintenance [09:38:08] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:38:15] (03PS1) 10Marostegui: es6 hosts: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026493 [09:38:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2179.codfw.wmnet with reason: Maintenance [09:38:20] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw2382.codfw.wmnet with reason: Degraded RAID/storage controller issues [09:38:24] (03CR) 10Muehlenhoff: [C:03+2] Switch db2178 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026464 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:38:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T361627)', diff saved to https://phabricator.wikimedia.org/P61684 and previous config saved to /var/cache/conftool/dbconfig/20240502-093827-marostegui.json [09:38:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw2382.codfw.wmnet with reason: Degraded RAID/storage 
controller issues [09:38:38] (03CR) 10Marostegui: [C:03+2] es6 hosts: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026493 (owner: 10Marostegui) [09:38:41] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9763294 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e3dd1140-411c-45b4-a1c6-3961f47c4f12) set by jayme@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Degra... [09:40:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61685 and previous config saved to /var/cache/conftool/dbconfig/20240502-094000-root.json [09:42:11] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:42:17] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:42:17] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:42:17] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:42:24] (03PS1) 10Effie Mouzeli: admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026495 (https://phabricator.wikimedia.org/T287491) [09:42:30] FIRING: [2x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b3-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [09:42:56] !log jmm@cumin2002 END 
(PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2178.codfw.wmnet [09:49:13] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:49:17] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:49:19] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:49:19] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:50:02] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2192.codfw.wmnet [09:50:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T361627)', diff saved to https://phabricator.wikimedia.org/P61686 and previous config saved to /var/cache/conftool/dbconfig/20240502-095038-marostegui.json [09:50:42] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:52:07] (03PS1) 10Muehlenhoff: Switch db2192 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026499 (https://phabricator.wikimedia.org/T349619) [09:53:53] (03CR) 10Muehlenhoff: [C:03+2] Switch db2192 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026499 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:54:23] !log installing util-linux security updates [09:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61687 and previous config saved to 
/var/cache/conftool/dbconfig/20240502-095506-root.json [09:58:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2192.codfw.wmnet [09:58:54] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2211.codfw.wmnet [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T1000) [10:00:56] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1005.eqiad.wmnet with OS bookworm [10:03:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:05:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P61688 and previous config saved to /var/cache/conftool/dbconfig/20240502-100546-marostegui.json [10:10:10] (03PS1) 10Muehlenhoff: Switch db2211 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026504 (https://phabricator.wikimedia.org/T349619) [10:10:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61689 and previous config saved to /var/cache/conftool/dbconfig/20240502-101012-root.json [10:11:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet [10:11:19] (03CR) 10Muehlenhoff: [C:03+2] Switch db2211 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026504 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:15:27] (03PS2) 10Effie Mouzeli: admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026495 (https://phabricator.wikimedia.org/T287491) [10:15:33] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.puppet.migrate-host (exit_code=0) for host db2211.codfw.wmnet [10:15:41] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:16:20] (03PS3) 10Effie Mouzeli: admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026495 (https://phabricator.wikimedia.org/T287491) [10:17:44] (03CR) 10EoghanGaffney: [V:03+1 C:03+2] lists: Add collaboration services as owner [puppet] - 10https://gerrit.wikimedia.org/r/1026157 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [10:20:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet [10:20:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P61690 and previous config saved to /var/cache/conftool/dbconfig/20240502-102053-marostegui.json [10:22:10] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2213.codfw.wmnet [10:24:11] (03PS1) 10Muehlenhoff: Switch db2213 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026505 (https://phabricator.wikimedia.org/T349619) [10:25:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61691 and previous config saved to /var/cache/conftool/dbconfig/20240502-102518-root.json [10:30:46] (03PS1) 10Gmodena: EventStreamConfig: Add webrequest.frontend.v1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026506 (https://phabricator.wikimedia.org/T314956) [10:31:29] (03CR) 10CI reject: [V:04-1] EventStreamConfig: Add webrequest.frontend.v1. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026506 (https://phabricator.wikimedia.org/T314956) (owner: 10Gmodena) [10:31:44] (03CR) 10Muehlenhoff: [C:03+2] Switch db2213 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026505 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:31:48] (03CR) 10JMeybohm: [C:03+1] admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026495 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [10:34:10] (03CR) 10Effie Mouzeli: [C:03+2] admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026495 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [10:35:00] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:35:34] (03PS2) 10Gmodena: EventStreamConfig: Add webrequest.frontend.v1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026506 (https://phabricator.wikimedia.org/T314956) [10:36:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T361627)', diff saved to https://phabricator.wikimedia.org/P61692 and previous config saved to /var/cache/conftool/dbconfig/20240502-103601-marostegui.json [10:36:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2199.codfw.wmnet with reason: Maintenance [10:36:04] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:36:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2199.codfw.wmnet with reason: Maintenance [10:36:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2213.codfw.wmnet [10:37:12] (03Merged) 10jenkins-bot: admin_ng/cert-manager: Remove dependency on 
kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026495 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [10:37:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for ganeti01/magru - jmm@cumin2002" [10:38:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for ganeti01/magru - jmm@cumin2002" [10:38:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:40:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61693 and previous config saved to /var/cache/conftool/dbconfig/20240502-104024-root.json [10:40:29] (03CR) 10Mforns: [C:03+1] "Thank you all for this!" [puppet] - 10https://gerrit.wikimedia.org/r/1026162 (https://phabricator.wikimedia.org/T358701) (owner: 10Btullis) [10:42:14] (03PS1) 10Muehlenhoff: Make ganeti7001 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/1026508 (https://phabricator.wikimedia.org/T363978) [10:42:54] (03CR) 10Slyngshede: [V:03+1 C:03+2] CloudIDM, Install Bitu for labtest [puppet] - 10https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [10:42:55] (03CR) 10Btullis: [C:03+2] "OK, thanks. I'll merge as-is to make sure that the hdfs-rsync works, then we can adjust the text and links later, in a separate commit." 
[puppet] - 10https://gerrit.wikimedia.org/r/1026162 (https://phabricator.wikimedia.org/T358701) (owner: 10Btullis) [10:43:10] (03CR) 10Muehlenhoff: [C:03+2] Make ganeti7001 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/1026508 (https://phabricator.wikimedia.org/T363978) (owner: 10Muehlenhoff) [10:43:45] (03PS1) 10MVernon: ceph: Also install python3-bcrypt [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026509 [10:45:08] (03PS1) 10Effie Mouzeli: admin_ng/cert-manager: fix misplaced key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026510 [10:45:29] (03CR) 10Btullis: [C:03+1] ceph: Also install python3-bcrypt [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026509 (owner: 10MVernon) [10:46:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2206.codfw.wmnet with reason: Maintenance [10:46:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2206.codfw.wmnet with reason: Maintenance [10:46:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T361627)', diff saved to https://phabricator.wikimedia.org/P61694 and previous config saved to /var/cache/conftool/dbconfig/20240502-104658-marostegui.json [10:47:01] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:47:23] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1183.eqiad.wmnet [10:47:30] (03PS1) 10Esanders: Pre-emptively disable DiscussionToolsEnableThanks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026511 [10:47:46] (03PS2) 10Esanders: Pre-emptively disable DiscussionToolsEnableThanks (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026511 [10:48:24] (03CR) 10CI reject: [V:04-1] Pre-emptively disable DiscussionToolsEnableThanks (no-op) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1026511 (owner: 10Esanders) [10:49:01] (03CR) 10Effie Mouzeli: [C:03+2] admin_ng/cert-manager: fix misplaced key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026510 (owner: 10Effie Mouzeli) [10:49:24] (03CR) 10MVernon: [V:03+2 C:03+2] ceph: Also install python3-bcrypt [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026509 (owner: 10MVernon) [10:49:38] (03PS1) 10Muehlenhoff: Add Hiera config for ganeti7001 [puppet] - 10https://gerrit.wikimedia.org/r/1026512 (https://phabricator.wikimedia.org/T363978) [10:50:23] (03CR) 10Muehlenhoff: [C:03+2] Add Hiera config for ganeti7001 [puppet] - 10https://gerrit.wikimedia.org/r/1026512 (https://phabricator.wikimedia.org/T363978) (owner: 10Muehlenhoff) [10:51:20] (03PS1) 10Muehlenhoff: Switch db1183 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026513 (https://phabricator.wikimedia.org/T349619) [10:52:02] (03Merged) 10jenkins-bot: admin_ng/cert-manager: fix misplaced key [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026510 (owner: 10Effie Mouzeli) [10:53:20] (03CR) 10Muehlenhoff: [C:03+2] Switch db1183 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026513 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:55:06] (03CR) 10Hnowlan: [C:03+2] k8s: move 5 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1026158 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [10:55:19] (03PS1) 10Ssingh: Revert "magru: depool geoip/text*" [dns] - 10https://gerrit.wikimedia.org/r/1026484 [10:55:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61695 and previous config saved to /var/cache/conftool/dbconfig/20240502-105530-root.json [10:58:54] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:59:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T361627)', diff saved to https://phabricator.wikimedia.org/P61696 and previous config saved to /var/cache/conftool/dbconfig/20240502-105903-marostegui.json [10:59:07] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:59:26] (03PS1) 10Hnowlan: scap: make mw1407 a scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/1026520 (https://phabricator.wikimedia.org/T362323) [11:01:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1183.eqiad.wmnet [11:03:38] (03CR) 10Effie Mouzeli: [C:03+1] scap: make mw1407 a scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/1026520 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [11:04:17] (03CR) 10Hnowlan: [C:03+2] scap: make mw1407 a scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/1026520 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [11:04:37] (03CR) 10Hnowlan: scap: make mw1407 a scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/1026520 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [11:04:45] (03PS2) 10Hnowlan: scap: make mw1407 a scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/1026520 (https://phabricator.wikimedia.org/T362323) [11:05:16] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:05:33] (03CR) 10Hnowlan: [C:03+2] scap: make mw1407 a scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/1026520 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [11:06:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:07:20] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ganeti01.svc.magru.wmnet on all recursors 
[11:07:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti01.svc.magru.wmnet on all recursors [11:07:37] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ganeti01.svc.magru.wmnet. on all recursors [11:07:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti01.svc.magru.wmnet. on all recursors [11:08:23] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache ganeti01.svc.magru.wmnet on all recursors [11:08:26] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti01.svc.magru.wmnet on all recursors [11:13:56] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1185.eqiad.wmnet [11:14:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P61697 and previous config saved to /var/cache/conftool/dbconfig/20240502-111410-marostegui.json [11:14:52] (03PS1) 10Muehlenhoff: Switch db1185 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026524 (https://phabricator.wikimedia.org/T349619) [11:15:33] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1371.eqiad.wmnet with OS bullseye [11:15:52] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9763469 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1371.... 
[11:16:03] (03PS1) 10Cathal Mooney: Add include for svc.magru.wmnet in wmnet zone [dns] - 10https://gerrit.wikimedia.org/r/1026525 (https://phabricator.wikimedia.org/T362421) [11:16:55] (03CR) 10CI reject: [V:04-1] Add include for svc.magru.wmnet in wmnet zone [dns] - 10https://gerrit.wikimedia.org/r/1026525 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [11:17:12] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1409.eqiad.wmnet with OS bullseye [11:17:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:17:27] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9763472 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1409.... 
[11:18:35] (03PS2) 10Cathal Mooney: Add include for svc.magru.wmnet in wmnet zone [dns] - 10https://gerrit.wikimedia.org/r/1026525 (https://phabricator.wikimedia.org/T362421) [11:18:51] (03CR) 10Muehlenhoff: [C:03+2] Switch db1185 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026524 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:19:41] (03CR) 10Ssingh: Add include for svc.magru.wmnet in wmnet zone (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1026525 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [11:20:23] (03PS1) 10Ssingh: magru: add ncredir configuration [puppet] - 10https://gerrit.wikimedia.org/r/1026526 (https://phabricator.wikimedia.org/T346722) [11:21:12] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1435.eqiad.wmnet with OS bullseye [11:21:14] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1399.eqiad.wmnet with OS bullseye [11:21:15] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1405.eqiad.wmnet with OS bullseye [11:21:24] (03PS2) 10Ssingh: magru: add ncredir configuration [puppet] - 10https://gerrit.wikimedia.org/r/1026526 (https://phabricator.wikimedia.org/T346722) [11:21:29] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9763477 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1435.... [11:21:30] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9763478 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1399.... 
[11:21:31] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9763479 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1405.... [11:22:39] PROBLEM - Check whether ferm is active by checking the default input chain on mw1362 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:24:15] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1021 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:24:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1185.eqiad.wmnet [11:25:36] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1210.eqiad.wmnet [11:26:37] (03PS1) 10Muehlenhoff: Switch db1210 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026528 (https://phabricator.wikimedia.org/T349619) [11:27:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:27:22] (03CR) 10Ssingh: [C:03+1] "Thanks for the patch! Happy to take care of merging this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1024651 (owner: 10Muehlenhoff) [11:27:26] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:49] PROBLEM - Check whether ferm is active by checking the default input chain on mw1351 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:28:31] PROBLEM - Check whether ferm is active by checking the default input chain on mw1383 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:28:56] (03CR) 10Muehlenhoff: [C:03+2] Switch db1210 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026528 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:29:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P61698 and previous config saved to /var/cache/conftool/dbconfig/20240502-112918-marostegui.json [11:29:47] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1371.eqiad.wmnet with reason: host reimage [11:30:59] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1409.eqiad.wmnet with reason: host reimage [11:31:29] ^appservers is me, acked. 
prometheus run will fix it [11:31:38] er puppet run [11:32:22] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1371.eqiad.wmnet with reason: host reimage [11:32:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:34:29] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1435.eqiad.wmnet with reason: host reimage [11:34:45] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1399.eqiad.wmnet with reason: host reimage [11:35:11] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1405.eqiad.wmnet with reason: host reimage [11:35:28] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1409.eqiad.wmnet with reason: host reimage [11:35:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1210.eqiad.wmnet [11:37:01] (03PS3) 10Cathal Mooney: Add include for svc.magru.wmnet in wmnet zone [dns] - 10https://gerrit.wikimedia.org/r/1026525 (https://phabricator.wikimedia.org/T362421) [11:37:26] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:37:27] (03CR) 10Cathal Mooney: Add include for svc.magru.wmnet in wmnet zone (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1026525 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [11:37:32] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1405.eqiad.wmnet with reason: host reimage [11:38:51] (03CR) 
10Ssingh: [C:03+1] Add include for svc.magru.wmnet in wmnet zone [dns] - 10https://gerrit.wikimedia.org/r/1026525 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [11:39:26] (03CR) 10Cathal Mooney: [C:03+2] Add include for svc.magru.wmnet in wmnet zone [dns] - 10https://gerrit.wikimedia.org/r/1026525 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [11:39:47] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1435.eqiad.wmnet with reason: host reimage [11:40:53] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:41:04] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache ganeti01.svc.magru.wmnet on all recursors [11:41:07] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti01.svc.magru.wmnet on all recursors [11:41:22] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:41:30] (03PS2) 10Elukey: role::sessionstore: move Cassandra instances to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1025810 (https://phabricator.wikimedia.org/T352647) [11:41:51] (03CR) 10Muehlenhoff: "I'll just go ahead and merge now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1024651 (owner: 10Muehlenhoff) [11:41:53] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for Traffic services [puppet] - 10https://gerrit.wikimedia.org/r/1024651 (owner: 10Muehlenhoff) [11:42:25] FIRING: [7x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:47] !log elukey@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=inference,name=codfw [11:42:56] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1213.eqiad.wmnet [11:43:00] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1399.eqiad.wmnet with reason: host reimage [11:43:35] !log depool LiftWing's codfw services from traffic to move all MW API calls to mw-api-int-ro [11:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:48] (03PS1) 10Muehlenhoff: Switch db1213 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026531 (https://phabricator.wikimedia.org/T349619) [11:44:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T361627)', diff saved to https://phabricator.wikimedia.org/P61699 and previous config saved to /var/cache/conftool/dbconfig/20240502-114425-marostegui.json [11:44:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2210.codfw.wmnet with reason: Maintenance [11:44:28] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:44:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2210.codfw.wmnet with reason: Maintenance [11:44:48] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T361627)', diff saved to https://phabricator.wikimedia.org/P61700 and previous config saved to /var/cache/conftool/dbconfig/20240502-114448-marostegui.json [11:46:45] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:46:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7003.magru.wmnet [11:48:22] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:48:33] PROBLEM - Check whether ferm is active by checking the default input chain on mw1367 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:49:42] (03CR) 10Muehlenhoff: [C:03+2] Switch db1213 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026531 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:51:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1371.eqiad.wmnet with OS bullseye [11:51:59] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9763643 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1371.eqia... 
[11:52:39] RECOVERY - Check whether ferm is active by checking the default input chain on mw1362 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:53:36] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [11:53:40] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1409.eqiad.wmnet with OS bullseye [11:53:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T361627)', diff saved to https://phabricator.wikimedia.org/P61701 and previous config saved to /var/cache/conftool/dbconfig/20240502-115353-marostegui.json [11:53:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1213.eqiad.wmnet [11:53:58] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:53:59] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9763647 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1409.eqia... [11:54:15] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1021 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:54:19] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [11:55:38] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[11:55:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1405.eqiad.wmnet with OS bullseye [11:56:00] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9763650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1405.eqia... [11:56:38] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1230.eqiad.wmnet [11:56:53] (03CR) 10Elukey: [C:03+2] revscoring-editquality-damaging: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018997 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [11:57:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7003.magru.wmnet [11:57:14] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1435.eqiad.wmnet with OS bullseye [11:57:17] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [11:57:27] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9763652 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1435.eqia... 
[11:57:33] (03PS1) 10Muehlenhoff: Switch db1230 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026535 (https://phabricator.wikimedia.org/T349619) [11:57:49] RECOVERY - Check whether ferm is active by checking the default input chain on mw1351 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:58:31] RECOVERY - Check whether ferm is active by checking the default input chain on mw1383 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T1200) [12:00:58] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1399.eqiad.wmnet with OS bullseye [12:01:12] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9763678 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1399.eqia... [12:02:49] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[12:07:15] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:07:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:09:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P61702 and previous config saved to /var/cache/conftool/dbconfig/20240502-120901-marostegui.json [12:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:12:22] (03CR) 10Elukey: [C:03+2] article-description: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018960 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [12:12:48] (03PS4) 10Clément Goubert: articletopic-outlink: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018962 (https://phabricator.wikimedia.org/T362316) [12:13:03] (03CR) 10Elukey: [C:03+2] readability: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018965 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [12:13:07] (03PS1) 10Muehlenhoff: netbox: Add ganeti01/magru [puppet] - 
10https://gerrit.wikimedia.org/r/1026536 (https://phabricator.wikimedia.org/T363978) [12:13:12] (03CR) 10Muehlenhoff: [C:03+2] Switch db1230 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1026535 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:13:17] (03CR) 10Elukey: [C:03+2] revertrisk: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018987 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [12:13:30] (03CR) 10Elukey: [C:03+2] revscoring-articlequality: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018989 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [12:13:50] (03CR) 10Elukey: [C:03+2] revscoring-articletopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018991 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [12:14:13] (03CR) 10Elukey: [C:03+2] revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018999 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [12:14:29] (03CR) 10Elukey: [C:03+2] revscoring-editquality-reverted: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019001 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [12:14:42] (03CR) 10Elukey: [C:03+2] articletopic-outlink: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018962 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [12:15:13] (03PS1) 10Slyngshede: P:idm, add LDAP authentication dependencies. [puppet] - 10https://gerrit.wikimedia.org/r/1026537 (https://phabricator.wikimedia.org/T362128) [12:15:56] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[12:17:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1230.eqiad.wmnet [12:17:57] (03PS1) 10Marostegui: db2161: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026538 [12:18:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2161', diff saved to https://phabricator.wikimedia.org/P61703 and previous config saved to /var/cache/conftool/dbconfig/20240502-121759-root.json [12:18:33] RECOVERY - Check whether ferm is active by checking the default input chain on mw1367 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:18:40] (03CR) 10Marostegui: [C:03+2] db2161: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1026538 (owner: 10Marostegui) [12:18:41] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1026539 (owner: 10L10n-bot) [12:19:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2161.codfw.wmnet with OS bookworm [12:20:18] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:20:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1026537 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [12:20:27] FIRING: ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:21:40] (03CR) 10Ayounsi: [C:04-1] "ganeti01.svc.magru.wmnet is public IP 195.200.68.6, while it should be a private IP." 
[puppet] - 10https://gerrit.wikimedia.org/r/1026536 (https://phabricator.wikimedia.org/T363978) (owner: 10Muehlenhoff) [12:22:50] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [12:23:19] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9763711 (10MoritzMuehlenhoff) [12:23:54] RESOLVED: ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P61704 and previous config saved to /var/cache/conftool/dbconfig/20240502-122409-marostegui.json [12:25:19] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2161.codfw.wmnet with OS bookworm [12:26:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2161.codfw.wmnet with OS bookworm [12:28:07] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:29:44] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:32:25] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:33:05] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[12:33:33] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:34:19] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:35:37] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updated VIP for ganeti01/magru - jmm@cumin2002" [12:36:26] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:36:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updated VIP for ganeti01/magru - jmm@cumin2002" [12:36:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:38:46] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ganeti01.svc.magru.wmnet on all recursors [12:38:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti01.svc.magru.wmnet on all recursors [12:39:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T361627)', diff saved to https://phabricator.wikimedia.org/P61705 and previous config saved to /var/cache/conftool/dbconfig/20240502-123916-marostegui.json [12:39:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance [12:39:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance [12:39:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T361627)', diff saved to https://phabricator.wikimedia.org/P61706 and previous config saved to /var/cache/conftool/dbconfig/20240502-123939-marostegui.json [12:40:22] (03CR) 10Muehlenhoff: "Ah yes. Changed the VIP to a private IP." 
[puppet] - 10https://gerrit.wikimedia.org/r/1026536 (https://phabricator.wikimedia.org/T363978) (owner: 10Muehlenhoff) [12:43:11] (03CR) 10Ayounsi: [C:03+1] netbox: Add ganeti01/magru [puppet] - 10https://gerrit.wikimedia.org/r/1026536 (https://phabricator.wikimedia.org/T363978) (owner: 10Muehlenhoff) [12:43:23] (03PS1) 10Elukey: ml-services: update Host header for revscoring-articletopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026551 [12:43:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2161.codfw.wmnet with reason: host reimage [12:44:38] (03CR) 10Muehlenhoff: [C:03+2] netbox: Add ganeti01/magru [puppet] - 10https://gerrit.wikimedia.org/r/1026536 (https://phabricator.wikimedia.org/T363978) (owner: 10Muehlenhoff) [12:45:13] 10SRE-swift-storage, 06Commons: Commons: File not found - https://phabricator.wikimedia.org/T363995#9763728 (10Bugreporter) [12:45:21] (03CR) 10Elukey: [C:03+2] ml-services: update Host header for revscoring-articletopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026551 (owner: 10Elukey) [12:46:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2161.codfw.wmnet with reason: host reimage [12:48:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T361627)', diff saved to https://phabricator.wikimedia.org/P61707 and previous config saved to /var/cache/conftool/dbconfig/20240502-124857-marostegui.json [12:49:01] (03CR) 10Slyngshede: [C:03+2] P:idm, add LDAP authentication dependencies. [puppet] - 10https://gerrit.wikimedia.org/r/1026537 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [12:49:01] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[12:49:03] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [12:53:20] (03PS1) 10JMeybohm: New version of statds module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026555 (https://phabricator.wikimedia.org/T362978) [12:53:22] (03PS1) 10JMeybohm: modules: Add restrictedSecurityContext to statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026556 (https://phabricator.wikimedia.org/T362978) [12:53:43] (03PS1) 10Slyngshede: P:idm Add LDAP import. [puppet] - 10https://gerrit.wikimedia.org/r/1026557 (https://phabricator.wikimedia.org/T362128) [12:54:27] (03CR) 10Slyngshede: [C:03+2] P:idm Add LDAP import. [puppet] - 10https://gerrit.wikimedia.org/r/1026557 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [12:55:29] (03PS1) 10Marostegui: Revert "db2161: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026485 [12:57:05] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:57:39] (03PS1) 10Elukey: ml-services: fix host header for revscoring articlequality wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026558 [12:58:38] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9763768 (10MoritzMuehlenhoff) [12:59:04] (03CR) 10Elukey: [C:03+2] ml-services: fix host header for revscoring articlequality wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026558 (owner: 10Elukey) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T1300). 
[13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:02:30] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:04:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P61708 and previous config saved to /var/cache/conftool/dbconfig/20240502-130404-marostegui.json [13:04:46] (03PS2) 10Eevans: Add new user jsherman (deployment group) [puppet] - 10https://gerrit.wikimedia.org/r/1025847 (https://phabricator.wikimedia.org/T363377) [13:05:52] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1025847 (https://phabricator.wikimedia.org/T363377) (owner: 10Eevans) [13:08:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2161.codfw.wmnet with OS bookworm [13:11:50] (03CR) 10Marostegui: [C:03+2] Revert "db2161: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1026485 (owner: 10Marostegui) [13:11:58] (03PS1) 10Elukey: httpbb: add article-descriptions to Lift Wing tests [puppet] - 10https://gerrit.wikimedia.org/r/1026559 [13:12:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61709 and previous config saved to /var/cache/conftool/dbconfig/20240502-131225-root.json [13:15:58] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:19:09] 10ops-codfw, 06SRE: PowerSupplyFailure - https://phabricator.wikimedia.org/T363926#9763813 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated psu2 and cable. alert cleared on machine. 
[13:19:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P61710 and previous config saved to /var/cache/conftool/dbconfig/20240502-131912-marostegui.json [13:20:25] FIRING: [4x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:21:12] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 26 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:23:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:23:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet [13:24:24] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:24:33] (03CR) 10AikoChou: [C:03+1] httpbb: add article-descriptions to Lift Wing tests [puppet] - 10https://gerrit.wikimedia.org/r/1026559 (owner: 10Elukey) [13:24:41] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[13:27:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61711 and previous config saved to /var/cache/conftool/dbconfig/20240502-132731-root.json [13:28:16] (03CR) 10Ssingh: [C:03+2] Revert "magru: depool geoip/text*" [dns] - 10https://gerrit.wikimedia.org/r/1026484 (owner: 10Ssingh) [13:28:20] (03PS2) 10Ssingh: Revert "magru: depool geoip/text*" [dns] - 10https://gerrit.wikimedia.org/r/1026484 [13:30:47] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1026484 (owner: 10Ssingh) [13:32:05] !log running authdns-update to revert magru text geomap [13:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet [13:34:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T361627)', diff saved to https://phabricator.wikimedia.org/P61712 and previous config saved to /var/cache/conftool/dbconfig/20240502-133420-marostegui.json [13:34:25] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:35:25] FIRING: [4x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:35:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2140.codfw.wmnet with reason: Maintenance [13:35:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2140.codfw.wmnet with reason: Maintenance [13:36:40] (03PS1) 10Ssingh: P:cumin: update aliases for ncredir [puppet] - 
10https://gerrit.wikimedia.org/r/1026562 [13:37:36] (03PS2) 10Ssingh: P:cumin: update aliases for ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1026562 [13:38:46] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2231/co" [puppet] - 10https://gerrit.wikimedia.org/r/1026562 (owner: 10Ssingh) [13:40:15] PROBLEM - MariaDB Replica SQL: s3 #page on db1175 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table pagelinks is corrupt: try to repair it on query. Default database: mswiktionary. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:40:16] PROBLEM - MariaDB Replica SQL: s3 #page on db1189 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table pagelinks is corrupt: try to repair it on query. Default database: mswiktionary. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:40:24] nice... [13:40:25] depooling [13:40:27] (03PS1) 10JMeybohm: New chart from scaffold: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026563 (https://phabricator.wikimedia.org/T362310) [13:40:31] (03PS1) 10JMeybohm: Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) [13:40:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1175 db1189', diff saved to https://phabricator.wikimedia.org/P61713 and previous config saved to /var/cache/conftool/dbconfig/20240502-134050-root.json [13:40:54] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:41:05] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[13:41:06] !incidents [13:41:07] 4647 (ACKED) db1189 (paged)/MariaDB Replica SQL: s3 (paged) [13:41:07] 4648 (ACKED) db1175 (paged)/MariaDB Replica SQL: s3 (paged) [13:41:18] (03CR) 10CI reject: [V:04-1] New chart from scaffold: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026563 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [13:41:22] here [13:41:36] let me see [13:42:17] Amir1: I am with db1175 [13:42:22] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:42:24] fixed [13:42:27] I am fixing the other one now [13:42:30] FIRING: [2x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b3-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [13:42:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61714 and previous config saved to /var/cache/conftool/dbconfig/20240502-134237-root.json [13:42:37] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[13:42:54] both fixed now [13:43:01] (03CR) 10Eevans: [C:03+1] role::sessionstore: move Cassandra instances to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1025810 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:43:11] Going to start repooling them slowly [13:43:16] RECOVERY - MariaDB Replica SQL: s3 #page on db1175 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:43:17] RECOVERY - MariaDB Replica SQL: s3 #page on db1189 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:43:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61715 and previous config saved to /var/cache/conftool/dbconfig/20240502-134328-root.json [13:43:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61716 and previous config saved to /var/cache/conftool/dbconfig/20240502-134333-root.json [13:43:51] oh thanks [13:43:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7003.magru.wmnet to cluster magru01 and group B3 [13:43:55] you're so fast [13:43:57] :) [13:43:59] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7003.magru.wmnet to cluster magru01 and group B3 [13:44:06] (03PS6) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [13:44:11] Thanks [13:44:26] I will create a task to track it just in case [13:44:35] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) 
[13:45:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_magru01_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:29] (03PS1) 10Muehlenhoff: Make ganeti7003 as Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/1026565 (https://phabricator.wikimedia.org/T363978) [13:47:24] (03CR) 10Muehlenhoff: [C:03+2] Make ganeti7003 as Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/1026565 (https://phabricator.wikimedia.org/T363978) (owner: 10Muehlenhoff) [13:50:05] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [13:50:10] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [13:50:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:50:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:51:01] (03PS17) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) [13:52:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7003.magru.wmnet to cluster magru01 and group B3 [13:52:58] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [13:53:02] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:53:26] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7003.magru.wmnet to cluster magru01 and group B3 [13:54:26] !log running homer 'cr*eqiad*' commit for new kubernetes workers [13:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow7001.magru.wmnet [13:56:06] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:56:09] (03PS10) 10Effie Mouzeli: admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) [13:56:31] 10SRE-swift-storage, 06Commons: Commons: File not found - https://phabricator.wikimedia.org/T363995#9763970 (10MatthewVernon) The earliest request in our logs for an original is from 28 Apr, already 404: ` moss-fe1001.eqiad.wmnet: /var/log/swift/proxy-access.log.4.gz:Apr 28 10:22:47 moss-fe1001 proxy-server: E... [13:57:27] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9763973 (10Jclark-ctr) @andrea.denisse We have been having a few issues with software raids we are trying to pinpoint what slot these are in. Idrac is not listing the drives. I will message you for assistance [13:57:39] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [13:57:42] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [13:57:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61717 and previous config saved to /var/cache/conftool/dbconfig/20240502-135743-root.json [13:58:00] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:58:07] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[13:58:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61718 and previous config saved to /var/cache/conftool/dbconfig/20240502-135833-root.json [13:58:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61719 and previous config saved to /var/cache/conftool/dbconfig/20240502-135839-root.json [13:59:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:59:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:59:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T361627)', diff saved to https://phabricator.wikimedia.org/P61720 and previous config saved to /var/cache/conftool/dbconfig/20240502-135947-marostegui.json [13:59:50] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:00:43] jouncebot: next [14:00:43] In 1 hour(s) and 59 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T1600) [14:01:26] @marostegui: I wanted to deploy the train, is the DB issue still going on? should I wait for you? [14:02:08] marostegui: botched ping ^ [14:02:15] jnuche: no no, it's all fixed. 
You can go ahead [14:02:25] thanks a ton [14:03:42] (03CR) 10JMeybohm: [C:04-1] admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [14:04:04] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7001.magru.wmnet - jmm@cumin2002" [14:04:09] (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026572 (https://phabricator.wikimedia.org/T361397) [14:04:11] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026572 (https://phabricator.wikimedia.org/T361397) (owner: 10TrainBranchBot) [14:04:52] (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026572 (https://phabricator.wikimedia.org/T361397) (owner: 10TrainBranchBot) [14:04:56] (03PS1) 10Muehlenhoff: Enable kafka access for netflow7001 [puppet] - 10https://gerrit.wikimedia.org/r/1026573 [14:04:56] !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1371.eqiad.wmnet|mw1399.eqiad.wmnet|mw1405.eqiad.wmnet|mw1409.eqiad.wmnet|mw1435.eqiad.wmnet),cluster=kubernetes,service=kubesvc [14:07:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7001.magru.wmnet - jmm@cumin2002" [14:07:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:07:19] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow7001.magru.wmnet on all recursors [14:07:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7001.magru.wmnet on all recursors [14:07:45] !log jmm@cumin2002 
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7001.magru.wmnet - jmm@cumin2002" [14:08:09] (03CR) 10Hnowlan: [C:03+2] mw-web, mw-api-ext: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026160 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:08:34] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1026539 (owner: 10L10n-bot) [14:08:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7001.magru.wmnet - jmm@cumin2002" [14:08:44] (03PS4) 10Eevans: {session,echo}store: update defaults for PKI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) [14:08:46] (03PS7) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [14:08:59] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026160 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:09:14] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [14:09:36] (03CR) 10CI reject: [V:04-1] {session,echo}store: update defaults for PKI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [14:09:53] (03PS8) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 
(https://phabricator.wikimedia.org/T363918) [14:10:23] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [14:10:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T361627)', diff saved to https://phabricator.wikimedia.org/P61721 and previous config saved to /var/cache/conftool/dbconfig/20240502-141046-marostegui.json [14:10:49] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:12:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow7001.magru.wmnet with OS bookworm [14:12:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61722 and previous config saved to /var/cache/conftool/dbconfig/20240502-141248-root.json [14:13:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61723 and previous config saved to /var/cache/conftool/dbconfig/20240502-141339-root.json [14:13:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61724 and previous config saved to /var/cache/conftool/dbconfig/20240502-141344-root.json [14:15:42] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:15:50] (03PS1) 10Ayounsi: magru: add netflow7001 to kafka ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1026577 
(https://phabricator.wikimedia.org/T362421) [14:15:51] (03PS9) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [14:16:23] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [14:17:02] (03PS1) 10Ayounsi: magru: set netflow collector IP [homer/public] - 10https://gerrit.wikimedia.org/r/1026578 [14:18:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:18:33] (03PS10) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [14:18:41] PROBLEM - Check whether ferm is active by checking the default input chain on mw2412 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:19:06] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [14:19:55] (03Abandoned) 10Ayounsi: magru: add netflow7001 to kafka ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1026577 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [14:20:05] (03CR) 10Ayounsi: [C:03+1] Enable kafka access for netflow7001 [puppet] - 10https://gerrit.wikimedia.org/r/1026573 (owner: 10Muehlenhoff) [14:20:08] (03PS5) 10Eevans: {session,echo}store: update defaults for PKI [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) [14:21:00] (03CR) 10CI reject: [V:04-1] {session,echo}store: update defaults for PKI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [14:21:17] (03CR) 10Elukey: [C:03+2] httpbb: add article-descriptions to Lift Wing tests [puppet] - 10https://gerrit.wikimedia.org/r/1026559 (owner: 10Elukey) [14:21:17] (03PS11) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [14:21:50] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [14:22:15] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9764062 (10andrea.denisse) Hello, here's the output of `ls -la /dev/disk/by-path/` as requested: ` total 0 drwxr-xr-x 2 root root 840 Mar 28 15:46 . drwxr-xr-x 6 root root 120 Aug 22 2023 .. lrwxrwxrwx 1 root root... 
[14:22:45] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.3 refs T361397 [14:22:48] T361397: 1.43.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T361397 [14:23:21] jouncebot: nowandnext [14:23:21] No deployments scheduled for the next 1 hour(s) and 36 minute(s) [14:23:21] In 1 hour(s) and 36 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T1600) [14:23:39] !log hnowlan@deploy1002 Started scap: (no justification provided) [14:23:52] (03PS6) 10Eevans: {session,echo}store: update defaults for PKI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) [14:24:44] (03CR) 10CI reject: [V:04-1] {session,echo}store: update defaults for PKI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [14:25:12] (03CR) 10Eevans: [C:03+2] Add new user jsherman (deployment group) [puppet] - 10https://gerrit.wikimedia.org/r/1025847 (https://phabricator.wikimedia.org/T363377) (owner: 10Eevans) [14:25:22] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:25:25] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:25:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P61725 and previous config saved to /var/cache/conftool/dbconfig/20240502-142554-marostegui.json [14:26:01] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:26:03] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:26:19] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:26:23] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:26:55] !log hnowlan@deploy1002 Finished scap: (no justification provided) (duration: 03m 16s) [14:27:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61726 and previous config saved to /var/cache/conftool/dbconfig/20240502-142754-root.json [14:28:25] jouncebot: next [14:28:25] In 1 hour(s) and 31 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T1600) [14:28:41] (03PS11) 10Effie Mouzeli: admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) [14:28:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61727 and previous config saved to /var/cache/conftool/dbconfig/20240502-142844-root.json [14:28:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61728 and previous config saved to /var/cache/conftool/dbconfig/20240502-142850-root.json [14:28:59] (03CR) 10Effie Mouzeli: admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [14:38:54] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:56] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7001.magru.wmnet with reason: host reimage [14:40:08] (03PS12) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 
10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [14:40:43] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [14:41:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P61729 and previous config saved to /var/cache/conftool/dbconfig/20240502-144101-marostegui.json [14:42:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7001.magru.wmnet with reason: host reimage [14:43:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61730 and previous config saved to /var/cache/conftool/dbconfig/20240502-144300-root.json [14:43:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61731 and previous config saved to /var/cache/conftool/dbconfig/20240502-144350-root.json [14:43:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61732 and previous config saved to /var/cache/conftool/dbconfig/20240502-144356-root.json [14:45:53] (03CR) 10Elukey: [C:03+2] role::sessionstore: move Cassandra instances to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1025810 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [14:48:19] (03PS1) 10Santiago Faci: mediawiki_history_reduced_snaphost automation: Updating editor-analytics helmfiles to deploy to staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026586 (https://phabricator.wikimedia.org/T355408) [14:48:40] RECOVERY - Check whether ferm is active by checking the default input chain on mw2412 is OK: 
OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:49:08] (03CR) 10Andrew Bogott: "do we know for sure that g10k implements ignored_branch?" [puppet] - 10https://gerrit.wikimedia.org/r/1025818 (owner: 10Andrew Bogott) [14:50:01] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2004*: Move to PKI Truststore - elukey@cumin1002 [14:50:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7004.magru.wmnet [14:50:37] (03CR) 10Andrew Bogott: "The string 'ignored_branch' doesn't appear in the g10k source or documentation, but maybe I'm missing something :)" [puppet] - 10https://gerrit.wikimedia.org/r/1025818 (owner: 10Andrew Bogott) [14:55:02] (03PS1) 10Muehlenhoff: Add magru02 to netbox config [puppet] - 10https://gerrit.wikimedia.org/r/1026587 (https://phabricator.wikimedia.org/T363978) [14:55:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:56:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2004*: Move to PKI Truststore - elukey@cumin1002 [14:56:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T361627)', diff saved to https://phabricator.wikimedia.org/P61733 and previous config saved to /var/cache/conftool/dbconfig/20240502-145609-marostegui.json [14:56:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1190.eqiad.wmnet with reason: Maintenance [14:56:12] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:56:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1190.eqiad.wmnet with reason: Maintenance [14:56:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T361627)', diff saved to 
https://phabricator.wikimedia.org/P61734 and previous config saved to /var/cache/conftool/dbconfig/20240502-145632-marostegui.json [14:56:58] PROBLEM - cassandra-a SSL 10.192.16.247:7000 on sessionstore2004 is CRITICAL: SSL CRITICAL - failed to verify sessionstore2004-a against sessionstore2004-a.codfw.wmnet, cassandra, sessionstore2004.codfw.wmnet:Certificate sessionstore2004-a.codfw.wmnet valid until 2024-05-30 14:44:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:58:02] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for ganeti/magru02 - jmm@cumin2002" [14:58:54] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61735 and previous config saved to /var/cache/conftool/dbconfig/20240502-145856-root.json [14:59:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61736 and previous config saved to /var/cache/conftool/dbconfig/20240502-145901-root.json [14:59:52] (03CR) 10Muehlenhoff: [C:03+2] Enable kafka access for netflow7001 [puppet] - 10https://gerrit.wikimedia.org/r/1026573 (owner: 10Muehlenhoff) [15:00:09] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore200[5,6]*: Move to PKI Truststore - elukey@cumin1002 [15:00:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7004.magru.wmnet [15:03:08] (03PS2) 10Ayounsi: magru: set netflow collector IP 
[homer/public] - 10https://gerrit.wikimedia.org/r/1026578 (https://phabricator.wikimedia.org/T362421) [15:03:20] !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=inference,name=codfw [15:04:03] (03CR) 10Hnowlan: [C:03+2] trafficserver: move 85% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1026159 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [15:04:31] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1026578 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [15:05:00] (03CR) 10Ayounsi: [C:03+2] magru: set netflow collector IP [homer/public] - 10https://gerrit.wikimedia.org/r/1026578 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [15:05:36] (03Merged) 10jenkins-bot: magru: set netflow collector IP [homer/public] - 10https://gerrit.wikimedia.org/r/1026578 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [15:06:30] (03CR) 10Vgutierrez: [C:03+1] magru: add ncredir configuration [puppet] - 10https://gerrit.wikimedia.org/r/1026526 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [15:07:04] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026593 (https://phabricator.wikimedia.org/T349774) [15:07:37] (03CR) 10Vgutierrez: [C:03+1] P:cumin: update aliases for ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1026562 (owner: 10Ssingh) [15:08:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T361627)', diff saved to https://phabricator.wikimedia.org/P61737 and previous config saved to /var/cache/conftool/dbconfig/20240502-150812-marostegui.json [15:08:16] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:08:34] (03PS13) 10Paladox: Allow users to recheck tests in 
checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [15:09:03] (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026593 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [15:09:09] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [15:10:09] (03CR) 10Btullis: mediawiki_history_reduced_snaphost automation: Updating editor-analytics helmfiles to deploy to staging environment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026586 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [15:10:13] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026593 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [15:10:33] !log Move mw-on-k8s traffic percentage from 80% to 85% [15:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:36] (03CR) 10Vgutierrez: "looking good, almost there 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [15:11:38] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:12:00] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:12:01] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [15:12:20] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore200[5,6]*: Move to PKI Truststore - elukey@cumin1002 [15:12:38] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [15:12:39] !log dani@deploy1002 helmfile [codfw] START 
helmfile.d/services/miscweb: apply [15:13:00] PROBLEM - cassandra-a SSL 10.192.48.242:7000 on sessionstore2006 is CRITICAL: SSL CRITICAL - failed to verify sessionstore2006-a against sessionstore2006-a.codfw.wmnet, cassandra, sessionstore2006.codfw.wmnet:Certificate sessionstore2006-a.codfw.wmnet valid until 2024-05-30 14:54:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:13:00] PROBLEM - cassandra-a SSL 10.192.32.237:7000 on sessionstore2005 is CRITICAL: SSL CRITICAL - failed to verify sessionstore2005-a against sessionstore2005-a.codfw.wmnet, cassandra, sessionstore2005.codfw.wmnet:Certificate sessionstore2005-a.codfw.wmnet valid until 2024-05-30 14:53:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:13:07] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [15:13:26] the above errors are old nagios ones (sessionstore), nothing is on fire --^ [15:13:33] the new prometheus alerts are ok [15:14:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61738 and previous config saved to /var/cache/conftool/dbconfig/20240502-151403-root.json [15:14:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61739 and previous config saved to /var/cache/conftool/dbconfig/20240502-151407-root.json [15:15:44] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1*: Move to PKI Truststore - elukey@cumin1002 [15:15:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for ganeti/magru02 - jmm@cumin2002" [15:15:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[15:16:13] (03CR) 10Milimetric: mediawiki_history_reduced_snaphost automation: Updating editor-analytics helmfiles to deploy to staging environment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026586 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [15:18:54] FIRING: ProbeDown: Service sessionstore1006-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore1006-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:27] FIRING: [4x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:40] elukey: related? ^^^ [15:21:07] (03CR) 10Muehlenhoff: [C:03+2] Druid: overlord/coordinator: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1024541 (owner: 10Muehlenhoff) [15:21:42] volans: yeah so puppet ran on those nodes and added the new blackbox exporter, but the new cert is still not there (pending restarts of cassandras) so it fires [15:21:49] but nothing is really on fire [15:21:55] I am checking the status of the cluster [15:22:04] (I am getting rid of the nagios alerts too) [15:22:11] great [15:23:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P61740 and previous config saved to /var/cache/conftool/dbconfig/20240502-152319-marostegui.json [15:23:54] FIRING: [4x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:24:02] PROBLEM - 
IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 45 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:24:16] (03CR) 10Muehlenhoff: [C:03+2] druid::broker: Switch public workers to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1024409 (owner: 10Muehlenhoff) [15:26:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7001.magru.wmnet with OS bookworm [15:26:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7001.magru.wmnet [15:26:39] (03CR) 10Muehlenhoff: [C:03+2] druid::broker: Switch analytics workers to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1024410 (owner: 10Muehlenhoff) [15:27:12] (03PS1) 10Milimetric: Update commons impact metrics readme [puppet] - 10https://gerrit.wikimedia.org/r/1026597 (https://phabricator.wikimedia.org/T358701) [15:27:38] (03CR) 10Milimetric: [C:03+1] Update commons impact metrics readme [puppet] - 10https://gerrit.wikimedia.org/r/1026597 (https://phabricator.wikimedia.org/T358701) (owner: 10Milimetric) [15:29:02] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 29 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:30:27] RESOLVED: [4x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:34:10] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1*: Move to PKI Truststore - elukey@cumin1002 [15:34:17] 
(03PS2) 10Elukey: role::sessionstore: cleanup unused TLS settings after PKI migration [puppet] - 10https://gerrit.wikimedia.org/r/1025811 (https://phabricator.wikimedia.org/T352647) [15:35:13] (03CR) 10Ayounsi: [C:03+1] Add magru02 to netbox config [puppet] - 10https://gerrit.wikimedia.org/r/1026587 (https://phabricator.wikimedia.org/T363978) (owner: 10Muehlenhoff) [15:36:06] (03CR) 10Eevans: [C:03+1] role::sessionstore: cleanup unused TLS settings after PKI migration [puppet] - 10https://gerrit.wikimedia.org/r/1025811 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:38:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P61741 and previous config saved to /var/cache/conftool/dbconfig/20240502-153828-marostegui.json [15:39:45] !log sukhe@cumin1002 START - Cookbook sre.ganeti.makevm for new host durum7001.magru.wmnet [15:39:46] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [15:40:29] (03CR) 10Elukey: [C:03+2] role::sessionstore: cleanup unused TLS settings after PKI migration [puppet] - 10https://gerrit.wikimedia.org/r/1025811 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:41:12] (03CR) 10Ssingh: [V:03+1 C:03+2] P:cumin: update aliases for ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1026562 (owner: 10Ssingh) [15:41:52] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7001.magru.wmnet - sukhe@cumin1002" [15:42:43] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7001.magru.wmnet - sukhe@cumin1002" [15:42:43] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:42:43] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache durum7001.magru.wmnet on all recursors [15:42:47] 
!log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7001.magru.wmnet on all recursors [15:43:07] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7001.magru.wmnet - sukhe@cumin1002" [15:43:55] (03CR) 10Muehlenhoff: "Looks fine per se, but needs sign off in the next SRE IF weekly meeting (next Monday)" [puppet] - 10https://gerrit.wikimedia.org/r/1026194 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [15:43:59] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7001.magru.wmnet - sukhe@cumin1002" [15:44:30] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum7001.magru.wmnet with OS bookworm [15:47:05] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9764334 (10Eevans) 05In progress→03Resolved Hi @jsn.sherman, You've been added to the deployment group, your shell username is `jsn` (same as wmf cloud). Let me know i... 
[15:51:28] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org,service=(authdns-update|recdns|ntp) [15:51:32] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org,service=(authdns-update|recdns|ntp) [15:51:48] !log installing postgresql-15 security updates [15:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T361627)', diff saved to https://phabricator.wikimedia.org/P61743 and previous config saved to /var/cache/conftool/dbconfig/20240502-155336-marostegui.json [15:53:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1199.eqiad.wmnet with reason: Maintenance [15:53:39] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:53:42] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Apply updated JDK 8 - eevans@cumin1002 [15:53:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1199.eqiad.wmnet with reason: Maintenance [15:53:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T361627)', diff saved to https://phabricator.wikimedia.org/P61744 and previous config saved to /var/cache/conftool/dbconfig/20240502-155359-marostegui.json [15:54:35] PROBLEM - HTTPS Ganeti RAPI magru on ganeti7001 is CRITICAL: Name or service not known https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [15:55:13] PROBLEM - Host lsw1-a7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:13] PROBLEM - Host lsw1-b8-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:19] hm? 
[15:55:37] PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:55:39] PROBLEM - Host ps1-b8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:56:06] !log running authdns-update [15:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:30] RESOLVED: [2x] Not accepting/receiving prefixes from anycast BGP peer: Device asw1-b3-magru.mgmt.magru.wmnet recovered from Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [16:00:05] jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:04] (03PS1) 10Dreamrimmer: Add tm: as alias to template: on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026604 (https://phabricator.wikimedia.org/T363757) [16:01:05] RECOVERY - Host mw2382 is UP: PING WARNING - Packet loss = 80%, RTA = 40.86 ms [16:03:58] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [16:04:09] (03PS1) 10Esanders: Release DT visual enhancements to all except Wikipedia/Commons/Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026605 (https://phabricator.wikimedia.org/T352087) [16:05:24] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:06:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T361627)', diff saved to https://phabricator.wikimedia.org/P61746 and previous config saved to /var/cache/conftool/dbconfig/20240502-160606-marostegui.json [16:06:10] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [16:07:08] (03PS1) 10Ahmon Dancy: Use buildkit wmf-v0.13.2-1 on WMCS and 
trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/1026607 (https://phabricator.wikimedia.org/T364013) [16:07:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:28] (03CR) 10CI reject: [V:04-1] Use buildkit wmf-v0.13.2-1 on WMCS and trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/1026607 (https://phabricator.wikimedia.org/T364013) (owner: 10Ahmon Dancy) [16:08:03] (03PS2) 10Ahmon Dancy: Use buildkit wmf-v0.13.2-1 on WMCS and trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/1026607 (https://phabricator.wikimedia.org/T364013) [16:08:11] FIRING: Temperature: Inlet Temp issue on ms-be2077:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=ms-be2077 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [16:08:22] (03CR) 10CI reject: [V:04-1] Use buildkit wmf-v0.13.2-1 on WMCS and trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/1026607 (https://phabricator.wikimedia.org/T364013) (owner: 10Ahmon Dancy) [16:08:24] (03PS1) 10Santiago Faci: MPIC v.0.0.3: Deploying to staging environment a new test version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026608 (https://phabricator.wikimedia.org/T362144) [16:09:17] (03PS3) 10Ahmon Dancy: Use buildkit wmf-v0.13.2-1 on WMCS and trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/1026607 (https://phabricator.wikimedia.org/T364013) [16:09:36] (03PS1) 10Effie Mouzeli: (WIP) memcached: make the service run under the memcache user [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [16:09:53] (03CR) 10Clare Ming: [C:03+2] MPIC v.0.0.3: Deploying to staging 
environment a new test version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026608 (https://phabricator.wikimedia.org/T362144) (owner: 10Santiago Faci) [16:10:45] (03Merged) 10jenkins-bot: MPIC v.0.0.3: Deploying to staging environment a new test version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026608 (https://phabricator.wikimedia.org/T362144) (owner: 10Santiago Faci) [16:11:01] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [16:12:27] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [16:12:43] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [16:12:51] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: force update dns7x - sukhe@cumin1002" [16:13:19] 06SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9764471 (10elukey) To summarize previous discussions: we are currently relying on a TLS cert emitted by the puppet CA via cergen, a tool that we are trying to deprecate (see T... 
[16:13:38] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7001.magru.wmnet with reason: host reimage [16:14:02] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: force update dns7x - sukhe@cumin1002" [16:14:02] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:15:29] !log running authdns-update once again to confirm state of dns700[12] [16:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:44] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7001.magru.wmnet with reason: host reimage [16:17:32] (03CR) 10JHathaway: "I'm not very familiar with how g10k checks out code. We don't have any disk space issues in prod, how does WMCS' use differ?" [puppet] - 10https://gerrit.wikimedia.org/r/1025818 (owner: 10Andrew Bogott) [16:20:47] !log amastilovic@deploy1002 Started deploy [airflow-dags/analytics@7513bfa]: (no justification provided) [16:21:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P61747 and previous config saved to /var/cache/conftool/dbconfig/20240502-162114-marostegui.json [16:21:22] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] grafana: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1025860 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:21:31] !log amastilovic@deploy1002 Finished deploy [airflow-dags/analytics@7513bfa]: (no justification provided) (duration: 00m 44s) [16:23:11] RESOLVED: Temperature: Inlet Temp issue on ms-be2077:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=ms-be2077 - 
https://alerts.wikimedia.org/?q=alertname%3DTemperature [16:32:36] 10ops-codfw: Order (2) Bag of Rack Studs for codfw - https://phabricator.wikimedia.org/T364026 (10Jhancock.wm) 03NEW [16:32:43] 10ops-codfw: Order (2) Bag of Rack Studs for codfw - https://phabricator.wikimedia.org/T364026#9764515 (10Jhancock.wm) p:05Triage→03High [16:35:45] (03PS1) 10Btullis: Change the destination directory for commons impact metrics dumps [puppet] - 10https://gerrit.wikimedia.org/r/1026611 (https://phabricator.wikimedia.org/T358701) [16:36:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P61748 and previous config saved to /var/cache/conftool/dbconfig/20240502-163622-marostegui.json [16:36:32] (03PS1) 10Santiago Faci: mpic v0.0.4: Deploying to staging a first version for mpic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026612 (https://phabricator.wikimedia.org/T362144) [16:36:48] (03CR) 10Clare Ming: [C:03+2] mpic v0.0.4: Deploying to staging a first version for mpic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026612 (https://phabricator.wikimedia.org/T362144) (owner: 10Santiago Faci) [16:37:16] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2232/co" [puppet] - 10https://gerrit.wikimedia.org/r/1026611 (https://phabricator.wikimedia.org/T358701) (owner: 10Btullis) [16:37:40] (03Merged) 10jenkins-bot: mpic v0.0.4: Deploying to staging a first version for mpic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026612 (https://phabricator.wikimedia.org/T362144) (owner: 10Santiago Faci) [16:38:14] (03CR) 10Btullis: [V:03+1 C:03+2] Change the destination directory for commons impact metrics dumps [puppet] - 10https://gerrit.wikimedia.org/r/1026611 (https://phabricator.wikimedia.org/T358701) (owner: 10Btullis) [16:38:54] !log sfaci@deploy1002 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [16:39:07] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [16:40:13] (03CR) 10Dzahn: "these don't seem to exist in DNS yet?" [puppet] - 10https://gerrit.wikimedia.org/r/1025445 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [16:40:38] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7001.magru.wmnet with OS bookworm [16:40:39] !log sukhe@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7001.magru.wmnet [16:40:44] (03PS1) 10Btullis: Update the link in the commons_impact_metrics readme file [puppet] - 10https://gerrit.wikimedia.org/r/1026613 (https://phabricator.wikimedia.org/T358701) [16:41:57] (03CR) 10Xcollazo: [C:03+1] Update the link in the commons_impact_metrics readme file [puppet] - 10https://gerrit.wikimedia.org/r/1026613 (https://phabricator.wikimedia.org/T358701) (owner: 10Btullis) [16:42:12] (03CR) 10Btullis: [C:03+2] Update the link in the commons_impact_metrics readme file [puppet] - 10https://gerrit.wikimedia.org/r/1026613 (https://phabricator.wikimedia.org/T358701) (owner: 10Btullis) [16:43:44] (03CR) 10Dzahn: [C:03+1] "confirmed the aliases. lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1024808 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [16:44:58] (03CR) 10Dzahn: [C:03+1] "so since https://gerrit.wikimedia.org/r/c/operations/puppet/+/1025860 is merged I assume the names are on the certs now and you can go ahe" [puppet] - 10https://gerrit.wikimedia.org/r/1024808 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [16:49:48] (03CR) 10Dzahn: [C:03+2] "only affects cloud test/staging setup" [puppet] - 10https://gerrit.wikimedia.org/r/1026195 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [16:50:12] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/1025877 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:50:44] (03PS1) 10Santiago Faci: mpic: bumping version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026614 (https://phabricator.wikimedia.org/T362144) [16:51:06] (03CR) 10Dzahn: [C:03+1] ssl: Delete dummy TLS key for the Grafana hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1025877 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:51:13] (03CR) 10Clare Ming: [C:03+2] mpic: bumping version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026614 (https://phabricator.wikimedia.org/T362144) (owner: 10Santiago Faci) [16:51:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T361627)', diff saved to https://phabricator.wikimedia.org/P61749 and previous config saved to /var/cache/conftool/dbconfig/20240502-165129-marostegui.json [16:51:33] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [16:51:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1221.eqiad.wmnet with reason: Maintenance [16:51:47] !log marostegui@cumin1002 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1221.eqiad.wmnet with reason: Maintenance [16:51:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:51:57] (03Merged) 10jenkins-bot: mpic: bumping version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026614 (https://phabricator.wikimedia.org/T362144) (owner: 10Santiago Faci) [16:52:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:52:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T361627)', diff saved to https://phabricator.wikimedia.org/P61750 and previous config saved to /var/cache/conftool/dbconfig/20240502-165211-marostegui.json [16:52:47] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [16:53:01] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [16:55:49] RECOVERY - HTTPS Ganeti RAPI magru on ganeti7001 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.021 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [16:57:03] (03CR) 10Andrea Denisse: [C:03+2] trafficserver: Add discovery entries for grafana and grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/1024808 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [16:58:28] (03CR) 10Dzahn: [C:03+1] "only affects devtools test setup. temp removing this problematic repo on a fresh deployment_server, will create a revert to discuss more t" [puppet] - 10https://gerrit.wikimedia.org/r/1026198 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [17:00:04] bd808: #bothumor My software never has bugs. It just develops random features. 
Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T1700). [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T1700) [17:01:14] (03CR) 10Dzahn: [C:03+2] devtools: remove gervert from deployed repos in cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1026198 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [17:03:23] (03PS1) 10Btullis: Update the path of the readme file for commons impact metrics [puppet] - 10https://gerrit.wikimedia.org/r/1026618 (https://phabricator.wikimedia.org/T358701) [17:03:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T361627)', diff saved to https://phabricator.wikimedia.org/P61751 and previous config saved to /var/cache/conftool/dbconfig/20240502-170332-marostegui.json [17:03:36] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:04:28] (03CR) 10Btullis: [C:03+2] Update the path of the readme file for commons impact metrics [puppet] - 10https://gerrit.wikimedia.org/r/1026618 (https://phabricator.wikimedia.org/T358701) (owner: 10Btullis) [17:05:46] !log sukhe@cumin1002 START - Cookbook sre.ganeti.makevm for new host doh7001.wikimedia.org [17:05:47] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [17:12:08] (03PS1) 10Santiago Faci: mpic: bumping version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026620 (https://phabricator.wikimedia.org/T362144) [17:12:24] (03CR) 10Clare Ming: [C:03+2] mpic: bumping version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026620 (https://phabricator.wikimedia.org/T362144) (owner: 10Santiago Faci) [17:13:11] (03Merged) 10jenkins-bot: mpic: bumping version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1026620 (https://phabricator.wikimedia.org/T362144) (owner: 10Santiago Faci) [17:15:10] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [17:15:23] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [17:16:46] (03CR) 10Dzahn: [C:03+1] Automate quarterly Phabricator data for WMF QLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024348 (https://phabricator.wikimedia.org/T362804) (owner: 10Aklapper) [17:18:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P61752 and previous config saved to /var/cache/conftool/dbconfig/20240502-171840-marostegui.json [17:24:38] !log brett@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir7001.magru.wmnet [17:24:40] !log brett@cumin2002 START - Cookbook sre.dns.netbox [17:25:21] (03CR) 10Andrea Denisse: [V:03+2] ssl: Delete dummy TLS key for the Grafana hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1025877 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:25:44] (03CR) 10Andrea Denisse: [V:03+2 C:04-2] ssl: Delete dummy TLS key for the Grafana hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1025877 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:25:52] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] ssl: Delete dummy TLS key for the Grafana hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1025877 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:31:49] (03PS1) 10Dzahn: delete cert for query-preview.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1026622 [17:33:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P61753 and previous config saved to /var/cache/conftool/dbconfig/20240502-173349-marostegui.json 
[17:34:04] (03PS2) 10Dzahn: delete cert for query-preview.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1026622 (https://phabricator.wikimedia.org/T333656) [17:36:14] (03PS1) 10Dzahn: delete fake key for query-preview.wikidata.org [labs/private] - 10https://gerrit.wikimedia.org/r/1026623 (https://phabricator.wikimedia.org/T333656) [17:36:21] (03PS1) 10Andrea Denisse: ssl: Delete unused certificate for the Grafana hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026624 (https://phabricator.wikimedia.org/T360414) [17:36:47] (03PS2) 10Dzahn: delete fake key for query-preview.wikidata.org [labs/private] - 10https://gerrit.wikimedia.org/r/1026623 (https://phabricator.wikimedia.org/T333656) [17:37:09] (03CR) 10Dzahn: [C:03+1] ssl: Delete unused certificate for the Grafana hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026624 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:38:26] (03PS1) 10Andrea Denisse: ssl: Delete unused certificate for the thanos-query hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026625 (https://phabricator.wikimedia.org/T360414) [17:38:32] (03CR) 10Andrea Denisse: [C:03+2] ssl: Delete unused certificate for the Grafana hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026624 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:39:04] (03PS1) 10Dzahn: delete fake key for thanos-query.discovery.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/1026646 [17:40:51] (03PS1) 10Dzahn: delete fake key for prometheus.wikimedia.org [labs/private] - 10https://gerrit.wikimedia.org/r/1026647 [17:41:07] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9764859 (10andrea.denisse) [17:41:58] (03CR) 10Milimetric: [C:03+2] "Checked with Santi that these are just adding common changes to values.yaml and indeed meant for just staging deploy. 
The "datacentre" se" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026586 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [17:42:51] (03Merged) 10jenkins-bot: mediawiki_history_reduced_snaphost automation: Updating editor-analytics helmfiles to deploy to staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026586 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci) [17:42:59] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [labs/private] - 10https://gerrit.wikimedia.org/r/1026646 (owner: 10Dzahn) [17:43:55] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [labs/private] - 10https://gerrit.wikimedia.org/r/1026647 (owner: 10Dzahn) [17:44:36] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1026622 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [17:44:59] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [labs/private] - 10https://gerrit.wikimedia.org/r/1026623 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [17:47:29] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9764866 (10andrea.denisse) [17:47:33] (03PS2) 10Dzahn: delete fake key for thanos-query.discovery.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/1026646 [17:47:38] (03CR) 10Andrea Denisse: [C:03+2] delete fake key for thanos-query.discovery.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/1026646 (owner: 10Dzahn) [17:47:41] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] delete fake key for thanos-query.discovery.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/1026646 (owner: 10Dzahn) [17:47:52] (03PS2) 10Dzahn: delete fake key for prometheus.wikimedia.org [labs/private] - 10https://gerrit.wikimedia.org/r/1026647 [17:47:54] (03CR) 10Andrea Denisse: [C:03+2] delete fake key for prometheus.wikimedia.org 
[labs/private] - 10https://gerrit.wikimedia.org/r/1026647 (owner: 10Dzahn) [17:47:56] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] delete fake key for prometheus.wikimedia.org [labs/private] - 10https://gerrit.wikimedia.org/r/1026647 (owner: 10Dzahn) [17:48:53] (03CR) 10Andrea Denisse: [C:03+2] ssl: Delete unused certificate for the thanos-query hosts [puppet] - 10https://gerrit.wikimedia.org/r/1026625 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:48:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T361627)', diff saved to https://phabricator.wikimedia.org/P61754 and previous config saved to /var/cache/conftool/dbconfig/20240502-174856-marostegui.json [17:48:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1241.eqiad.wmnet with reason: Maintenance [17:49:00] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:49:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1241.eqiad.wmnet with reason: Maintenance [17:49:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T361627)', diff saved to https://phabricator.wikimedia.org/P61755 and previous config saved to /var/cache/conftool/dbconfig/20240502-174920-marostegui.json [17:50:02] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [17:50:24] (03PS1) 10Dzahn: delete phabricator-stage-1001.devtools.eqiad.wmflabs.key [labs/private] - 10https://gerrit.wikimedia.org/r/1026649 [17:52:40] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:53:16] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [17:53:36] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [17:54:57] (03PS1) 10Dzahn: delete 
bugzilla.wikimedia.org.key [labs/private] - 10https://gerrit.wikimedia.org/r/1026653 [17:55:01] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:55:01] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache doh7001.wikimedia.org on all recursors [17:55:04] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7001.wikimedia.org on all recursors [17:55:09] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [17:57:27] (03PS1) 10Ssingh: Revert "Revert "magru: depool geoip/text*"" [dns] - 10https://gerrit.wikimedia.org/r/1026627 [17:58:03] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [18:00:04] jnuche and brennen: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T1800). [18:00:20] (03PS18) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) [18:01:14] (03PS1) 10Ssingh: magru: continue with depooled in admin_state [dns] - 10https://gerrit.wikimedia.org/r/1026655 [18:01:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T361627)', diff saved to https://phabricator.wikimedia.org/P61756 and previous config saved to /var/cache/conftool/dbconfig/20240502-180136-marostegui.json [18:01:39] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:02:34] (03PS3) 10Jforrester: Pre-emptively disable DiscussionToolsEnableThanks (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026511 (owner: 10Esanders) [18:02:40] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [18:02:43] (03CR) 10Ssingh: [C:03+2] magru: continue with depooled in admin_state [dns] - 10https://gerrit.wikimedia.org/r/1026655 (owner: 10Ssingh) [18:03:24] (03PS1) 10Ssingh: fix typo in admin_state [dns] - 10https://gerrit.wikimedia.org/r/1026656 [18:05:02] (03CR) 10Ssingh: [C:03+2] fix typo in admin_state [dns] - 10https://gerrit.wikimedia.org/r/1026656 (owner: 10Ssingh) [18:05:18] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Apply updated JDK 8 - eevans@cumin1002 [18:05:34] (03CR) 10CDobbins: [V:03+1] purged: add PKI cert handling (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [18:07:40] (03CR) 10Dzahn: [V:03+2 C:03+2] delete fake key for query-preview.wikidata.org [labs/private] - 10https://gerrit.wikimedia.org/r/1026623 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [18:08:49] (03CR) 10Dzahn: [V:03+2 C:03+2] "instance doesn't exist anymore" [labs/private] - 10https://gerrit.wikimedia.org/r/1026649 (owner: 10Dzahn) [18:08:55] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM doh7001.wikimedia.org - sukhe@cumin1002" [18:09:49] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM doh7001.wikimedia.org - sukhe@cumin1002" [18:09:50] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:09:50] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache doh7001.wikimedia.org on all recursors [18:09:53] !log sukhe@cumin1002 END 
(PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7001.wikimedia.org on all recursors [18:09:56] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh7001.wikimedia.org [18:10:02] !log sukhe@cumin1002 START - Cookbook sre.ganeti.makevm for new host doh7001.wikimedia.org [18:10:03] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [18:11:37] !log magru: setting weights on cp servers and pooling [18:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:57] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:16:31] RECOVERY - PyBal IPVS diff check on lvs7003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:16:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P61758 and previous config saved to /var/cache/conftool/dbconfig/20240502-181643-marostegui.json [18:17:01] RECOVERY - PyBal IPVS diff check on lvs7002 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:17:19] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7001.magru.wmnet - brett@cumin2002" [18:17:23] RECOVERY - PyBal IPVS diff check on lvs7001 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:18:12] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7001.magru.wmnet - brett@cumin2002" [18:18:12] !log brett@cumin2002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [18:18:12] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir7001.magru.wmnet on all recursors [18:18:15] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7001.magru.wmnet on all recursors [18:18:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:18:37] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7001.magru.wmnet - brett@cumin2002" [18:19:32] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7001.magru.wmnet - brett@cumin2002" [18:20:02] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir7001.magru.wmnet with OS bookworm [18:20:42] FIRING: [26x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:22:44] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7001.wikimedia.org - sukhe@cumin1002" [18:22:48] !log sudo cumin -b1 -s900 "A:dnsbox" "systemctl restart ntp.service" [18:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:36] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7001.wikimedia.org - sukhe@cumin1002" [18:23:36] !log 
sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:23:36] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache doh7001.wikimedia.org on all recursors [18:23:40] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7001.wikimedia.org on all recursors [18:24:08] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7001.wikimedia.org - sukhe@cumin1002" [18:31:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P61759 and previous config saved to /var/cache/conftool/dbconfig/20240502-183151-marostegui.json [18:33:48] (03PS1) 10Andrea Denisse: logstash: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1025879 (https://phabricator.wikimedia.org/T360414) [18:35:50] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Apply updated JDK 8 - eevans@cumin1002 [18:40:56] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7001.wikimedia.org - sukhe@cumin1002" [18:41:15] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host doh7001.wikimedia.org with OS bookworm [18:42:13] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1005 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:46:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T361627)', diff saved to https://phabricator.wikimedia.org/P61760 and previous config saved to /var/cache/conftool/dbconfig/20240502-184658-marostegui.json [18:47:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1242.eqiad.wmnet with reason: Maintenance [18:47:02] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:47:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1242.eqiad.wmnet with reason: Maintenance [18:47:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T361627)', diff saved to https://phabricator.wikimedia.org/P61761 and previous config saved to /var/cache/conftool/dbconfig/20240502-184710-marostegui.json [18:48:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:49:10] (03PS13) 10Bking: search-platform: monitoring/alert on upstream MW API errors [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [18:49:53] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 94716736 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:50:35] (03PS14) 10Bking: search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [18:50:40] (03CR) 10CI reject: [V:04-1] search-platform: monitor/alert on elastic request failures [alerts] - 
10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [18:50:55] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 26576 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:51:33] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:51:41] (03CR) 10CI reject: [V:04-1] search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [18:53:18] (03PS15) 10Bking: search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [18:54:34] (03CR) 10CI reject: [V:04-1] search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [18:56:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P61762 and previous config saved to /var/cache/conftool/dbconfig/20240502-185609-ladsgroup.json [18:57:34] (03PS16) 10Bking: search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [18:58:40] (03CR) 10CI reject: [V:04-1] search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [18:58:54] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:59:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T361627)', diff saved to https://phabricator.wikimedia.org/P61763 and previous config saved to /var/cache/conftool/dbconfig/20240502-185926-marostegui.json [18:59:31] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [19:03:46] (03PS17) 10Bking: search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [19:04:53] (03CR) 10CI reject: [V:04-1] search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [19:07:19] (03PS1) 10Jdrewniak: Deploy Vector appearance menu and increased font-size to plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026663 (https://phabricator.wikimedia.org/T362147) [19:07:35] (03PS18) 10Bking: search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [19:08:09] jouncebot: nowandnext [19:08:09] For the next 0 hour(s) and 51 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T1800) [19:08:09] In 0 hour(s) and 51 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T2000) [19:08:42] (03CR) 10CI reject: [V:04-1] search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [19:08:48] !log sukhe@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on doh7001.wikimedia.org with reason: host reimage [19:11:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P61764 and previous config saved to /var/cache/conftool/dbconfig/20240502-191115-ladsgroup.json [19:11:58] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7001.wikimedia.org with reason: host reimage [19:12:19] (03PS19) 10Bking: search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [19:13:25] (03CR) 10CI reject: [V:04-1] search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [19:14:33] jnuche: brennen: are you using this deploy window? [19:14:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P61765 and previous config saved to /var/cache/conftool/dbconfig/20240502-191434-marostegui.json [19:16:39] (03PS20) 10Bking: search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [19:16:49] (03CR) 10Andrew Bogott: "The short version is: g10k copies every branch even though it only ever deploys the 'production' branch. 
If that branch then goes away in " [puppet] - 10https://gerrit.wikimedia.org/r/1025818 (owner: 10Andrew Bogott) [19:17:46] (03CR) 10CI reject: [V:04-1] search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [19:20:11] (03PS21) 10Bking: search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [19:21:17] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [19:21:46] (03CR) 10Bking: search-platform: monitor/alert on elastic request failures (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [19:21:55] (03CR) 10Ryan Kemper: search-platform: monitor/alert on elastic request failures (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [19:22:13] (03CR) 10Ryan Kemper: [C:03+1] search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [19:22:21] (03CR) 10Bking: [C:03+2] search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [19:23:26] cdanis: I'd say you're safe [19:23:44] thanks dancy <3 [19:24:17] (03Merged) 10jenkins-bot: search-platform: monitor/alert on elastic request failures [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [19:26:08] actually dancy, if you're around, and you have +2 in mediawiki, do you mind stamping https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1021521 ? 
[19:26:18] Taking a look [19:26:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: Maint over', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20240502-192621-ladsgroup.json [19:26:53] https://measure-magru.wikimedia.org/measure returns an error right now. Expected? [19:27:54] (Error: 421, Misdirected Request) [19:28:07] hm, I'll fix that [19:28:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:28:49] FIRING: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:29:26] !incidents [19:29:26] 4649 (UNACKED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [19:29:27] 4648 (RESOLVED) db1175 (paged)/MariaDB Replica SQL: s3 (paged) [19:29:27] 4647 (RESOLVED) db1189 (paged)/MariaDB Replica SQL: s3 (paged) [19:29:31] o/ [19:29:34] !ack 4649 [19:29:35] 4649 (ACKED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [19:29:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P61766 and previous config saved to /var/cache/conftool/dbconfig/20240502-192942-marostegui.json [19:29:53] (03PS1) 10CDanis: Add magru to measure_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1026667 (https://phabricator.wikimedia.org/T362902) [19:29:59] herron: known? [19:30:03] phab seems fine? 
still looking [19:30:12] nod, taking a look as well [19:30:24] it was briefly down for me just before that fired, but it is fine now [19:30:29] phab is up for me but has seen some load 5 minutes ago [19:31:22] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [19:31:26] we had that in the past multiple times and worked on it in https://phabricator.wikimedia.org/T362401. let me check superset what caused the spike [19:32:46] !log amastilovic@deploy1002 Started deploy [airflow-dags/analytics@4edc35c]: (no justification provided) [19:32:48] dancy: I've prepped a puppet patch that will fix that, I'm about to submit and merge, I won't scap backport the Mediawiki patch until the puppet patch is deployed [19:33:03] ok. I'll +2 in the meantime [19:33:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:25] !log amastilovic@deploy1002 Finished deploy [airflow-dags/analytics@4edc35c]: (no justification provided) (duration: 00m 38s) [19:33:30] thanks! 
[19:33:36] (03PS1) 10Ryan Kemper: wdqs: enable nfs data reloads on wdqs1021 [puppet] - 10https://gerrit.wikimedia.org/r/1026668 (https://phabricator.wikimedia.org/T362920) [19:33:49] RESOLVED: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:34:48] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026668 (https://phabricator.wikimedia.org/T362920) (owner: 10Ryan Kemper) [19:36:21] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7001.wikimedia.org with OS bookworm [19:36:21] !log sukhe@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7001.wikimedia.org [19:36:33] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [19:36:35] (03CR) 10Fabfur: [C:03+1] "not the reviewer but I like it!" [puppet] - 10https://gerrit.wikimedia.org/r/1026667 (https://phabricator.wikimedia.org/T362902) (owner: 10CDanis) [19:37:04] (03CR) 10CDanis: [C:03+2] Add magru to measure_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1026667 (https://phabricator.wikimedia.org/T362902) (owner: 10CDanis) [19:38:07] (03CR) 10Andrew Bogott: "It's easy to configure g10k to prune out any code that isn't actually being deployed on the current run, but that prune is not run by defa" [puppet] - 10https://gerrit.wikimedia.org/r/1025818 (owner: 10Andrew Bogott) [19:39:38] herro, jhathaway: I reopened https://phabricator.wikimedia.org/T362401 for the phab incident. I can take a closer look at normal working hours. 
[19:39:49] herron ^ typo [19:40:16] thanks jelto [19:41:00] +1 thank you! [19:41:10] (03PS1) 10CDanis: probenet: add magru measurement endpoint [extensions/WikimediaEvents] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1026628 (https://phabricator.wikimedia.org/T362902) [19:41:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P61767 and previous config saved to /var/cache/conftool/dbconfig/20240502-194127-ladsgroup.json [19:42:41] (03CR) 10CDanis: [C:03+2] probenet: add magru measurement endpoint [extensions/WikimediaEvents] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1026628 (https://phabricator.wikimedia.org/T362902) (owner: 10CDanis) [19:44:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T361627)', diff saved to https://phabricator.wikimedia.org/P61768 and previous config saved to /var/cache/conftool/dbconfig/20240502-194450-marostegui.json [19:44:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1243.eqiad.wmnet with reason: Maintenance [19:44:54] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [19:45:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1243.eqiad.wmnet with reason: Maintenance [19:45:11] (03Merged) 10jenkins-bot: probenet: add magru measurement endpoint [extensions/WikimediaEvents] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1026628 (https://phabricator.wikimedia.org/T362902) (owner: 10CDanis) [19:45:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T361627)', diff saved to https://phabricator.wikimedia.org/P61769 and previous config saved to /var/cache/conftool/dbconfig/20240502-194513-marostegui.json [19:45:50] !log cdanis@deploy1002 
Started scap: Backport for [[gerrit:1026628|probenet: add magru measurement endpoint (T362902)]] [19:45:54] T362902: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902 [19:48:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:49:03] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncredir7001.magru.wmnet with OS bookworm [19:49:04] !log brett@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir7001.magru.wmnet [19:49:10] (03CR) 10Ecarg: "Cool, so this change will allow Prometheus to read the data?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026441 (https://phabricator.wikimedia.org/T350034) (owner: 10JMeybohm) [19:50:31] !log cdanis@deploy1002 cdanis: Backport for [[gerrit:1026628|probenet: add magru measurement endpoint (T362902)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:50:33] !log cdanis@deploy1002 cdanis: Continuing with sync [19:51:33] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [19:51:33] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2005 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [19:55:30] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9765182 (10Jclark-ctr) a:03Papaul [19:55:33] PROBLEM - Check whether ferm is active by checking the default input chain on mw2388 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:56:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T361627)', diff saved to https://phabricator.wikimedia.org/P61770 and previous config saved to /var/cache/conftool/dbconfig/20240502-195623-marostegui.json [19:56:27] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [19:56:35] PROBLEM - Check whether ferm is active by checking the default input chain on mw1483 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:56:35] PROBLEM - Check whether ferm is active by checking the default input chain on mw1385 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:56:59] uhh [19:57:14] (03PS6) 10Jdlrobson: Update wgVectorClientPrefs to wgVectorAppearance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023928 (https://phabricator.wikimedia.org/T362808) (owner: 10Bernard Wang) [19:57:23] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late 
backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240502T2000). Please do the needful. [20:00:05] jan_drewniak and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:21] two jans! [20:00:37] urbanecm: my scap backport is just finishing sync-apaches now, btw [20:00:45] ack [20:00:57] I thought 15 minutes would be enough [20:01:01] np [20:01:09] it's more of 20 those days :/ [20:01:18] * urbanecm misses the days when scap sync-file took less than a minute [20:01:52] jan_drewniak: hi! i presume you'll self-deploy (once c.danis's deployment finishes)? [20:02:15] urbanecm: yup that's right, just a few config changes today. [20:04:09] !log cdanis@deploy1002 Finished scap: Backport for [[gerrit:1026628|probenet: add magru measurement endpoint (T362902)]] (duration: 18m 19s) [20:04:12] T362902: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902 [20:04:37] all yours [20:05:45] cdanis: thanks! 
[20:06:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:07:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023928 (https://phabricator.wikimedia.org/T362808) (owner: 10Bernard Wang) [20:07:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026663 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdrewniak) [20:07:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:07:52] (03Merged) 10jenkins-bot: Update wgVectorClientPrefs to wgVectorAppearance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023928 (https://phabricator.wikimedia.org/T362808) (owner: 10Bernard Wang) [20:08:40] (03PS2) 10Jdrewniak: Deploy Vector appearance menu and increased font-size to plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026663 (https://phabricator.wikimedia.org/T362147) [20:08:50] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026663 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdrewniak) [20:09:03] 06SRE, 06Infrastructure-Foundations, 10netops, 10MW-1.43-notes (1.43.0-wmf.3; 2024-04-30), 13Patch-For-Review: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9765219 (10CDanis) 05Open→03Resolved [20:09:32] (03Merged) 10jenkins-bot: Deploy Vector appearance 
menu and increased font-size to plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026663 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdrewniak) [20:09:44] 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9765224 (10CDanis) [20:09:46] 06SRE, 06Infrastructure-Foundations, 10netops, 10MW-1.43-notes (1.43.0-wmf.3; 2024-04-30), 13Patch-For-Review: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9765222 (10CDanis) [20:09:49] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1023928|Update wgVectorClientPrefs to wgVectorAppearance (T362808)]], [[gerrit:1026663|Deploy Vector appearance menu and increased font-size to plwiki (T362147)]] [20:09:59] T362808: Rename "client preferences menu" to "appearances menu" - https://phabricator.wikimedia.org/T362808 [20:10:00] T362147: Deploy reading accessibility settings menu and new typography defaults to first set of wikis - https://phabricator.wikimedia.org/T362147 [20:11:31] RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:11:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P61771 and previous config saved to /var/cache/conftool/dbconfig/20240502-201131-marostegui.json [20:14:12] just fyi: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out [20:14:27] !log jdrewniak@deploy1002 bwang and jdrewniak: Backport for [[gerrit:1023928|Update wgVectorClientPrefs to wgVectorAppearance (T362808)]], [[gerrit:1026663|Deploy Vector appearance menu and increased font-size to plwiki (T362147)]] 
synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:14:38] (03PS1) 10BCornwall: magru: add ncredir7001 and ncredir7002 nodes [puppet] - 10https://gerrit.wikimedia.org/r/1026673 (https://phabricator.wikimedia.org/T362729) [20:15:33] jan_drewniak: https://phabricator.wikimedia.org/T362938#9764496 [20:15:52] cdanis: is mw2382 properly depooled? [20:20:08] (03PS1) 10BCornwall: lvs: Add ncredir7001/ncredir7002 (service_setup) [puppet] - 10https://gerrit.wikimedia.org/r/1026674 (https://phabricator.wikimedia.org/T362729) [20:21:25] !log jdrewniak@deploy1002 Sync cancelled. [20:21:47] jan_drewniak: I think you can just continue with the sync while I investigate that [20:22:11] the step that failed for me was docker_pull_k8s [20:22:23] but if the host isn't answering ssh there's no way that there's k8s pods running there [20:22:30] so scap errors but it's not actually harmful [20:23:03] cdanis: no problem, I think there was an issue with one of our config patches anyway, after testing on mwdebug on the testwiki the feature disappeared (not intended) but I don't think it's a server issue. 
[20:23:19] 👍 [20:23:23] cdanis: it's not even a k8s node now, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026446/ from jaym.e earlier today [20:24:19] (03PS1) 10Jdrewniak: Revert "Deploy Vector appearance menu and increased font-size to plwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026629 [20:24:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026629 (owner: 10Jdrewniak) [20:25:30] (03Merged) 10jenkins-bot: Revert "Deploy Vector appearance menu and increased font-size to plwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026629 (owner: 10Jdrewniak) [20:25:33] RECOVERY - Check whether ferm is active by checking the default input chain on mw2388 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:25:34] {"mw2382.codfw.wmnet": {"weight": 10, "pooled": "inactive"}, "tags": "dc=codfw,cluster=kubernetes,service=kubesvc"} [20:25:42] Since I tried deploying both at once, I'll revert the last patch, test, see if that works (if it does I'll sync, if not then I'll revert the second patch too). 
[20:25:45] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1026629|Revert "Deploy Vector appearance menu and increased font-size to plwiki"]] [20:25:45] it's set pooled=inactive so I'm not even sure why scap cares about it [20:26:35] RECOVERY - Check whether ferm is active by checking the default input chain on mw1483 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:26:35] RECOVERY - Check whether ferm is active by checking the default input chain on mw1385 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:26:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P61772 and previous config saved to /var/cache/conftool/dbconfig/20240502-202638-marostegui.json [20:27:23] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:30:22] !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:1026629|Revert "Deploy Vector appearance menu and increased font-size to plwiki"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:32:05] !log jdrewniak@deploy1002 Sync cancelled. [20:32:54] (03PS1) 10Jdrewniak: Revert "Update wgVectorClientPrefs to wgVectorAppearance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026630 [20:33:05] (03CR) 10Jdrewniak: [C:03+2] Revert "Update wgVectorClientPrefs to wgVectorAppearance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026630 (owner: 10Jdrewniak) [20:33:50] Ok, that first patch didn't work on testwiki so I'm reverting that too. 
[20:33:52] (03Merged) 10jenkins-bot: Revert "Update wgVectorClientPrefs to wgVectorAppearance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026630 (owner: 10Jdrewniak) [20:36:33] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [20:36:33] 2 commits, 2 reverts - essentially a no-op. I ran git pull on staging so it's clean. I guess that's it for me today. [20:39:17] (03CR) 10BCornwall: [C:03+2] magru: add ncredir configuration [puppet] - 10https://gerrit.wikimedia.org/r/1026526 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [20:40:51] (03Abandoned) 10BCornwall: lvs: Add ncredir7001/ncredir7002 (service_setup) [puppet] - 10https://gerrit.wikimedia.org/r/1026674 (https://phabricator.wikimedia.org/T362729) (owner: 10BCornwall) [20:40:55] (03Abandoned) 10BCornwall: magru: add ncredir7001 and ncredir7002 nodes [puppet] - 10https://gerrit.wikimedia.org/r/1026673 (https://phabricator.wikimedia.org/T362729) (owner: 10BCornwall) [20:41:42] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir7001.magru.wmnet with OS bookworm [20:41:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T361627)', diff saved to https://phabricator.wikimedia.org/P61773 and previous config saved to /var/cache/conftool/dbconfig/20240502-204146-marostegui.json [20:41:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1244.eqiad.wmnet with reason: Maintenance [20:41:49] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [20:42:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1244.eqiad.wmnet with reason: Maintenance [20:42:09] 
!log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T361627)', diff saved to https://phabricator.wikimedia.org/P61774 and previous config saved to /var/cache/conftool/dbconfig/20240502-204208-marostegui.json [20:47:37] 06SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9765291 (10CDanis) That sounds good to me @elukey . I don't think a new intermediate is needed. I think perhaps the main issue underlying this task was the misunderstanding... [20:50:42] FIRING: [38x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:51:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T361627)', diff saved to https://phabricator.wikimedia.org/P61775 and previous config saved to /var/cache/conftool/dbconfig/20240502-205105-marostegui.json [20:51:09] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [20:52:19] PROBLEM - PyBal connections to etcd on lvs7001 is CRITICAL: CRITICAL: 8 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [20:52:23] (03CR) 10EoghanGaffney: [C:03+1] devtools: update gerrit and phab instance names in default Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1026197 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [20:53:20] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Apply updated JDK 8 - eevans@cumin1002 [20:54:23] PROBLEM - PyBal connections to etcd on lvs7003 is CRITICAL: CRITICAL: 12 connections established with 
conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [20:54:23] PROBLEM - PyBal IPVS diff check on lvs7001 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:58:35] PROBLEM - PyBal IPVS diff check on lvs7003 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:02:00] (03PS1) 10JHathaway: puppetserver: change prometheus port from ipv6 to ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/1026681 (https://phabricator.wikimedia.org/T337970) [21:06:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P61776 and previous config saved to /var/cache/conftool/dbconfig/20240502-210613-marostegui.json [21:07:11] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026681 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [21:10:38] (03PS1) 10Andrew Bogott: puppetserver-deploy-code: bail out if current branch is not 'production' [puppet] - 10https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) [21:10:42] FIRING: [80x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:11:32] (03PS2) 10Andrew Bogott: puppetserver-deploy-code: add -force to g10k call to invoke purging [puppet] - 10https://gerrit.wikimedia.org/r/1025818 (https://phabricator.wikimedia.org/T364047) [21:12:28] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Apply updated JDK 8 - eevans@cumin1002 [21:13:48] (03PS2) 10JHathaway: puppetserver: change prometheus port from ipv6 to ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/1026681 
(https://phabricator.wikimedia.org/T337970) [21:14:02] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026681 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [21:15:51] (03CR) 10JHathaway: [C:03+2] puppetserver: change prometheus port from ipv6 to ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/1026681 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [21:19:25] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7001.magru.wmnet with reason: host reimage [21:21:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P61777 and previous config saved to /var/cache/conftool/dbconfig/20240502-212123-marostegui.json [21:22:18] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7001.magru.wmnet with reason: host reimage [21:26:32] (03PS1) 10Superzerocool: eswiki, commonswiki wikidatawiki: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026691 (https://phabricator.wikimedia.org/T364039) [21:36:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T361627)', diff saved to https://phabricator.wikimedia.org/P61778 and previous config saved to /var/cache/conftool/dbconfig/20240502-213631-marostegui.json [21:36:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [21:36:34] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [21:36:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [21:44:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on 
db1247.eqiad.wmnet with reason: Maintenance [21:44:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1247.eqiad.wmnet with reason: Maintenance [21:44:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T361627)', diff saved to https://phabricator.wikimedia.org/P61779 and previous config saved to /var/cache/conftool/dbconfig/20240502-214435-marostegui.json [21:44:39] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [21:46:04] (03PS1) 10Andrea Denisse: wmcs: Remove unnecesary kibana and kibana-discovery certificates [puppet] - 10https://gerrit.wikimedia.org/r/1026692 (https://phabricator.wikimedia.org/T360414) [21:48:55] (03CR) 10Andrea Denisse: "To be merged after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1025879" [puppet] - 10https://gerrit.wikimedia.org/r/1026692 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [21:55:31] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7001.magru.wmnet with OS bookworm [21:56:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T361627)', diff saved to https://phabricator.wikimedia.org/P61780 and previous config saved to /var/cache/conftool/dbconfig/20240502-215641-marostegui.json [21:56:45] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [22:04:54] (03CR) 10JHathaway: "The part I still don't grok is why we don't have this issue in prod. Why aren't there a bunch of extra directories?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1025818 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott) [22:07:05] (03CR) 10Dzahn: [C:03+2] devtools: update gerrit and phab instance names in default Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1026197 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [22:07:40] (03PS1) 10Andrea Denisse: ssl: Remove unnecessary dummy key for the kibana hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1026693 (https://phabricator.wikimedia.org/T360414) [22:08:14] (03CR) 10Andrea Denisse: "To be merged after the CFSSL migration." [labs/private] - 10https://gerrit.wikimedia.org/r/1026693 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [22:11:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P61781 and previous config saved to /var/cache/conftool/dbconfig/20240502-221149-marostegui.json [22:17:02] (03CR) 10Dzahn: "Which hosts would you point logs-api.disccovery.wmnet to? That still has to be created." 
[puppet] - 10https://gerrit.wikimedia.org/r/1025879 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [22:19:56] (03CR) 10Dzahn: [C:03+2] Automate quarterly Phabricator data for WMF QLS [puppet] - 10https://gerrit.wikimedia.org/r/1024348 (https://phabricator.wikimedia.org/T362804) (owner: 10Aklapper) [22:20:39] (03CR) 10Andrea Denisse: "Yes, they'll be added as part of https://phabricator.wikimedia.org/T356386" [puppet] - 10https://gerrit.wikimedia.org/r/1025879 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [22:22:32] (03CR) 10Dzahn: [V:03+2 C:03+2] delete bugzilla.wikimedia.org.key [labs/private] - 10https://gerrit.wikimedia.org/r/1026653 (owner: 10Dzahn) [22:26:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P61782 and previous config saved to /var/cache/conftool/dbconfig/20240502-222656-marostegui.json [22:28:59] (03CR) 10Dzahn: "and no more kibana.discovery.wmnet, kibaba.svc and kibana-next.svc records?" [puppet] - 10https://gerrit.wikimedia.org/r/1025879 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [22:30:00] (03CR) 10Dzahn: [C:03+1] "it matches what was said on the ticket which names are still needed and which aren't" [puppet] - 10https://gerrit.wikimedia.org/r/1025879 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [22:30:35] (03CR) 10Andrea Denisse: "They're marked as required by my team, that's why I added them. 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/1025879 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [22:31:51] (03CR) 10Dzahn: [C:03+1] "note: one of the few that explicitly state the "sni_support: yes"" [puppet] - 10https://gerrit.wikimedia.org/r/1025879 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [22:37:51] (03CR) 10Dzahn: [V:03+1 C:03+2] "I ran this once with an edited recipient list, just Andre and myself, and got mail with content. then puppet adjust the recipient list aga" [puppet] - 10https://gerrit.wikimedia.org/r/1024348 (https://phabricator.wikimedia.org/T362804) (owner: 10Aklapper) [22:42:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T361627)', diff saved to https://phabricator.wikimedia.org/P61783 and previous config saved to /var/cache/conftool/dbconfig/20240502-224204-marostegui.json [22:42:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1248.eqiad.wmnet with reason: Maintenance [22:42:07] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [22:42:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1248.eqiad.wmnet with reason: Maintenance [22:42:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T361627)', diff saved to https://phabricator.wikimedia.org/P61784 and previous config saved to /var/cache/conftool/dbconfig/20240502-224227-marostegui.json [22:43:25] (03CR) 10Andrew Bogott: "I think the difference for the specific issue I'm seeing is explained in the attached task; the more general case is that wmcs puppetserve" [puppet] - 10https://gerrit.wikimedia.org/r/1025818 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott) [22:43:59] !log eevans@cumin1002 END (PASS) - Cookbook 
sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Apply updated JDK 8 - eevans@cumin1002 [22:46:01] (PS1) Dzahn: cloud/devtools: replace deploy-1004 with deploy-1006 [puppet] - https://gerrit.wikimedia.org/r/1026698 (https://phabricator.wikimedia.org/T360964) [22:49:26] (PS2) Andrew Bogott: puppetserver-deploy-code: bail out if current branch is not 'production' [puppet] - https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) [22:50:06] (PS3) Andrew Bogott: puppetserver-deploy-code: bail out if current branch is not 'production' [puppet] - https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) [22:54:03] (PS1) EoghanGaffney: apt-staging: Add timer for gitlab package puller job [puppet] - https://gerrit.wikimedia.org/r/1026699 (https://phabricator.wikimedia.org/T347004) [22:55:42] (CR) EoghanGaffney: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2235/co" [puppet] - https://gerrit.wikimedia.org/r/1026699 (https://phabricator.wikimedia.org/T347004) (owner: EoghanGaffney) [22:58:54] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:59:06] (CR) Dzahn: [C:+2] cloud/devtools: replace deploy-1004 with deploy-1006 [puppet] - https://gerrit.wikimedia.org/r/1026698 (https://phabricator.wikimedia.org/T360964) (owner: Dzahn) [23:02:32] ops-eqiad, SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9765644 (Jclark-ctr) @akosiaris idrac has stayed up for 4 days now possibly me relocating to a different port helped it. We wont know until it is put in use again.
this server is out of warranty if it f... [23:14:41] (PS2) Superzerocool: eswiki, commonswiki wikidatawiki: lift IP cap for edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/1026691 (https://phabricator.wikimedia.org/T364039) [23:15:14] (CR) Superzerocool: [C:+1] eswiki, commonswiki wikidatawiki: lift IP cap for edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/1026691 (https://phabricator.wikimedia.org/T364039) (owner: Superzerocool) [23:20:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T361627)', diff saved to https://phabricator.wikimedia.org/P61785 and previous config saved to /var/cache/conftool/dbconfig/20240502-232037-marostegui.json [23:20:41] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [23:33:41] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Apply updated JDK 8 - eevans@cumin1002 [23:35:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P61786 and previous config saved to /var/cache/conftool/dbconfig/20240502-233545-marostegui.json [23:38:29] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1025915 [23:38:29] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1025915 (owner: TrainBranchBot) [23:48:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:50:55] !log marostegui@cumin1002 dbctl commit (dc=all):
'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P61787 and previous config saved to /var/cache/conftool/dbconfig/20240502-235053-marostegui.json