[00:04:33] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072928 (owner: TrainBranchBot)
[00:10:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:15:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[00:55:08] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10147257 (phaultfinder)
[01:10:13] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10147258 (phaultfinder)
[02:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:39:13] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:13] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:36:56] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:36:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:37:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.311 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:37:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52630 bytes in 0.420 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1125.eqiad.wmnet with reason: testing node
[06:05:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1125.eqiad.wmnet with reason: testing node
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:24:03] !log installing git security updates
[06:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 217, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:40:52] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:41:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 217, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:42:32] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 39 probes of 770 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:47:34] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 33 probes of 770 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:54:52] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:55:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:59:13] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:43] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10147386 (Vgutierrez) gentle reminder, this is still waiting for @VPuffetMichel approval
[07:15:00] (PS1) Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1073035 (https://phabricator.wikimedia.org/T374804)
[07:16:52] (PS1) Gerrit maintenance bot: mariadb: Promote db2123 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1073037 (https://phabricator.wikimedia.org/T374805)
[07:17:18] (PS1) Gerrit maintenance bot: mariadb: Promote db2129 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1073038 (https://phabricator.wikimedia.org/T374806)
[07:17:20] (CR) Filippo Giunchedi: [C:+2] icinga: remove frban2001 for decommissioning [puppet] - https://gerrit.wikimedia.org/r/1072813 (https://phabricator.wikimedia.org/T374741) (owner: Dwisehaupt)
[07:18:00] (PS1) Gerrit maintenance bot: mariadb: Promote db2220 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1073039 (https://phabricator.wikimedia.org/T374807)
[07:22:45] ops-codfw, SRE, DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374808 (ops-monitoring-bot) NEW
[07:23:02] ops-codfw, SRE, collaboration-services, DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10147444 (ABran-WMF) [] db2129: cm s6 T374806→switchback [] db2140: m s4 T374804 [] db2218: m s7 T374807
[07:23:54] !log elukey@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw
[07:23:57] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10147440 (ABran-WMF) [] db2213: m s5 T374805 [] db2214: m s6 T374806
[07:24:10] ops-codfw, SRE, SRE-swift-storage, collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10147456 (ABran-WMF) [] db2220: cm s7 T374807→switchback
[07:27:51] (PS1) Elukey: Set puppet7 for chartmuseum2001 [puppet] - https://gerrit.wikimedia.org/r/1073107 (https://phabricator.wikimedia.org/T331969)
[07:29:49] (CR) Elukey: [C:+2] services: update thumbor-eqiad to poolcounter1006 [deployment-charts] - https://gerrit.wikimedia.org/r/1072716 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[07:33:04] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync
[07:33:09] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[07:33:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s5 T374805
[07:33:20] T374805: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T374805
[07:33:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T374805
[07:35:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2123 from API/vslow/dump T374805', diff saved to https://phabricator.wikimedia.org/P69126 and previous config saved to /var/cache/conftool/dbconfig/20240916-073521-arnaudb.json
[07:36:10] (CR) Brouberol: flink-app: customize calico label selector (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: Bking)
[07:39:50] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[07:40:28] (CR) Elukey: [C:+2] Set puppet7 for chartmuseum2001 [puppet] - https://gerrit.wikimedia.org/r/1073107 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[07:40:30] (CR) Arnaudb: [C:+2] mariadb: Promote db2123 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1073037 (https://phabricator.wikimedia.org/T374805) (owner: Gerrit maintenance bot)
[07:41:02] go for it elukey
[07:41:06] arnaudb: ack!
[07:42:12] neat
[07:42:19] thanks
[07:42:34] !log Starting s5 codfw failover from db2213 to db2123 - T374805
[07:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:38] T374805: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T374805
[07:43:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2123 to s5 primary T374805', diff saved to https://phabricator.wikimedia.org/P69128 and previous config saved to /var/cache/conftool/dbconfig/20240916-074312-arnaudb.json
[07:43:32] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[07:45:10] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host chartmuseum2001.codfw.wmnet with OS bookworm
[07:45:23] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969#10147504 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm
[07:47:39] (CR) Elukey: [V:+2 C:+2] Update the Debian changelog to build on Bookworm [debs/chartmuseum] - https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[07:48:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s6 T374806
[07:48:44] T374806: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T374806
[07:49:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s6 T374806
[07:51:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 T374805', diff saved to https://phabricator.wikimedia.org/P69129 and previous config saved to /var/cache/conftool/dbconfig/20240916-075059-arnaudb.json
[07:51:05] T374805: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T374805
[07:53:50] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147518 (MoritzMuehlenhoff)
[07:54:08] (CR) CI reject: [V:-1] Update the Debian changelog to build on Bookworm [debs/chartmuseum] - https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[07:54:13] FIRING: [2x] JobUnavailable: Reduced availability for job chartmuseum in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:56:49] (CR) Muehlenhoff: [C:+2] CAS: Disable memcached on idp-test [puppet] - https://gerrit.wikimedia.org/r/1070899 (https://phabricator.wikimedia.org/T367487) (owner: Muehlenhoff)
[07:59:37] (CR) Arnaudb: [C:+2] mariadb: Promote db2129 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1073038 (https://phabricator.wikimedia.org/T374806) (owner: Gerrit maintenance bot)
[08:01:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2129 to s6 primary T374806', diff saved to https://phabricator.wikimedia.org/P69130 and previous config saved to /var/cache/conftool/dbconfig/20240916-080132-arnaudb.json
[08:01:37] T374806: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T374806
[08:03:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 T374806', diff saved to https://phabricator.wikimedia.org/P69131 and previous config saved to /var/cache/conftool/dbconfig/20240916-080342-arnaudb.json
[08:07:10] (CR) MVernon: [C:-1] "Hi," [puppet] - https://gerrit.wikimedia.org/r/1071609 (owner: Muehlenhoff)
[08:08:41] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:08:50] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147562 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm executed with errors: - chartmuseum2001 (...
[08:09:09] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:09:20] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147563 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:12:37] (PS1) Filippo Giunchedi: thanos: trim 5m retention to 35w [puppet] - https://gerrit.wikimedia.org/r/1073147 (https://phabricator.wikimedia.org/T351927)
[08:13:29] (CR) DCausse: flink-app: customize calico label selector (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: Bking)
[08:13:45] SRE-swift-storage, Commons: 404 error opening a specific file on Commons - https://phabricator.wikimedia.org/T374773#10147579 (KTT-Commons) Update: As of 16:00 UTC+8, I can now access the file without problem in Hong Kong. Will like to hear if anyone elsewhere still has trouble in accessing the file?
[08:14:11] (CR) Filippo Giunchedi: [C:+2] thanos: trim 5m retention to 35w [puppet] - https://gerrit.wikimedia.org/r/1073147 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi)
[08:14:45] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:14:57] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147582 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm executed with errors: - chartmuseum2001 (...
[08:15:07] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:15:18] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147583 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:18:53] (CR) Volans: [C:+2] test-cookbook: read spicerack config with sudo [puppet] - https://gerrit.wikimedia.org/r/1071810 (owner: Volans)
[08:19:13] FIRING: [2x] JobUnavailable: Reduced availability for job chartmuseum in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:19:22] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:19:32] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147629 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm executed with errors: - chartmuseum2001 (...
[08:24:58] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:25:13] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147634 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:33:04] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:33:20] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147647 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm executed with errors: - chartmuseum2001 (...
[08:37:10] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:37:22] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147668 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:38:16] (Abandoned) Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[08:38:33] (Abandoned) Btullis: Add some test secrets to an-test-master servers [puppet] - https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[08:38:37] !log bump memory allocation of chartmuseum1001/2001 to 2G (Bookworm fails to install with just 1G) T331969
[08:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:40] T331969: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969
[08:39:09] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147672 (elukey) Due to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1035854, the VM's RAM was bumped to 2G.
[08:45:26] (PS1) Santiago Faci: MPIC: New deployment (v0.1.5) to production [deployment-charts] - https://gerrit.wikimedia.org/r/1073152 (https://phabricator.wikimedia.org/T361346)
[08:53:00] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on chartmuseum2001.codfw.wmnet with reason: host reimage
[08:53:18] (PS1) Gerrit maintenance bot: Add kge to langlist helper [dns] - https://gerrit.wikimedia.org/r/1073154 (https://phabricator.wikimedia.org/T374813)
[08:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:55:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on chartmuseum2001.codfw.wmnet with reason: host reimage
[08:56:04] jouncebot: next
[08:56:04] In 1 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1000)
[08:57:11] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10147815 (ArthurTaylor) I'm happy to use `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5...
[08:57:21] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10147819 (phaultfinder)
[08:57:25] SRE, Infrastructure-Foundations, netops: Enable BFD on 'core' EBGP peerings from L3 switches to CRs - https://phabricator.wikimedia.org/T374452#10147824 (ayounsi) Not sure it's worth it for direct (short) links. The tradeoff is to rely on an extra protocol, extra config, and adding load on the device...
[08:58:05] SRE-swift-storage, Commons: 404 error opening a specific file on Commons - https://phabricator.wikimedia.org/T374773#10147831 (MatthewVernon) Open→Resolved a:MatthewVernon I've confirmed that both eqiad and codfw swift clusters have this object. They arrived at different times, however:...
[09:01:53] (CR) DCausse: [C:+1] Add ORKG triplestore to WDQS federation allowlist [puppet] - https://gerrit.wikimedia.org/r/1072723 (https://phabricator.wikimedia.org/T366485) (owner: Btullis)
[09:02:34] SRE, Infrastructure-Foundations, netops: ToR server-move Netbox script adding ".0" to end of interface names - https://phabricator.wikimedia.org/T374024#10147850 (ayounsi) @cmooney thanks for your patch ! is there something left to do on this ?
[09:03:14] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[09:03:26] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrading mariadb on clouddb1013 T365424
[09:03:30] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[09:03:42] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrading mariadb on clouddb1013 T365424
[09:03:56] (CR) Elukey: [C:+2] services: add new poolcounter nodes to MW configs [deployment-charts] - https://gerrit.wikimedia.org/r/1072717 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[09:04:06] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s1
[09:04:08] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3
[09:09:00] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10147869 (elukey) @Jhancock.wm you are totally right, thanks a lot! I was able to force PXE on a 10G port setting the the first `RSC-W-66G4` option to `Legacy`. I hope...
[09:09:06] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1013.eqiad.wmnet
[09:12:24] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1013.eqiad.wmnet
[09:12:30] PROBLEM - Host clouddb1013 is DOWN: PING CRITICAL - Packet loss = 100%
[09:12:36] RECOVERY - Host clouddb1013 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms
[09:12:42] PROBLEM - mysqld processes on clouddb1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:12:46] PROBLEM - MariaDB Replica IO: s1 on clouddb1013 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:13:14] PROBLEM - MariaDB Replica IO: s3 on clouddb1013 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:13:14] PROBLEM - MariaDB Replica SQL: s1 on clouddb1013 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:13:30] PROBLEM - MariaDB Replica SQL: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:13:32] PROBLEM - MariaDB read only s1 on clouddb1013 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:13:32] PROBLEM - MariaDB read only wikireplica-s1 on clouddb1013 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:13:34] PROBLEM - MariaDB read only s3 on clouddb1013 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:13:34] PROBLEM - MariaDB read only wikireplica-s3 on clouddb1013 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:14:26] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[09:15:14] RECOVERY - MariaDB Replica SQL: s1 on clouddb1013 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:15:34] RECOVERY - MariaDB read only s1 on clouddb1013 is OK: Version 10.6.19-MariaDB, Uptime 51s, read_only: True, event_scheduler: False, 2013.38 QPS, connection latency: 0.017547s, query latency: 0.000476s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:15:34] RECOVERY - MariaDB read only wikireplica-s1 on clouddb1013 is OK: Version 10.6.19-MariaDB, Uptime 51s, read_only: True, event_scheduler: False, 2036.61 QPS, connection latency: 0.028063s, query latency: 0.000533s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:15:46] RECOVERY - MariaDB Replica IO: s1 on clouddb1013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:16:14] RECOVERY - MariaDB Replica IO: s3 on clouddb1013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:16:32] RECOVERY - MariaDB Replica SQL: s3 on clouddb1013 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:16:36] RECOVERY - MariaDB read only wikireplica-s3 on clouddb1013 is OK: Version 10.6.19-MariaDB, Uptime 46s, read_only: True, event_scheduler: False, 202.07 QPS, connection latency: 0.019260s, query latency: 0.000501s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:16:36] RECOVERY - MariaDB read only s3 on clouddb1013 is OK: Version 10.6.19-MariaDB, Uptime 46s, read_only: True, event_scheduler: False, 200.30 QPS, connection latency: 0.025202s, query latency: 0.000572s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:16:48] RECOVERY - mysqld processes on clouddb1013 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:17:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[09:19:00] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s3
[09:19:09] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s1
[09:21:15] (PS3) Hamish: Configure ContactPage and IPBE contact form on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1072876 (https://phabricator.wikimedia.org/T359998)
[09:21:35] !log copy python3-docker-report from bullseye-wikimedia to bookworm-wikimedia
[09:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:47] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s2
[09:21:50] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s7
[09:22:15] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Upgrading mariadb on clouddb1014 T365424
[09:22:19] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[09:22:30] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Upgrading mariadb on clouddb1014 T365424
[09:25:49] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10147910 (Ladsgroup) That ssh key is your production key not WMCS.
[09:26:06] (PS1) Elukey: debian: update the target distribution to bookworm-wikimedia [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969)
[09:26:11] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1014.eqiad.wmnet
[09:26:39] (CR) Volans: "Inline the 3 quick changes needed to test it on test-s4 as it's not part of the CORE_SECTIONS" [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[09:26:47] SRE, Infrastructure-Foundations, netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#10147916 (ayounsi) Awesome, great to see progress here !
[09:28:31] (PS2) Elukey: debian: update the target distribution to bookworm-wikimedia [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969)
[09:29:29] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1014.eqiad.wmnet
[09:29:34] PROBLEM - Host clouddb1014 is DOWN: PING CRITICAL - Packet loss = 100%
[09:29:34] RECOVERY - Host clouddb1014 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms
[09:29:50] PROBLEM - MariaDB Replica SQL: s7 on clouddb1014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:29:50] PROBLEM - MariaDB Replica IO: s2 on clouddb1014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:30:11] PROBLEM - MariaDB Replica IO: s7 on clouddb1014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:30:11] PROBLEM - MariaDB Replica SQL: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:31:12] RECOVERY - MariaDB Replica SQL: s2 on clouddb1014 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:31:18] SRE, Infrastructure-Foundations, netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#10147929 (aborrero) Open→Resolved a:aborrero it seems there is agreement in the addressing plan. Marking as resolved, will work on {T374712} next.
[09:31:34] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10147944 (ArthurTaylor) Yup. I have it noted as my production key. I don't...
[09:31:50] RECOVERY - MariaDB Replica IO: s2 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:31:50] RECOVERY - MariaDB Replica SQL: s7 on clouddb1014 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:32:12] RECOVERY - MariaDB Replica IO: s7 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:32:29] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002"
[09:34:20] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s7
[09:34:25] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s2
[09:35:09] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Upgrading mariadb on clouddb1015 T365424
[09:35:13] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[09:35:24] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Upgrading mariadb on clouddb1015 T365424
[09:35:29] (CR) CI reject: [V:-1] debian: update the target distribution to bookworm-wikimedia [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[09:36:27] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s4
[09:36:30] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s6
[09:38:16] (CR) Ladsgroup: [C:+1] "I'll deploy it eventually. ooo right now." [mediawiki-config] - https://gerrit.wikimedia.org/r/1071712 (owner: Bartosz Dziewoński)
[09:40:22] (CR) JMeybohm: [C:+1] "fine to ignore lintian IMHO" [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[09:42:01] SRE, Infrastructure-Foundations, netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10147994 (ayounsi) Short term I think if you add `[4Gbps]` to the interface description, LibreNMS will [[ https://docs.librenms.org/Extensions/Interface-Descript...
[09:42:44] !log upload helm3 3.11.3-2 to bookworm-wikimedia
[09:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:15] SRE-swift-storage, Commons: 404 error opening a specific file on Commons (due to inconsistent state between two swift clusters) - https://phabricator.wikimedia.org/T374773#10147996 (Aklapper)
[09:44:33] (CR) Ladsgroup: [C:+2] Add kge to langlist helper [dns] - https://gerrit.wikimedia.org/r/1073154 (https://phabricator.wikimedia.org/T374813) (owner: Gerrit maintenance bot)
[09:45:55] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10148001 (hashar) >>! In T373969#10147944, @ArthurTaylor wrote: > Yup. I h...
[09:47:03] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1015.eqiad.wmnet
[09:49:10] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host chartmuseum2001.codfw.wmnet with OS bookworm
[09:49:19] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10148056 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm completed: - chartmuseum2001 (**PASS**)...
[09:50:28] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1015.eqiad.wmnet
[09:50:36] PROBLEM - MariaDB Replica IO: s6 on clouddb1015 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:50:36] PROBLEM - MariaDB Replica SQL: s6 on clouddb1015 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:50:44] PROBLEM - mysqld processes on clouddb1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:50:44] PROBLEM - MariaDB read only wikireplica-s4 on clouddb1015 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:50:44] PROBLEM - MariaDB read only s4 on clouddb1015 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:50:44] PROBLEM - MariaDB read only s6 on clouddb1015 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:50:44] PROBLEM - MariaDB read only wikireplica-s6 on clouddb1015 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:50:50] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:51:16] PROBLEM - MariaDB Replica SQL: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:51:16] PROBLEM - MariaDB Replica Lag: s6 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:51:16] PROBLEM - MariaDB Replica IO: s4 on clouddb1015 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:16] RECOVERY - MariaDB Replica SQL: s4 on clouddb1015 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:16] RECOVERY - MariaDB Replica IO: s4 on clouddb1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:27] !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw
[09:56:36] RECOVERY - MariaDB Replica SQL: s6 on clouddb1015 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:36] RECOVERY - MariaDB Replica IO: s6 on clouddb1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:44] RECOVERY - mysqld processes on clouddb1015 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:56:44] RECOVERY - MariaDB read only s4 on clouddb1015 is OK: Version 10.6.19-MariaDB, Uptime 52s, read_only: True, event_scheduler: False, 409.16 QPS, connection latency: 0.020505s, query latency: 0.000507s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:56:45] RECOVERY - MariaDB read only s6 on clouddb1015 is OK: Version 10.6.19-MariaDB, Uptime 48s, read_only: True, event_scheduler: False, 1426.47 QPS, connection latency: 0.028046s, query latency: 0.000615s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:56:45] RECOVERY - MariaDB read only wikireplica-s6 on clouddb1015 is OK: Version 10.6.19-MariaDB, Uptime 48s, read_only: True, event_scheduler: False, 1433.62 QPS, connection latency: 0.017873s, query latency: 0.000433s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:56:45] RECOVERY - MariaDB read only wikireplica-s4 on clouddb1015 is OK: Version 10.6.19-MariaDB, Uptime 52s, read_only: True, event_scheduler: False, 457.05 QPS, connection latency: 0.030524s, query latency: 0.000555s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:57:16] RECOVERY - MariaDB Replica Lag: s6 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:59:50] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1000)
[10:00:07] !log elukey@deploy1003 Started scap sync-world: Update network policies to allow the new poolcounter vms.
[10:03:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:26] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s6
[10:03:29] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s4
[10:03:32] !log elukey@deploy1003 Finished scap sync-world: Update network policies to allow the new poolcounter vms. (duration: 04m 35s)
[10:05:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:05:36] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s5
[10:05:40] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s8
[10:06:09] (PS1) Slyngshede: Grant permissions: Hookup LDAP permission granting. [software/bitu] - https://gerrit.wikimedia.org/r/1073162
[10:06:16] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1016 T365424
[10:06:19] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[10:06:31] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1016 T365424
[10:07:43] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Cyndywikime - https://phabricator.wikimedia.org/T374595#10148139 (Vgutierrez) Open→Stalled idp configuration states that `wmf` membership is enough to access superset (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/re...
[10:08:47] (CR) CI reject: [V:-1] Grant permissions: Hookup LDAP permission granting. [software/bitu] - https://gerrit.wikimedia.org/r/1073162 (owner: Slyngshede)
[10:09:18] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Cyndywikime - https://phabricator.wikimedia.org/T374595#10148146 (Vgutierrez) a: Vgutierrez
[10:09:36] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Cyndywikime - https://phabricator.wikimedia.org/T374595#10148149 (Vgutierrez) a: Vgutierrez→Cyndymediawiksim
[10:15:56] (PS1) Arturo Borrero Gonzalez: openstack: codfw1dev: use VXLAN network as the new default for instance launch [puppet] - https://gerrit.wikimedia.org/r/1073163 (https://phabricator.wikimedia.org/T374824)
[10:18:48] (PS1) Elukey: services: remove old poolcounter netpolicies for Thumbor [deployment-charts] - https://gerrit.wikimedia.org/r/1073164 (https://phabricator.wikimedia.org/T332015)
[10:19:51] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10148243 (elukey) The reimage of 2001 went fine, I just repooled it. Let's wait for a day before moving to 1001 so if anything weird comes up, we'll have a quick way to fix (depool 2001). N...
[10:20:35] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10148252 (elukey) a: jhathaway→elukey
[10:22:34] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-analytics-test [puppet] - https://gerrit.wikimedia.org/r/1073166 (https://phabricator.wikimedia.org/T374396)
[10:22:35] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-analytics-product [puppet] - https://gerrit.wikimedia.org/r/1073167 (https://phabricator.wikimedia.org/T374396)
[10:22:36] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-platform-eng [puppet] - https://gerrit.wikimedia.org/r/1073168 (https://phabricator.wikimedia.org/T374396)
[10:22:38] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-research [puppet] - https://gerrit.wikimedia.org/r/1073169 (https://phabricator.wikimedia.org/T374396)
[10:22:40] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-search [puppet] - https://gerrit.wikimedia.org/r/1073170 (https://phabricator.wikimedia.org/T374396)
[10:22:41] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-wmde [puppet] - https://gerrit.wikimedia.org/r/1073171 (https://phabricator.wikimedia.org/T374396)
[10:22:43] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-analytics [puppet] - https://gerrit.wikimedia.org/r/1073172 (https://phabricator.wikimedia.org/T374396)
[10:22:44] (PS1) Brouberol: Install airflow-dags 2.9.3-py3.10-20240916 by default on all instances [puppet] - https://gerrit.wikimedia.org/r/1073173 (https://phabricator.wikimedia.org/T374396)
[10:23:22] (PS2) Arturo Borrero Gonzalez: openstack: codfw1dev: use VXLAN network as the new default for instance launch [puppet] - https://gerrit.wikimedia.org/r/1073163 (https://phabricator.wikimedia.org/T374824)
[10:23:30] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1073163 (https://phabricator.wikimedia.org/T374824) (owner: Arturo Borrero Gonzalez)
[10:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[10:24:34] (CR) Btullis: "> Encryption is something that is ensured at the s3 storage level..." [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[10:24:50] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-analytics-test [puppet] - https://gerrit.wikimedia.org/r/1073166 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:25:22] (CR) Btullis: [C:+1] "Looks good, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1072546 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[10:25:51] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1016.eqiad.wmnet
[10:27:27] (CR) Hnowlan: [C:+1] services: remove old poolcounter netpolicies for Thumbor [deployment-charts] - https://gerrit.wikimedia.org/r/1073164 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[10:27:42] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-analytics-product [puppet] - https://gerrit.wikimedia.org/r/1073167 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:27:52] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-platform-eng [puppet] - https://gerrit.wikimedia.org/r/1073168 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:28:03] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-research [puppet] - https://gerrit.wikimedia.org/r/1073169 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:28:16] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-search [puppet] - https://gerrit.wikimedia.org/r/1073170 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:28:42] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-wmde [puppet] - https://gerrit.wikimedia.org/r/1073171 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:28:55] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-analytics [puppet] - https://gerrit.wikimedia.org/r/1073172 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:29:05] (PS6) Brouberol: cloudnative-pg-cluster: enable wal upload / backups to s3 by default [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281)
[10:29:05] (PS2) Brouberol: cloudnative-pg-cluster: setup good defaults allowing a cluster to be restored [deployment-charts] - https://gerrit.wikimedia.org/r/1072546 (https://phabricator.wikimedia.org/T372281)
[10:29:09] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1016.eqiad.wmnet
[10:29:20] PROBLEM - Host clouddb1016 is DOWN: PING CRITICAL - Packet loss = 100%
[10:29:25] (CR) Brouberol: cloudnative-pg-cluster: enable wal upload / backups to s3 by default (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[10:29:30] RECOVERY - Host clouddb1016 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms
[10:29:30] PROBLEM - mysqld processes on clouddb1016 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:29:38] PROBLEM - MariaDB Replica Lag: s8 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:29:44] PROBLEM - MariaDB read only s8 on clouddb1016 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:29:44] PROBLEM - MariaDB read only wikireplica-s8 on clouddb1016 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:29:44] PROBLEM - MariaDB read only s5 on clouddb1016 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:29:44] PROBLEM - MariaDB read only wikireplica-s5 on clouddb1016 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:29:51] (CR) Stevemunene: [C:+1] Install airflow-dags 2.9.3-py3.10-20240916 by default on all instances [puppet] - https://gerrit.wikimedia.org/r/1073173 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:29:52] PROBLEM - MariaDB Replica SQL: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:29:52] PROBLEM - MariaDB Replica IO: s5 on clouddb1016 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:30:20] PROBLEM - MariaDB Replica SQL: s8 on clouddb1016 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:30:20] PROBLEM - MariaDB Replica IO: s8 on clouddb1016 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:30:20] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:30:30] RECOVERY - mysqld processes on clouddb1016 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:30:44] RECOVERY - MariaDB read only wikireplica-s5 on clouddb1016 is OK: Version 10.6.19-MariaDB, Uptime 22s, read_only: True, event_scheduler: False, 418.00 QPS, connection latency: 0.011351s, query latency: 0.000283s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:30:44] RECOVERY - MariaDB read only s5 on clouddb1016 is OK: Version 10.6.19-MariaDB, Uptime 22s, read_only: True, event_scheduler: False, 429.81 QPS, connection latency: 0.028775s, query latency: 0.000484s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:30:44] RECOVERY - MariaDB read only s8 on clouddb1016 is OK: Version 10.6.19-MariaDB, Uptime 19s, read_only: True, event_scheduler: False, 28.05 QPS, connection latency: 0.029275s, query latency: 0.000401s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:30:44] RECOVERY - MariaDB read only wikireplica-s8 on clouddb1016 is OK: Version 10.6.19-MariaDB, Uptime 19s, read_only: True, event_scheduler: False, 46.64 QPS, connection latency: 0.018226s, query latency: 0.000621s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:30:52] RECOVERY - MariaDB Replica IO: s5 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:30:52] RECOVERY - MariaDB Replica SQL: s5 on clouddb1016 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:31:20] RECOVERY - MariaDB Replica SQL: s8 on clouddb1016 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:31:20] RECOVERY - MariaDB Replica IO: s8 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:32:20] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:32:38] RECOVERY - MariaDB Replica Lag: s8 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:33:45] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s8
[10:33:48] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s5
[10:35:23] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10148360 (phaultfinder)
[10:36:34] (CR) Brouberol: [C:+2] Upgrade the airflow-dags deb in airflow-analytics-test [puppet] - https://gerrit.wikimedia.org/r/1073166 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:39:22] (PS1) Kevin Bazira: ml-services: update rec-api image in staging and prod [deployment-charts] - https://gerrit.wikimedia.org/r/1073176 (https://phabricator.wikimedia.org/T371515)
[10:47:23] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10148384 (elukey) I checked for `RSC` in the dump that I made from Redfish, and I see the following: ` "RSC_WR_6SLOT1PCI_E4_0X16OPROM": "EFI", "RSC_W_66G4SLOT1PCI_E4_...
[10:48:09] SRE, Infrastructure-Foundations, netops, Traffic: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#10148385 (ayounsi) Note that the Bird exporter is already up and running: https://grafana.wikimedia.org/d/dxbfeGDZk/anycast We could in theory correl...
[10:49:38] (CR) Muehlenhoff: [C:+1] "LGTM, indeed okay to ignore the failure" [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[10:50:19] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1349.eqiad.wmnet
[10:50:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM chartmuseum1001.eqiad.wmnet
[10:55:18] I’m seeing “Could not resolve host: gerrit.wikimedia.org” errors in various CI jobs (not always but too often for my comfort), were there any DNS changes recently? T374830
[10:55:19] T374830: Various CI jobs failing with: Could not resolve host: gerrit.wikimedia.org - https://phabricator.wikimedia.org/T374830
[10:56:09] (CR) Btullis: [V:+1 C:+2] Add ORKG triplestore to WDQS federation allowlist [puppet] - https://gerrit.wikimedia.org/r/1072723 (https://phabricator.wikimedia.org/T366485) (owner: Btullis)
[10:56:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1349.eqiad.wmnet
[10:59:13] FIRING: [2x] JobUnavailable: Reduced availability for job chartmuseum in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:59:38] (CR) Btullis: [C:+1] "Great!. Thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[11:00:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM chartmuseum1001.eqiad.wmnet
[11:03:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:05:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:07:18] (PS2) Slyngshede: Grant permissions: Hookup LDAP permission granting. [software/bitu] - https://gerrit.wikimedia.org/r/1073162
[11:08:06] (CR) Arturo Borrero Gonzalez: [C:+2] openstack: codfw1dev: use VXLAN network as the new default for instance launch [puppet] - https://gerrit.wikimedia.org/r/1073163 (https://phabricator.wikimedia.org/T374824) (owner: Arturo Borrero Gonzalez)
[11:21:24] (CR) EoghanGaffney: [C:+1] aphlict: limit envoy srange to CACHES [puppet] - https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[11:21:26] (PS1) Muehlenhoff: Install a NOTICE file [debs/chartmuseum] - https://gerrit.wikimedia.org/r/1073183 (https://phabricator.wikimedia.org/T331969)
[11:22:52] (CR) CI reject: [V:-1] Install a NOTICE file [debs/chartmuseum] - https://gerrit.wikimedia.org/r/1073183 (https://phabricator.wikimedia.org/T331969) (owner: Muehlenhoff)
[11:23:43] SRE, Infrastructure-Foundations, netops: netbox: create IPv6 entries for Cloud VPS - https://phabricator.wikimedia.org/T374712#10148509 (aborrero) p: Triage→Medium
[11:23:56] (CR) Muehlenhoff: [C:-1] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072690 needs to be merged first (and this patch updated to use it)" [puppet] - https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[11:26:46] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10148515 (MoritzMuehlenhoff) >>! In T331969#10148243, @elukey wrote: > The reimage of 2001 went fine, I just repooled it. Let's wait for a day before moving to 1001 so if anything weird come...
[11:29:48] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Cyndywikime - https://phabricator.wikimedia.org/T374595#10148518 (Cyndymediawiksim) Hi @Vgutierrez , yes am having issues accessing superset on https://superset.wikimedia.org. See attached image below : {F57514746}
[11:32:41] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10148524 (ABran-WMF) all hosts are depoolable for this task
[11:33:56] (CR) Muehlenhoff: [C:+2] puppetserver: Pass the value of puppet_merge_server [puppet] - https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: Muehlenhoff)
[11:44:30] (PS1) EoghanGaffney: lists: Roll out nftables on both list hosts [puppet] - https://gerrit.wikimedia.org/r/1073189
[11:46:44] (CR) EoghanGaffney: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1073189 (owner: EoghanGaffney)
[11:47:35] (CR) Muehlenhoff: [C:+1] "Looks good. Note that lists1004 needs to be rebooted to fully effect the change." [puppet] - https://gerrit.wikimedia.org/r/1073189 (owner: EoghanGaffney)
[11:47:54] PROBLEM - Router interfaces on cr1-magru is CRITICAL: CRITICAL: host 195.200.68.128, interfaces up: 47, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:48:10] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:48:48] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:49:27] (CR) Ilias Sarantopoulos: [C:+1] ml-services: update rec-api image in staging and prod [deployment-charts] - https://gerrit.wikimedia.org/r/1073176 (https://phabricator.wikimedia.org/T371515) (owner: Kevin Bazira)
[11:56:10] (PS1) Hnowlan: videoscalers: enable error logging on tls terminator envoy [puppet] - https://gerrit.wikimedia.org/r/1073192 (https://phabricator.wikimedia.org/T373517)
[11:57:48] (PS6) Effie Mouzeli: app.job: update to job 3.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1072502
[11:58:04] (CR) Effie Mouzeli: app.job: update to job 3.0.0 (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1072502 (owner: Effie Mouzeli)
[11:58:46] (CR) Brouberol: [C:+2] cloudnative-pg-cluster: enable wal upload / backups to s3 by default [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[11:58:51] (CR) Brouberol: [C:+2] cloudnative-pg-cluster: setup good defaults allowing a cluster to be restored [deployment-charts] - https://gerrit.wikimedia.org/r/1072546 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[12:01:03] (CR) Effie Mouzeli: [C:+1] videoscalers: enable error logging on tls terminator envoy [puppet] - https://gerrit.wikimedia.org/r/1073192 (https://phabricator.wikimedia.org/T373517) (owner: Hnowlan)
[12:01:22] (CR) Brouberol: [C:+1] hdfs: Add new worker hosts to net_topology [puppet] - https://gerrit.wikimedia.org/r/1072660 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[12:01:48] (CR) Brouberol: [C:+1] hdfs: Assign the worker role to new hadoop workers [puppet] - https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[12:02:22] (Merged) jenkins-bot: cloudnative-pg-cluster: enable wal upload / backups to s3 by default [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[12:02:29] (Merged) jenkins-bot: cloudnative-pg-cluster: setup good defaults allowing a cluster to be restored [deployment-charts] - https://gerrit.wikimedia.org/r/1072546 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[12:02:38] (CR) Hnowlan: [C:+2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1073192 (https://phabricator.wikimedia.org/T373517) (owner: Hnowlan)
[12:02:48] (CR) Brouberol: [C:+1] Update the URL of the WikiPathways SPARQL endpoint to use HTTPS [puppet] - https://gerrit.wikimedia.org/r/1072734 (https://phabricator.wikimedia.org/T364448) (owner: Btullis)
[12:03:13] (CR) Brouberol: [C:+1] MPIC: New deployment (v0.1.5) to production [deployment-charts] - https://gerrit.wikimedia.org/r/1073152 (https://phabricator.wikimedia.org/T361346) (owner: Santiago Faci)
[12:04:48] (CR) Muehlenhoff: "LGTM, one additional comment inline" [puppet] - https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) (owner: CDobbins)
[12:13:31] (CR) Kevin Bazira: [C:+2] "Thanks for the review. :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1073176 (https://phabricator.wikimedia.org/T371515) (owner: Kevin Bazira)
[12:14:23] (Merged) jenkins-bot: ml-services: update rec-api image in staging and prod [deployment-charts] - https://gerrit.wikimedia.org/r/1073176 (https://phabricator.wikimedia.org/T371515) (owner: Kevin Bazira)
[12:17:38] (CR) Muehlenhoff: [C:+2] Only run puppetserver spec tests on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1072505 (owner: Muehlenhoff)
[12:18:38] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:19:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1004.wikimedia.org
[12:23:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1004.wikimedia.org
[12:23:30] !log installing glibc bugfix updates from bookworm 12.7 point release
[12:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:00] (PS1) Arturo Borrero Gonzalez: cloud: codfw1dev: have a new bastion host in bastion-codfw1dev-04 [puppet] - https://gerrit.wikimedia.org/r/1073205 (https://phabricator.wikimedia.org/T374828)
[12:28:50] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:29:13] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:30:19] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-analytics-product [puppet] - 10https://gerrit.wikimedia.org/r/1073167 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [12:30:39] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:31:16] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10148655 (10MoritzMuehlenhoff) [12:31:48] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10148656 (10MoritzMuehlenhoff) [12:33:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072556 (https://phabricator.wikimedia.org/T374621) (owner: 10Hamish) [12:33:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072876 (https://phabricator.wikimedia.org/T359998) (owner: 10Hamish) [12:40:05] jouncebot: refresh [12:40:05] I refreshed my knowledge about deployments. 
[12:40:07] jouncebot: now [12:40:07] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [12:40:10] jouncebot: nowandnext [12:40:10] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [12:40:10] In 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1300) [12:40:12] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-platform-eng [puppet] - 10https://gerrit.wikimedia.org/r/1073168 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [12:43:55] (03CR) 10Muehlenhoff: "These Cumin aliases don't define individual clusters, but a combination of roles and data centers? We do the same for mariadb as well (rol" [puppet] - 10https://gerrit.wikimedia.org/r/1071609 (owner: 10Muehlenhoff) [12:44:53] (03PS6) 10Muehlenhoff: puppetmaster::frontend|backend: Read the puppet-merge server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) [12:45:34] (03CR) 10Btullis: [V:03+1 C:03+2] Update the URL of the WikiPathways SPARQL endpoint to use HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/1072734 (https://phabricator.wikimedia.org/T364448) (owner: 10Btullis) [12:49:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [12:50:03] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-research [puppet] - 10https://gerrit.wikimedia.org/r/1073169 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [12:51:25] !log installing node-undici security updates [12:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:36] <_Gerges> jouncebot: next [12:51:36] In 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1300) [12:52:02] (03CR) 10Hamish: Configure 
ContactPage and IPBE contact form on zhwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072876 (https://phabricator.wikimedia.org/T359998) (owner: 10Hamish) [12:53:16] * hashar grabs a coffee [12:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:55:54] RECOVERY - Router interfaces on cr1-magru is OK: OK: host 195.200.68.128, interfaces up: 48, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:57:44] \o/ [12:57:54] _Gerges: if you are around I will start with your patch [12:58:02] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-search [puppet] - 10https://gerrit.wikimedia.org/r/1073170 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [12:58:38] <_Gerges> Here [12:58:56] PROBLEM - Router interfaces on cr1-magru is CRITICAL: CRITICAL: host 195.200.68.128, interfaces up: 47, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:59:56] there is something about running `extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --namespace` [12:59:57] ;) [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1300). [13:00:05] MatmaRex, _Gerges, and Hamishcz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072848 (https://phabricator.wikimedia.org/T374089) (owner: 10GergesShamon) [13:00:12] hi [13:00:17] o/ [13:00:18] hi! [13:00:32] hi :| [13:00:33] (03PS1) 10Hnowlan: videoscaler: bump idle_timeouts for envoy tls terminator [puppet] - 10https://gerrit.wikimedia.org/r/1073209 (https://phabricator.wikimedia.org/T373517) [13:00:51] (03Merged) 10jenkins-bot: [sewikimedia] Enable signatures in the User-namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072848 (https://phabricator.wikimedia.org/T374089) (owner: 10GergesShamon) [13:00:53] MatmaRex: I have no clue why we have to define MW_ENTRY_POINT='static' since I thought w/static.php was simply reading files from disk :D But clearly it loads the whole of MediaWiki! \o/ [13:01:09] that is the first backport of the day apparently [13:01:12] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] [13:01:14] so it might take a while :/ [13:01:17] (03CR) 10Bartosz Dziewoński: "Yes: https://gerrit.wikimedia.org/g/mediawiki/core/+/3925b14ffbc1bb95808ae2befc633f4c35cc4e6d/includes/skins/components/SkinComponentFoote" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071712 (owner: 10Bartosz Dziewoński) [13:01:17] T374089: Enable signatures in the User-namespace for se.wikimedia.org - https://phabricator.wikimedia.org/T374089 [13:01:17] ah no
i feel like maybe it shouldn't run the mediawiki startup code, but i'm definitely not trying to change that now [13:02:52] MatmaRex: definitely not -:-] [13:03:22] some container image is being pushed [13:04:02] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1073171 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [13:04:15] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=deploy1003&var-datasource=thanos&var-cluster=misc&from=now-1h&to=now&viewPanel=8 [13:04:15] hashar: I have a dev version for one of my patches, but I'm not sure the hide-if logic in it is available or not, would you want me to do a test or just sync the stable version? [13:04:23] 10SRE-swift-storage, 06Commons, 07Wikimedia-production-error: API request failed (backend-fail-internal): An unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T337991#10148790 (10GPSLeo) I do not think that this is the same bug. Here the files are indeed not de... [13:05:23] (03CR) 10MVernon: [C:04-1] "I'm not quite clear on the purpose above just using the role directly, then." [puppet] - 10https://gerrit.wikimedia.org/r/1071609 (owner: 10Muehlenhoff) [13:05:48] Hamishcz: no idea, I haven't looked at your patches :) [13:06:08] (03CR) 10Effie Mouzeli: [C:03+1] videoscaler: bump idle_timeouts for envoy tls terminator [puppet] - 10https://gerrit.wikimedia.org/r/1073209 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:06:23] okay..
then leave it alone [13:06:24] lol [13:07:17] 13:06:31 docker_pull_k8s: 68% (in-flight: 80; ok: 298; fail: 0; left: 54) - [13:07:21] still progressing [13:07:30] (03CR) 10Hnowlan: [C:03+2] videoscaler: bump idle_timeouts for envoy tls terminator [puppet] - 10https://gerrit.wikimedia.org/r/1073209 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:08:12] (03PS1) 10Daimona Eaytoy: beta: Enable CampaignEvents Community List [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073211 (https://phabricator.wikimedia.org/T374617) [13:09:29] Hamishcz: one of your change needs to be rebased if you can do that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1072556 :) [13:09:35] Hello there. Would anyone be willing to merge a beta-only patch when the current window is over? TIA! [13:10:02] sure [13:10:52] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1073172 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [13:11:18] 13:10:54 K8s deployment progress: 62% (ok: 5; fail: 0; left: 3) \ [13:11:19] :/ [13:12:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:53] 13:12:38 Finished sync-testservers-k8s (duration: 04m 01s) [13:12:57] so yeah that takes a bit of time :/ [13:13:08] weird [13:13:37] !log hashar@deploy1003 gergesshamon, hashar: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:13:37] !log hashar@deploy1003 Sync cancelled. 
[13:13:40] T374089: Enable signatures in the User-namespace for se.wikimedia.org - https://phabricator.wikimedia.org/T374089 [13:13:44] oh fuck that [13:13:55] Continue with sync? [y/N]: 13:13:36 Sync cancelled. [13:13:59] cause of course NO is the default [13:14:16] so if by mistake you have pressed enter in the terminal previously, that causes the sync to cancel [13:14:17] ... [13:14:31] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] [13:16:59] yeah. It makes you wish that for a CLI tool that has a potentially-long operation followed by a user input prompt, it might make sense to consume and discard all the pending keyboard input before displaying the prompt. [13:17:15] but then if you do that, someone will complain that they pre-pressed some input key and it didn't accept it when expected :P [13:17:35] sounds like space bar heating problem yeah [13:17:43] (but I still think the former is better, if it's a critical confirmation prompt and you want the user to read the output first) [13:17:47] I too [13:17:51] (03PS1) 10Hamish: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 (https://phabricator.wikimedia.org/T374621) [13:18:12] (03PS2) 10Hamish: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 (https://phabricator.wikimedia.org/T374621) [13:18:13] I guess scap python should do something like sys.stdin.flush() [13:18:19] before asking for input [13:18:44] “you want the user to read the output first” – agree, I think [13:18:47] (03PS1) 10Jgreen: Add payments-a-codfw.wikimedia.org 208.80.152.227 A/PTR records. 
[dns] - 10https://gerrit.wikimedia.org/r/1073213 (https://phabricator.wikimedia.org/T373942) [13:18:52] though I also wonder if it would make sense for this prompt to just not have a default [13:18:52] (03CR) 10CI reject: [V:04-1] eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 (https://phabricator.wikimedia.org/T374621) (owner: 10Hamish) [13:19:06] Continue with sync? [y/n] [13:19:10] and any other input -> repeat the question [13:20:01] !log hashar@deploy1003 hashar, gergesshamon: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:20:01] !log hashar@deploy1003 Sync cancelled. [13:20:05] T374089: Enable signatures in the User-namespace for se.wikimedia.org - https://phabricator.wikimedia.org/T374089 [13:20:08] (03PS3) 10Hamish: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 (https://phabricator.wikimedia.org/T374621) [13:20:08] fuck [13:20:10] really [13:20:13] there is no other word [13:20:18] (03CR) 10Elukey: [C:03+1] Add an explicit Hiera variable to determine the active swift ring server [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [13:20:20] * hashar logs out and tries again [13:21:04] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] [13:21:48] (03Abandoned) 10Hamish: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072556 (https://phabricator.wikimedia.org/T374621) (owner: 10Hamish) [13:22:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
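[editor's note] The prompt behaviour discussed in the log above (discard keyboard input buffered during a long operation, and require an explicit y/n with no default) can be sketched in Python. This is an illustrative sketch only, not scap's actual implementation; note that `sys.stdin.flush()` does not discard unread terminal input — `termios.tcflush` with `TCIFLUSH` is the usual way to do that on Unix.

```python
import sys
import termios


def parse_answer(answer):
    """Map user input to True/False, or None meaning 'ask again'."""
    a = answer.strip().lower()
    if a in ("y", "yes"):
        return True
    if a in ("n", "no"):
        return False
    # Bare Enter or anything else: no default, caller should re-prompt.
    return None


def confirm(prompt="Continue with sync? [y/n] "):
    """Ask for an explicit y/n; a stray Enter pressed earlier never counts."""
    if sys.stdin.isatty():
        # Throw away keystrokes buffered while the long operation ran,
        # so pre-pressed input cannot silently answer the prompt.
        termios.tcflush(sys.stdin.fileno(), termios.TCIFLUSH)
    while True:
        result = parse_answer(input(prompt))
        if result is not None:
            return result
```

With this shape, an accidental Enter during the multi-minute sync neither cancels nor confirms: the buffered input is dropped before the prompt, and an empty answer just repeats the question.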
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:22:56] (03PS2) 10Brouberol: Install airflow-dags 2.9.3-py3.10-20240916 by default on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1073173 (https://phabricator.wikimedia.org/T374396) [13:24:39] hashar: i need to step away for some 20 minutes, brb [13:24:51] MatmaRex: yeah don't worry I will handle your patches :) [13:25:09] I opened a brand new terminal and set it aside [13:25:35] !log hashar@deploy1003 gergesshamon, hashar: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:25:41] T374089: Enable signatures in the User-namespace for se.wikimedia.org - https://phabricator.wikimedia.org/T374089 [13:25:45] !log hashar@deploy1003 gergesshamon, hashar: Continuing with sync [13:25:48] (03CR) 10Brouberol: [C:03+2] Install airflow-dags 2.9.3-py3.10-20240916 by default on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1073173 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [13:26:21] (03CR) 10Ssingh: [C:03+1] Add payments-a-codfw.wikimedia.org 208.80.152.227 A/PTR records. [dns] - 10https://gerrit.wikimedia.org/r/1073213 (https://phabricator.wikimedia.org/T373942) (owner: 10Jgreen) [13:26:26] and it takes roughly 4 minutes and 30 seconds of overhead before reaching that prompt [13:27:04] <_Gerges> I did the test, the signature button appears. [13:27:09] _Gerges: thank you! 
[13:30:12] hashar, I have to leave at the moment, please forget my patches for this window:) [13:30:23] sorry for any inconvenience [13:30:27] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1 [13:30:29] Hamishcz: sorry everything is so slow today :/ [13:30:31] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s3 [13:30:43] Hamishcz: I will do the throttling ones at least [13:30:47] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10148864 (10elukey) Cross-posting from T365167#10148384, where I am testing a reimage for sretest2001. On sretest2001 we have 10G/25G cap... [13:30:53] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424 [13:30:57] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [13:31:08] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424 [13:31:13] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424 [13:31:28] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424 [13:33:18] (03CR) 10Ssingh: "zone file changes look OK, I leave the hostname to your expertise :)" [dns] - 10https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741) (owner: 10Dwisehaupt) [13:33:21] (03CR) 10Ssingh: [C:03+1] frack: remove fraban2001 from dns for decommissioning [dns] - 10https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741) (owner: 
10Dwisehaupt) [13:34:48] I don't know what is happening with kubernetes today [13:34:58] it looks like everything is slower than usual [13:35:36] that one has been going for 15 minutes already :/ [13:36:39] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] (duration: 15m 35s) [13:36:41] !log sudo cumin "A:cp" 'disable-puppet "merging CR 1072566"' [13:36:43] T374089: Enable signatures in the User-namespace for se.wikimedia.org - https://phabricator.wikimedia.org/T374089 [13:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:52] pff so one done [13:37:03] <_Gerges> :) [13:37:25] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1017.eqiad.wmnet [13:37:25] <_Gerges> Thanks [13:37:47] !log mwmaint: mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki=sewikimedia --current --namespace 2 # T374089 [13:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072624 (https://phabricator.wikimedia.org/T374286) (owner: 10Bartosz Dziewoński) [13:38:58] (03CR) 10Ssingh: [C:03+2] haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [13:39:09] (03Merged) 10jenkins-bot: Define MW_ENTRY_POINT in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072624 (https://phabricator.wikimedia.org/T374286) (owner: 10Bartosz Dziewoński) [13:39:22] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072624|Define MW_ENTRY_POINT in static.php (T374286)]] [13:39:25] T374286: On sso.wikimedia.beta.wmflabs.org login page, the "Powered by MediaWiki" icon does not render - 
https://phabricator.wikimedia.org/T374286 [13:40:31] (03CR) 10Muehlenhoff: [C:03+2] puppetmaster::frontend|backend: Read the puppet-merge server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [13:40:36] 06SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Cyndywikime - https://phabricator.wikimedia.org/T374595#10148900 (10Vgutierrez) 05Stalled→03Declined After double checking that I get the very same errors as @Cyndymediawiksim it looks like it's an issue with that specific superset dashboa... [13:40:42] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1017.eqiad.wmnet [13:40:48] PROBLEM - MariaDB read only wikireplica-s1 on clouddb1017 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:40:48] PROBLEM - MariaDB read only s1 on clouddb1017 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:40:48] PROBLEM - MariaDB read only s3 on clouddb1017 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:40:48] PROBLEM - MariaDB read only wikireplica-s3 on clouddb1017 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:41:04] PROBLEM - mysqld processes on clouddb1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:41:24] PROBLEM - MariaDB Replica SQL: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:41:24] PROBLEM - MariaDB Replica IO: s1 on clouddb1017 is CRITICAL: CRITICAL 
slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:41:42] PROBLEM - MariaDB Replica IO: s3 on clouddb1017 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:41:42] PROBLEM - MariaDB Replica SQL: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:04] RECOVERY - mysqld processes on clouddb1017 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:42:24] RECOVERY - MariaDB Replica SQL: s1 on clouddb1017 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:24] RECOVERY - MariaDB Replica IO: s1 on clouddb1017 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:30] hashar: (back) [13:42:40] (03CR) 10Jforrester: Improve $wgFooterIcons override, remove $wmgWikimediaIcon (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071712 (owner: 10Bartosz Dziewoński) [13:42:42] RECOVERY - MariaDB Replica IO: s3 on clouddb1017 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:42] RECOVERY - MariaDB Replica SQL: s3 on clouddb1017 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:48] RECOVERY - MariaDB read only s1 on clouddb1017 is OK: Version 10.6.19-MariaDB, Uptime 51s, read_only: True, event_scheduler: False, 482.27 QPS, connection latency: 0.029644s, query latency: 0.000587s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:42:48] RECOVERY - MariaDB read 
only wikireplica-s1 on clouddb1017 is OK: Version 10.6.19-MariaDB, Uptime 51s, read_only: True, event_scheduler: False, 477.29 QPS, connection latency: 0.029675s, query latency: 0.000602s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:42:48] RECOVERY - MariaDB read only wikireplica-s3 on clouddb1017 is OK: Version 10.6.19-MariaDB, Uptime 48s, read_only: True, event_scheduler: False, 907.88 QPS, connection latency: 0.019413s, query latency: 0.000481s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:42:49] RECOVERY - MariaDB read only s3 on clouddb1017 is OK: Version 10.6.19-MariaDB, Uptime 48s, read_only: True, event_scheduler: False, 903.31 QPS, connection latency: 0.017440s, query latency: 0.000465s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:42:58] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet [reason: testing NOOP CR but depooling to be extra sure] [13:42:58] MatmaRex: the MW_ENTRY_POINT patch is being deployed [13:43:01] !log hashar@deploy1003 hashar, matmarex: Backport for [[gerrit:1072624|Define MW_ENTRY_POINT in static.php (T374286)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:43:10] (03PS8) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) [13:43:20] MatmaRex: I already ran the beta updater [13:43:23] hashar: thanks. 
currently it only affects the beta cluster [13:43:29] !log hashar@deploy1003 hashar, matmarex: Continuing with sync [13:43:31] (03CR) 10JMeybohm: [C:03+1] app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 (owner: 10Effie Mouzeli) [13:43:32] \o/ [13:43:46] beta is syncing the code https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/172385/console [13:44:03] I will do Hamishcz's throttling patch next [13:44:10] hmm no [13:44:16] let's do the default to log any error [13:45:32] (03PS1) 10Hnowlan: envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) [13:47:39] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet [reason: [done] testing NOOP CR but depooling to be extra sure] [13:48:14] (03CR) 10Effie Mouzeli: [C:03+2] app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 (owner: 10Effie Mouzeli) [13:48:32] !log sudo cumin -b11 "A:cp" 'run-puppet-agent --enable "merging CR 1072566"' [13:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:54] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s3 [13:48:58] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1 [13:49:25] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072624|Define MW_ENTRY_POINT in static.php (T374286)]] (duration: 10m 03s) [13:49:29] T374286: On sso.wikimedia.beta.wmflabs.org login page, the "Powered by MediaWiki" icon does not render - https://phabricator.wikimedia.org/T374286 [13:49:41] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s2 [13:49:45] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s7 
[13:50:04] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Upgrading mariadb on clouddb1018 T365424 [13:50:08] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [13:50:20] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Upgrading mariadb on clouddb1018 T365424 [13:51:18] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3991/console" [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:51:54] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:51:56] RECOVERY - Router interfaces on cr1-magru is OK: OK: host 195.200.68.128, interfaces up: 48, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:52:10] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:54:09] ok next [13:54:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [13:54:55] (03PS3) 10Hashar: logging: Default to log any error (on group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [13:54:55] jouncebot: now [13:54:55] For the next 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1300) [13:55:01] jouncebot: next [13:55:01] In 1 hour(s) and 34 minute(s): Wikimedia 
Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1530) [13:55:01] (03CR) 10TrainBranchBot: "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [13:55:38] effie: I am going to extend the backport window given all the slowness we had earlier [13:55:55] (03Merged) 10jenkins-bot: logging: Default to log any error (on group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [13:56:05] hashar: no problem, I saw that you lot were busy [13:56:07] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072207|logging: Default to log any error (on group0) (T228838)]] [13:56:10] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [13:56:16] tx hashar [13:57:20] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1018.eqiad.wmnet [13:57:43] (03PS9) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) [13:58:05] !log hashar@deploy1003 matmarex, hashar: Backport for [[gerrit:1072207|logging: Default to log any error (on group0) (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:58:20] and I guess I will escalate wikifunctions [13:58:25] cause it is still spamming the logs :D [13:58:30] (03PS2) 10Hnowlan: envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) [13:59:34] !log hashar@deploy1003 matmarex, hashar: Continuing with sync [14:00:18] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1018.eqiad.wmnet [14:00:22] PROBLEM - 
MariaDB Replica SQL: s7 on clouddb1018 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:00:24] PROBLEM - MariaDB Replica SQL: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:00:24] PROBLEM - MariaDB Replica IO: s7 on clouddb1018 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:00:46] PROBLEM - mysqld processes on clouddb1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:00:48] PROBLEM - MariaDB read only wikireplica-s2 on clouddb1018 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:00:48] PROBLEM - MariaDB read only wikireplica-s7 on clouddb1018 is CRITICAL: Could not connect to localhost:3317 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:00:49] PROBLEM - MariaDB read only s2 on clouddb1018 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:00:49] PROBLEM - MariaDB read only s7 on clouddb1018 is CRITICAL: Could not connect to localhost:3317 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:00:52] PROBLEM - MariaDB Replica IO: s2 on clouddb1018 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:00:55] (03CR) 10CI reject: [V:04-1] envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:01:02] (03PS3) 
10Hnowlan: envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) [14:01:44] RECOVERY - mysqld processes on clouddb1018 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:01:50] RECOVERY - MariaDB read only wikireplica-s2 on clouddb1018 is OK: Version 10.6.19-MariaDB, Uptime 43s, read_only: True, event_scheduler: False, 23.13 QPS, connection latency: 0.018652s, query latency: 0.000348s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:01:50] RECOVERY - MariaDB read only wikireplica-s7 on clouddb1018 is OK: Version 10.6.19-MariaDB, Uptime 40s, read_only: True, event_scheduler: False, 22.89 QPS, connection latency: 0.016599s, query latency: 0.000681s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:01:51] RECOVERY - MariaDB read only s2 on clouddb1018 is OK: Version 10.6.19-MariaDB, Uptime 43s, read_only: True, event_scheduler: False, 22.96 QPS, connection latency: 0.027851s, query latency: 0.000655s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:01:51] RECOVERY - MariaDB read only s7 on clouddb1018 is OK: Version 10.6.19-MariaDB, Uptime 40s, read_only: True, event_scheduler: False, 22.82 QPS, connection latency: 0.030815s, query latency: 0.000489s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:02:09] (03CR) 10Santiago Faci: [C:03+2] MPIC: New deployment (v0.1.5) to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073152 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [14:02:26] (03PS1) 10AOkoth: vrts: fix install script [puppet] - 10https://gerrit.wikimedia.org/r/1073224 (https://phabricator.wikimedia.org/T373420) [14:02:42] PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is 
CRITICAL: CRITICAL slave_sql_lag Replication lag: 740.60 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:03:17] (03Merged) 10jenkins-bot: MPIC: New deployment (v0.1.5) to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073152 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [14:03:22] RECOVERY - MariaDB Replica SQL: s7 on clouddb1018 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:03:22] RECOVERY - MariaDB Replica IO: s7 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:03:22] RECOVERY - MariaDB Replica SQL: s2 on clouddb1018 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:03:24] (03CR) 10CI reject: [V:04-1] envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:03:52] RECOVERY - MariaDB Replica IO: s2 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:04:06] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072207|logging: Default to log any error (on group0) (T228838)]] (duration: 07m 59s) [14:04:07] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Create PCC Puppet 8 nodes - https://phabricator.wikimedia.org/T374495#10149014 (10jhathaway) p:05Triage→03Medium [14:04:10] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [14:04:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 
(https://phabricator.wikimedia.org/T374621) (owner: 10Hamish) [14:04:55] MatmaRex: I am watching https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&refresh=5m&from=now-1h&to=now [14:05:08] thanks [14:05:21] (03Merged) 10jenkins-bot: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 (https://phabricator.wikimedia.org/T374621) (owner: 10Hamish) [14:05:28] hashar: i'm only half-following now because we have a meeting [14:05:33] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1073212|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374621)]] [14:05:36] T374621: Lift IP cap on this dates 27/09 and 28/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T374621 [14:05:42] RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:05:43] MatmaRex: no worries, I am watching the logs :) [14:05:47] (03PS4) 10Hnowlan: envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) [14:06:22] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s7 [14:06:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073211 (https://phabricator.wikimedia.org/T374617) (owner: 10Daimona Eaytoy) [14:06:26] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s2 [14:07:13] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" 
[puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:07:26] !log hashar@deploy1003 hamishz, hashar: Backport for [[gerrit:1073212|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374621)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:07:32] !log hashar@deploy1003 hamishz, hashar: Continuing with sync [14:07:37] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 [14:07:40] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s6 [14:09:50] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Upgrading mariadb on clouddb1019 T365424 [14:09:54] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [14:10:05] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Upgrading mariadb on clouddb1019 T365424 [14:10:54] MatmaRex: I think it is all good. Happy meeting! [14:10:55] (03PS5) 10Hnowlan: envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) [14:11:45] hashar: is it live? 
i expected to see *some* new errors :o [14:11:58] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10149030 (10joanna_borun) p:05Triage→03Low [14:12:10] I am still digging in logstash [14:12:14] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10149032 (10elukey) [14:12:14] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073212|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374621)]] (duration: 06m 41s) [14:12:17] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:12:18] T374621: Lift IP cap on this dates 27/09 and 28/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T374621 [14:12:19] + it is only on group0 so far [14:14:16] MatmaRex: I will add a patch for group1 and deploy it tomorrow [14:15:03] I have deployed everything besides the ContactPage patch https://gerrit.wikimedia.org/r/c/1072876/ [14:15:18] !log Afternoon backport window is complete [14:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:22] (03CR) 10Volans: "Will this clear up the access?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) (owner: 10FNegri) [14:15:24] effie: ^ I am done [14:18:21] (03PS15) 10Hashar: logging: Default to log any error (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [14:18:21] (03PS1) 10Hashar: logging: Default to log any error (on group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073232 (https://phabricator.wikimedia.org/T228838) [14:18:38] thanks hashar [14:19:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073232 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [14:19:39] and it is scheduled! [14:19:47] I have no idea how that tool works but it does work! [14:20:05] (03CR) 10FNegri: "yes because https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/manifests/cumin/target.pp#34" [puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) (owner: 10FNegri) [14:20:13] swfrench-wmf: I am done with the backport window, but please sync with effie who seems to have something pending as well :) [14:20:34] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Per host access control for kerberized SSH - https://phabricator.wikimedia.org/T276790#10149062 (10joanna_borun) dependent on https://phabricator.wikimedia.org/T244840 [14:20:53] hashar: tx! 
[14:21:27] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) (owner: 10FNegri) [14:22:09] (03CR) 10Jcrespo: [C:03+1] R:wmcs::db::wikireplicas remove access from cloudcumin [puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) (owner: 10FNegri) [14:22:17] 06SRE, 06Infrastructure-Foundations, 10Mail, 07Surveys: Qualtrics cannot send email to wikimedia.org addresses - https://phabricator.wikimedia.org/T176666#10149077 (10joanna_borun) 05Open→03Declined [14:22:32] (03CR) 10Ayounsi: [C:03+2] Remove RPKI rsync alerting [alerts] - 10https://gerrit.wikimedia.org/r/1068019 (owner: 10Ayounsi) [14:23:02] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10149067 (10ayounsi) Those won't be in a VC, especially as we didn't pay for the extra VC license :) This means a bit more manual config until... [14:23:17] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Release-Engineering-Team (Seen): Support running puppet Beaker on CI - https://phabricator.wikimedia.org/T253635#10149074 (10joanna_borun) @hashar is this task still valid? 
[14:23:25] (03CR) 10FNegri: [C:03+2] R:wmcs::db::wikireplicas remove access from cloudcumin [puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) (owner: 10FNegri) [14:23:52] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374808#10149088 (10Jhancock.wm) T374422 working on it [14:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:24:08] (03Merged) 10jenkins-bot: Remove RPKI rsync alerting [alerts] - 10https://gerrit.wikimedia.org/r/1068019 (owner: 10Ayounsi) [14:24:38] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1019.eqiad.wmnet [14:25:37] 06SRE, 06Infrastructure-Foundations, 05Goal: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747#10149092 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is an old umbrella task which is no longer useful by itself. 
Closing [14:26:23] 10SRE-tools, 10Icinga, 06Infrastructure-Foundations: get-raid-status-perccli should allow for commands to return non-zero exit code - https://phabricator.wikimedia.org/T320998#10149097 (10SLyngshede-WMF) p:05Medium→03Low a:03SLyngshede-WMF [14:27:17] (03CR) 10Filippo Giunchedi: [C:03+1] ripeatlas: add ping to wmf anchors check [alerts] - 10https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [14:27:34] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, once the prometheus-equivalent alerts are deployed" [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [14:27:37] (03PS4) 10Jelto: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) [14:27:47] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1019.eqiad.wmnet [14:27:48] PROBLEM - MariaDB read only wikireplica-s4 on clouddb1019 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:27:48] PROBLEM - MariaDB read only s6 on clouddb1019 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:27:49] PROBLEM - MariaDB read only s4 on clouddb1019 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:27:49] PROBLEM - MariaDB read only wikireplica-s6 on clouddb1019 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:27:50] PROBLEM - MariaDB Replica SQL: s6 on clouddb1019 is CRITICAL: CRITICAL slave_sql_state could not connect 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:27:51] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:28:24] PROBLEM - MariaDB Replica Lag: s6 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1004.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:28:48] 06SRE, 06Infrastructure-Foundations, 10Mail: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238#10149110 (10jhathaway) 05Open→03Resolved a:03jhathaway Fixed with change in the config, also no longer relative, as we are now running Postfix [14:28:49] RECOVERY - MariaDB read only s6 on clouddb1019 is OK: Version 10.6.19-MariaDB, Uptime 40s, read_only: True, event_scheduler: False, 656.84 QPS, connection latency: 0.025223s, query latency: 0.000513s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:28:49] RECOVERY - MariaDB read only wikireplica-s4 on clouddb1019 is OK: Version 10.6.19-MariaDB, Uptime 44s, read_only: True, event_scheduler: False, 537.16 QPS, connection latency: 0.015571s, query latency: 0.000441s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:28:49] RECOVERY - MariaDB read only wikireplica-s6 on clouddb1019 is OK: Version 10.6.19-MariaDB, Uptime 40s, read_only: True, event_scheduler: False, 663.42 QPS, connection latency: 0.017064s, query latency: 0.000397s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:28:49] RECOVERY - MariaDB read only s4 on clouddb1019 is OK: Version 10.6.19-MariaDB, Uptime 44s, read_only: True, event_scheduler: False, 531.13 QPS, connection latency: 0.025268s, query latency: 0.000450s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:28:50] 06SRE, 06Infrastructure-Foundations: keyholder: continue to arm keys if one fails - https://phabricator.wikimedia.org/T227272#10149113 (10joanna_borun) 05Open→03Resolved [14:28:51] RECOVERY - MariaDB Replica SQL: s6 on clouddb1019 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:30:24] RECOVERY - MariaDB Replica Lag: s6 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:30:50] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:31:26] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s6 [14:31:29] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 [14:31:54] 06SRE, 13Patch-For-Review: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088#10149115 (10CDanis) a:03mark I think this is being handled as part of the Ownership WG [14:33:34] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Improve sre.hosts.decommission (additionally find host yaml files) - https://phabricator.wikimedia.org/T257297#10149234 (10elukey) 05Open→03Declined Probably not needed anymore :) [14:33:37] hashar: meant to say before, thank you :) [14:34:04] 10SRE-tools, 06Infrastructure-Foundations: Clarify 'wipe bootloader' step in sre.hosts.decommission - https://phabricator.wikimedia.org/T283204#10149250 (10joanna_borun) 05Open→03Declined [14:34:42] (03PS1) 10JMeybohm: Don't restart(stop,start) ferm on puppet notify, use reload instead [puppet] - 10https://gerrit.wikimedia.org/r/1073233 (https://phabricator.wikimedia.org/T374366) [14:35:46] 
(03CR) 10JMeybohm: "I tried to make the comments a bit more clear - not sure if I succeeded with that" [puppet] - 10https://gerrit.wikimedia.org/r/1073233 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [14:36:04] 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209#10149255 (10Volans) 05Open→03Resolved a:03Volans The alertmanager support has been in place for a long time. Resolving. Any additional feature wil... [14:36:06] 10SRE-tools, 06Infrastructure-Foundations: Clarify 'wipe bootloader' step in sre.hosts.decommission - https://phabricator.wikimedia.org/T283204#10149270 (10Volans) As there were no agreement here on task and multiple years have passed we decided to close it. Feel free to reopen in case there is more consen... [14:37:48] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet for datacenter switchover from codfw to eqiad [14:37:51] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) for datacenter switchover from codfw to eqiad [14:39:09] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks for datacenter switchover from codfw to eqiad [14:39:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:28] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) for datacenter switchover from codfw to eqiad [14:40:02] 06SRE, 06Infrastructure-Foundations: Simplify hiera lookup model - https://phabricator.wikimedia.org/T106404#10149291 (10joanna_borun) 05Open→03Declined [14:41:04] 06SRE, 06Infrastructure-Foundations: Simplify hiera lookup model 
- https://phabricator.wikimedia.org/T106404#10149290 (10joanna_borun) It has been working fine for now but we're open for specific proposals. [14:41:44] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl for datacenter switchover from codfw to eqiad [14:42:50] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [14:43:06] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [14:47:30] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) for datacenter switchover from codfw to eqiad [14:47:43] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from codfw to eqiad [14:47:59] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) for datacenter switchover from codfw to eqiad [14:50:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:53:56] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:54:20] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from codfw to eqiad [14:54:20] !log swfrench@cumin1002 [DRY-RUN] MediaWiki read-only period starts at: 2024-09-16 14:54:20.136310 [14:54:22] !log installing gdk-pixbuf security updates [14:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:35] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) for datacenter switchover from codfw 
to eqiad [14:54:48] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly for datacenter switchover from codfw to eqiad [14:55:23] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) for datacenter switchover from codfw to eqiad [14:56:03] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki for datacenter switchover from codfw to eqiad [14:56:17] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) for datacenter switchover from codfw to eqiad [14:57:11] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite for datacenter switchover from codfw to eqiad [14:57:15] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) for datacenter switchover from codfw to eqiad [14:57:25] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from codfw to eqiad [14:57:30] !log swfrench@cumin1002 [DRY-RUN] MediaWiki read-only period ends at: 2024-09-16 14:57:30.267664 [14:57:31] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) for datacenter switchover from codfw to eqiad [14:57:48] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner for datacenter switchover from codfw to eqiad [14:57:51] !log root@deploy1003 helmfile [codfw] START helmfile.d/services/mw-jobrunner: sync [14:58:26] !log root@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: sync [14:58:28] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=0) for datacenter switchover from codfw to eqiad [14:59:08] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from codfw to eqiad [15:01:11] !log swfrench@cumin1002 END (PASS) - 
Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) for datacenter switchover from codfw to eqiad [15:01:23] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl for datacenter switchover from codfw to eqiad [15:02:03] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) for datacenter switchover from codfw to eqiad [15:03:17] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters for datacenter switchover from codfw to eqiad [15:04:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:32] i wondered what generates more logs, group0 wikis or the beta cluster. it looks like group0 is about 2x the volume (50k vs 25k logging messages per hour). 
[15:13:47] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) for datacenter switchover from codfw to eqiad [15:25:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69134 and previous config saved to /var/cache/conftool/dbconfig/20240916-152556-arnaudb.json [15:26:00] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [15:26:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69135 and previous config saved to /var/cache/conftool/dbconfig/20240916-152601-arnaudb.json [15:26:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69136 and previous config saved to /var/cache/conftool/dbconfig/20240916-152606-arnaudb.json [15:26:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69137 and previous config saved to /var/cache/conftool/dbconfig/20240916-152611-arnaudb.json [15:26:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69138 and previous config saved to /var/cache/conftool/dbconfig/20240916-152616-arnaudb.json [15:26:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69139 and previous config saved to /var/cache/conftool/dbconfig/20240916-152621-arnaudb.json [15:26:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69140 and previous config saved to /var/cache/conftool/dbconfig/20240916-152626-arnaudb.json [15:30:05] jan_drewniak: May I have your attention please! Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1530) [15:34:00] !log dancy@deploy1003 Started deploy [releng/phatality@8ddb2fa]: (no justification provided) [15:34:16] !log dancy@deploy1003 Finished deploy [releng/phatality@8ddb2fa]: (no justification provided) (duration: 00m 15s) [15:36:28] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash2032.codfw.wmnet, logstash2030.codfw.wmnet, logstash2024.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1030.eqiad.wmnet, logstash1031.eqiad.wmnet, logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:36] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash2031.codfw.wmnet, logstash2032.codfw.wmnet, logstash2030.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:44] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1032.eqiad.wmnet, logstash1030.eqiad.wmnet, logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:57] FIRING: [2x] ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:21] !incidents [15:37:22] 5171 (UNACKED) [2x] ProbeDown sre (ip4 kibana7:443 probes/service http_kibana7_ip4) [15:37:31] !ack 5171 [15:37:32] 5171 (ACKED) [2x] ProbeDown sre (ip4 kibana7:443 probes/service http_kibana7_ip4) [15:37:36] thanks [15:37:40] you were faster [15:37:50] :D [15:38:29] 
yep it's down all right [15:39:02] any o11y folks around, doing work on ELK right now? [15:39:02] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:39:57] there is an alert on their chan just at the time of the issue: FIRING: [12x] SystemdUnitFailed: opensearch-dashboards.service [15:40:00] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:38] dancy: I think your phatality deploy might be the trigger here, are you looking at that already? [15:40:53] ah the runbook is missing from https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 [15:41:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69141 and previous config saved to /var/cache/conftool/dbconfig/20240916-154101-arnaudb.json [15:41:06] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [15:41:06] That's quite possible. Digging in now. 
[15:41:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69142 and previous config saved to /var/cache/conftool/dbconfig/20240916-154106-arnaudb.json [15:41:09] hey, no not aware of any work no [15:41:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69143 and previous config saved to /var/cache/conftool/dbconfig/20240916-154112-arnaudb.json [15:41:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69144 and previous config saved to /var/cache/conftool/dbconfig/20240916-154116-arnaudb.json [15:41:17] dancy: thanks <3 lmk if you need anything [15:41:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69145 and previous config saved to /var/cache/conftool/dbconfig/20240916-154121-arnaudb.json [15:41:26] thanks herron [15:41:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69146 and previous config saved to /var/cache/conftool/dbconfig/20240916-154127-arnaudb.json [15:41:28] I got sudo warnings during the deployment. [15:41:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69147 and previous config saved to /var/cache/conftool/dbconfig/20240916-154132-arnaudb.json [15:41:47] (sorry for the spamlog 😬) [15:41:51] !log dancy@deploy1003 Started deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 [15:41:52] only ELK unavailability, right ATM ? 
[15:42:00] as far as o11y show yep [15:42:13] !log dancy@deploy1003 Finished deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 (duration: 00m 21s) [15:42:16] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:42:18] hrm [15:42:26] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [15:42:39] 👀 [15:42:55] Sep 16 15:42:14 logstash1023 opensearch-dashboards[2609202]: {"type":"log","@timestamp":"2024-09-16T15:42:14Z","tags":["fatal","root"],"pid":2609202,"message":"Error: Plugin with id \"phatality\" is already registered!\n at MergeMapSubscriber.project [15:43:51] Can someone try running `/usr/bin/systemctl restart opensearch-dashboards` on one of the problem hosts? [15:44:02] can do, stand by [15:44:06] thx [15:44:51] !log rzl@logstash1032:~$ sudo systemctl restart opensearch-dashboards [15:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:30] looks like it's still dying with the same error herron pasted [15:45:50] hrmm.. ok.. I will try rolling back... [15:46:33] !log dancy@deploy1003 Started deploy [releng/phatality@b1a2a70]: Attempting to revert [15:46:40] !log dancy@deploy1003 Finished deploy [releng/phatality@b1a2a70]: Attempting to revert (duration: 00m 06s) [15:46:57] want another bounce? [15:47:13] * swfrench-wmf is here as well now, but in a holding pattern for the moment [15:47:16] The revert deployment seemed to be successful (no complaints, as opposed to the original attempt) [15:47:48] swfrench-wmf: do you mind getting a status doc open? 
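(Editorial aside.) The fatal error pasted above from logstash1023's journal is one JSON record per line in the opensearch-dashboards log format. A minimal sketch of pulling the "fatal"-tagged messages out of such a dump; the sample record is abridged from the paste above and the helper name is illustrative:

```python
import json

def fatal_messages(journal_lines):
    """Return the 'message' field of every JSON log record tagged 'fatal'."""
    out = []
    for line in journal_lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # skip non-JSON journal noise (systemd prefixes, tracebacks)
        if "fatal" in rec.get("tags", []):
            out.append(rec["message"])
    return out

# Abridged from the logstash1023 journal paste above
sample = ['{"type":"log","tags":["fatal","root"],"pid":2609202,'
          '"message":"Error: Plugin with id \\"phatality\\" is already registered!"}']
print(fatal_messages(sample))
```

Filtering on the `tags` array rather than grepping free text keeps tracebacks and info-level chatter out of the result.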
[15:47:48] Another bounce couldn't hurt for verification [15:47:55] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s5 [15:48:02] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s8 [15:48:04] rzl: ack, can do [15:48:26] !log rzl@logstash1032:~$ sudo systemctl restart opensearch-dashboards [15:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:41] dancy: nop, same error [15:48:51] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrading mariadb on clouddb1020 T365424 [15:48:54] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [15:49:01] !log logstash1023:/usr/share/opensearch-dashboards/bin# /usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin remove phatality [15:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:07] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrading mariadb on clouddb1020 T365424 [15:49:12] trying this as a stopgap ^^ [15:49:44] FIRING: ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:51:57] FIRING: [2x] ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:22] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1020.eqiad.wmnet [15:52:31] herron, dancy: 
logstash1023 is looking healthy after that, should we do it everywhere? [15:52:41] Yes please [15:52:59] herron: do you want to cumin that out or shall I? [15:53:21] RESOLVED: ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:25] rzl: go for it, fwiw I also just depooled logstash1032 and am able to get to the logstash UI at this point [15:53:38] oh, sweet [15:53:39] that 502 seemed to defy health checks? I'm not sure off hand [15:53:52] actually we can leave 1032 depooled and unfixed if dancy wants it for investigation [15:54:07] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:54:26] That would be convenient. The deployment script clearly needs some work. 
[15:55:23] ok, I'll leave things as-is for the time being [15:55:44] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1020.eqiad.wmnet [15:55:45] PROBLEM - MariaDB Replica SQL: s8 on clouddb1020 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:55:49] PROBLEM - MariaDB read only s8 on clouddb1020 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:55:49] PROBLEM - MariaDB read only wikireplica-s5 on clouddb1020 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:55:49] PROBLEM - MariaDB read only wikireplica-s8 on clouddb1020 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:55:49] PROBLEM - MariaDB read only s5 on clouddb1020 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:56:03] PROBLEM - mysqld processes on clouddb1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:56:08] hashar, sorry i was in the middle of something IRL, thank you for the deployment [15:56:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69148 and previous config saved to /var/cache/conftool/dbconfig/20240916-155607-arnaudb.json [15:56:11] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [15:56:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69149 and previous config saved to 
/var/cache/conftool/dbconfig/20240916-155612-arnaudb.json [15:56:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69150 and previous config saved to /var/cache/conftool/dbconfig/20240916-155617-arnaudb.json [15:56:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69151 and previous config saved to /var/cache/conftool/dbconfig/20240916-155622-arnaudb.json [15:56:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69152 and previous config saved to /var/cache/conftool/dbconfig/20240916-155627-arnaudb.json [15:56:29] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69153 and previous config saved to /var/cache/conftool/dbconfig/20240916-155632-arnaudb.json [15:56:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69154 and previous config saved to /var/cache/conftool/dbconfig/20240916-155637-arnaudb.json [15:56:51] RECOVERY - MariaDB read only wikireplica-s5 on clouddb1020 is OK: Version 10.6.19-MariaDB, Uptime 2s, read_only: True, event_scheduler: False, 22.76 QPS, connection latency: 0.023102s, query latency: 0.007230s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:56:51] RECOVERY - MariaDB read only s5 on clouddb1020 is OK: Version 10.6.19-MariaDB, Uptime 2s, read_only: True, event_scheduler: False, 22.91 QPS, connection latency: 0.027916s, query latency: 0.000614s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:57:03] RECOVERY - mysqld processes on clouddb1020 is OK: PROCS 
OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:57:20] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:57:45] RECOVERY - MariaDB Replica SQL: s8 on clouddb1020 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:51] RECOVERY - MariaDB read only wikireplica-s8 on clouddb1020 is OK: Version 10.6.19-MariaDB, Uptime 58s, read_only: True, event_scheduler: False, 1063.60 QPS, connection latency: 0.011424s, query latency: 0.000285s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:57:51] RECOVERY - MariaDB read only s8 on clouddb1020 is OK: Version 10.6.19-MariaDB, Uptime 58s, read_only: True, event_scheduler: False, 1125.21 QPS, connection latency: 0.012967s, query latency: 0.000319s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:57:56] herron: sanity check? I'm about to run `cumin 'O:logging::opensearch::collector and not logstash1032.eqiad.wmnet' '/usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin remove phatality'` [15:58:26] rzl: Please leave the broken one installed. 
[15:58:45] yep, `and not logstash1032` will exclude it [15:58:50] gotcha [15:59:21] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org,service=recdns [15:59:32] rzl: it may need --allow-root as well [15:59:42] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org,service=recdns [15:59:46] I tried to run as opensearch-dashboards myself but got a homedir error so just ran with --allow-root [15:59:46] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s8 [15:59:49] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s5 [15:59:57] ack, thanks [16:00:37] !log rzl@cumin1002:~$ sudo cumin 'O:logging::opensearch::collector and not logstash1032.eqiad.wmnet' '/usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin --allow-root remove phatality' [16:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:03] failed on logstash1023 with `Unable to remove plugin because of error: "Plugin [phatality] is not installed"`, expected -- succeeded everywhere else [16:01:31] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [16:02:12] rzl: nice on thank you [16:02:15] one* [16:06:08] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:07:53] !log testing strict mode on puppetservers [16:07:55] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:53] herron: any reason I shouldn't restart opensearch-dashboards on all those hosts? [16:09:02] puppet will do it, but any reason it needs to be staggered? [16:09:09] rzl: I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073253 deals w/ the root cause. 
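(Editorial aside.) The fleet-wide cumin run above "failed" on logstash1023 only because the plugin had already been removed there by hand, so that exit is really a success. If this stopgap were ever scripted, the "not installed" case should be treated as idempotent. A sketch under that assumption; the function name is hypothetical, while the binary path, `--allow-root` flag, and error string are taken from the log above:

```python
import subprocess

# Path and flag as used in the cumin run above
PLUGIN_BIN = "/usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin"

def remove_plugin(name, runner=subprocess.run):
    """Remove a dashboards plugin, treating 'not installed' as already done."""
    proc = runner([PLUGIN_BIN, "--allow-root", "remove", name],
                  capture_output=True, text=True)
    if proc.returncode == 0:
        return "removed"
    if "is not installed" in (proc.stdout + proc.stderr):
        return "already-absent"  # the logstash1023 case: removed by hand earlier
    raise RuntimeError(proc.stderr.strip() or proc.stdout.strip())
```

The injectable `runner` is only there so the logic can be exercised without the real binary installed.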
[16:09:13] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [16:09:33] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:09:40] rzl: it looks like probes are still failing in codfw, were those hosts in the set your cumin run completed on? [16:10:02] swfrench-wmf: yes, but they'll need the systemd unit restarted too [16:10:09] I'm going to JFDI [16:10:11] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:10:21] rzl: I'd say check status first, I think systemd did the right thing? [16:10:26] rzl: ah, got it - ack [16:10:27] holding [16:10:44] herron: I see it `running` on some hosts, `failed` on others [16:11:02] running on the hosts where we've either restarted it by hand or puppet has run in the meantime [16:11:13] rzl: kk yeah I think if we can limit to the failed hosts that'd be ideal [16:11:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69155 and previous config saved to /var/cache/conftool/dbconfig/20240916-161113-arnaudb.json [16:11:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69156 and previous config saved to /var/cache/conftool/dbconfig/20240916-161117-arnaudb.json [16:11:18] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [16:11:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69157 and previous config saved to /var/cache/conftool/dbconfig/20240916-161123-arnaudb.json [16:11:24] herron: sure, here goes [16:11:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 8%: 
T374623', diff saved to https://phabricator.wikimedia.org/P69158 and previous config saved to /var/cache/conftool/dbconfig/20240916-161128-arnaudb.json [16:11:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69159 and previous config saved to /var/cache/conftool/dbconfig/20240916-161133-arnaudb.json [16:11:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69160 and previous config saved to /var/cache/conftool/dbconfig/20240916-161138-arnaudb.json [16:11:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69161 and previous config saved to /var/cache/conftool/dbconfig/20240916-161143-arnaudb.json [16:12:27] !log rzl@cumin1002:~$ sudo cumin logstash[2023,2025,2030-2032].codfw.wmnet,logstash[1025,1030,1032].eqiad.wmnet 'systemctl restart opensearch-dashboards' # only hosts where status is failed [16:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:39] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:12:39] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:12:43] dancy: and seen, thanks, I'll take a proper look in a sec [16:12:45] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:13:18] oops I should have excluded 1032 from that restart, but it's a no-op anyway [16:13:31] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:14:09] I'm now seeing all hosts but 1032 healthy, and we should be fully recovered -- anyone still see impact? 
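(Editorial aside.) The restart above was deliberately limited to hosts whose unit was still `failed`, skipping the ones a manual restart or an intervening puppet run had already fixed. The selection step can be sketched like this; the state map is illustrative, as if collected via `systemctl is-failed opensearch-dashboards`:

```python
def hosts_to_restart(unit_state_by_host):
    """Hosts whose opensearch-dashboards unit is still 'failed'
    (skip ones already fixed by hand or by a puppet run)."""
    return sorted(h for h, state in unit_state_by_host.items() if state == "failed")

# Illustrative states, not the actual fleet snapshot from the incident
states = {
    "logstash1023.eqiad.wmnet": "active",   # restarted by hand earlier
    "logstash1025.eqiad.wmnet": "failed",
    "logstash2030.codfw.wmnet": "failed",
}
print(hosts_to_restart(states))
```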
[16:14:51] awesome thank you rzl [16:15:11] thank you, rzl! [16:15:25] Thanks! That was stressful [16:16:11] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:16:24] I do wonder how `upgrade-phatality.sh` ever worked before. [16:16:34] great question [16:16:57] RESOLVED: ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:17:19] Last change to it was in 2022. Maybe it hasn't worked since then. :-) [16:18:06] hmm.. doesn't look like sudo is even required for the list command. [16:18:40] yeah I just noticed the same [16:18:47] I'll update the script. [16:19:09] cool -- if you do end up wanting to merge the sudoers change let me know [16:21:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073256 [16:22:18] !log ebernhardson@deploy1003 Started deploy [airflow-dags/search@5ad6710]: standardize created file permissions [16:22:41] !log ebernhardson@deploy1003 Finished deploy [airflow-dags/search@5ad6710]: standardize created file permissions (duration: 00m 22s) [16:23:51] dancy: LGTM, ready for me to merge it or do you want any other review first?
[16:24:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:25:00] rzl: Merge please [16:25:24] widespread puppet failures in eqiad [16:25:30] I suspect it is puppetserver1002 acting up again [16:25:41] yeah let's get that figured out first but then I'll go ahead [16:26:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69162 and previous config saved to /var/cache/conftool/dbconfig/20240916-162618-arnaudb.json [16:26:23] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [16:26:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69163 and previous config saved to /var/cache/conftool/dbconfig/20240916-162623-arnaudb.json [16:26:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69164 and previous config saved to /var/cache/conftool/dbconfig/20240916-162629-arnaudb.json [16:26:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69165 and previous config saved to /var/cache/conftool/dbconfig/20240916-162633-arnaudb.json [16:26:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69166 and previous config saved to /var/cache/conftool/dbconfig/20240916-162638-arnaudb.json [16:26:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69167 and previous config saved to /var/cache/conftool/dbconfig/20240916-162644-arnaudb.json [16:26:49] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69168 and previous config saved to /var/cache/conftool/dbconfig/20240916-162649-arnaudb.json [16:27:16] herron: Just to make sure I understand what you said on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073253, all calls to opensearch-dashboards-plugin should pass `--allow-root` ? [16:27:41] sukhe: I'm just picking up context, is that https://phabricator.wikimedia.org/T373527? [16:27:47] looking at puppet failures, seeing a lot of connection issues to puppetserver1003 [16:27:58] dancy: yes afaik, based on the removes we ran just today. remove errored out at first without the flag [16:28:09] ok will do [16:28:20] dancy: ty! [16:28:39] rzl: swfrench-wmf: seeing both 1002 and 1003 and also noticed that puppet is disabled on these hosts [16:28:42] jhathaway: ^ [16:28:52] dancy: wait no I'm wrong [16:28:58] rzl: I don't think it is thrashing in this case though but I can be wrong [16:29:03] * dancy waits. [16:29:16] sukhe: ah, interesting - ack [16:29:22] just restarted the puppetservers, to test strict variables, errors should recover, but if they don't I will revert [16:29:37] thanks jhathaway [16:29:41] thanks sukhe for the ping [16:29:44] <3 [16:29:48] awesome [16:30:35] dancy: yeah sorry about that the sudo isn't running as root so nevermind me! [16:30:40] ah, makes sense. [16:31:12] herron: Please add your +1 if you're cool w/ the change [16:32:36] dancy: +1'd! [16:33:57] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [16:34:11] Thanks! [16:35:33] RECOVERY - Host gerrit1004 is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms [16:36:03] jhathaway: I'm interested and a little unsettled that it was 1002 and 1003 at the same time this time [16:37:17] rzl: sorry haven't fully grokked the back log, what happened at the same time? 
just that both puppetserver1002 and 1003 started thrashing at around the same time [16:38:25] where previously we'd seen it for individual hosts AIUI [16:38:36] not sure if that makes it a coincidence or something cascadey [16:39:09] (no need to dig into the phatality stuff in the backlog, it's causally unrelated) [16:39:47] I restarted puppetserver on 1002 & 1003 at around the same time, perhaps a diff of 20secs [16:40:23] *oh* I misunderstood, I thought you restarted them because of the errors [16:40:36] but no, we were just getting transient errors because they were mid-restart [16:40:42] never mind that comment then :) [16:41:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69169 and previous config saved to /var/cache/conftool/dbconfig/20240916-164124-arnaudb.json [16:41:29] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [16:41:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69170 and previous config saved to /var/cache/conftool/dbconfig/20240916-164129-arnaudb.json [16:41:32] no prob, I shouldn't have announced more widely, I didn't know a restart would generate that many failures, seems like there should be a more graceful method [16:41:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69171 and previous config saved to /var/cache/conftool/dbconfig/20240916-164134-arnaudb.json [16:41:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69172 and previous config saved to /var/cache/conftool/dbconfig/20240916-164139-arnaudb.json [16:41:40] okay in that case I'm going to go ahead and merge dancy's patch wrt the previous outage [16:41:45] !log arnaudb@cumin1002 dbctl commit (dc=all):
'db2227 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69173 and previous config saved to /var/cache/conftool/dbconfig/20240916-164144-arnaudb.json [16:41:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69174 and previous config saved to /var/cache/conftool/dbconfig/20240916-164149-arnaudb.json [16:41:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69175 and previous config saved to /var/cache/conftool/dbconfig/20240916-164154-arnaudb.json [16:41:57] PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:43:40] I am not sure yet why it's down but that is NOT the production gerrit, it's in setup [16:44:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:45:58] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [16:48:26] rzl: I'm going to take a break for a bit. Can I schedule a time with you to retry the prior deployment? 11am pacific? 
[16:48:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [16:49:02] dancy: sure, works for me -- I also just ran puppet on logstash1032, still depooled, so you can test there at will [16:49:13] ah good I'll try that right now [16:54:26] !log dancy@deploy1003 Installing scap version "4.102.0" for 211 hosts [16:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:56:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69176 and previous config saved to /var/cache/conftool/dbconfig/20240916-165630-arnaudb.json [16:56:35] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [16:56:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69177 and previous config saved to /var/cache/conftool/dbconfig/20240916-165635-arnaudb.json [16:56:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69178 and previous config saved to /var/cache/conftool/dbconfig/20240916-165640-arnaudb.json [16:56:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69179 and previous config saved to /var/cache/conftool/dbconfig/20240916-165645-arnaudb.json [16:56:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69180 and previous config saved to /var/cache/conftool/dbconfig/20240916-165650-arnaudb.json [16:56:56] !log arnaudb@cumin1002 
dbctl commit (dc=all): 'db2237 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69181 and previous config saved to /var/cache/conftool/dbconfig/20240916-165655-arnaudb.json [16:57:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69182 and previous config saved to /var/cache/conftool/dbconfig/20240916-165700-arnaudb.json [16:58:40] !log dancy@deploy1003 Installation of scap version "4.102.0" completed for 211 hosts [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1700) [17:00:04] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1700). [17:03:10] !log dancy@deploy1003 Installing scap version "4.101.3" for 211 hosts [17:03:29] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [17:05:13] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [17:06:03] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.529 second response time https://wikitech.wikimedia.org/wiki/Docker [17:08:03] !log dancy@deploy1003 Installing scap version "4.101.3" for 1 hosts [17:11:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69183 and previous config saved to /var/cache/conftool/dbconfig/20240916-171136-arnaudb.json [17:11:40] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [17:11:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 
75%: T374623', diff saved to https://phabricator.wikimedia.org/P69184 and previous config saved to /var/cache/conftool/dbconfig/20240916-171140-arnaudb.json [17:11:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69185 and previous config saved to /var/cache/conftool/dbconfig/20240916-171146-arnaudb.json [17:11:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69186 and previous config saved to /var/cache/conftool/dbconfig/20240916-171150-arnaudb.json [17:11:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69187 and previous config saved to /var/cache/conftool/dbconfig/20240916-171155-arnaudb.json [17:12:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69188 and previous config saved to /var/cache/conftool/dbconfig/20240916-171201-arnaudb.json [17:12:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69189 and previous config saved to /var/cache/conftool/dbconfig/20240916-171206-arnaudb.json [17:12:55] !log dancy@deploy1003 Installing scap version "4.101.3" for 2 hosts [17:14:32] !log dancy@deploy1003 Installation of scap version "4.101.3" completed for 2 hosts [17:16:22] !log dancy@deploy1003 Started deploy [releng/phatality@b1a2a70]: testing [17:16:27] !log dancy@deploy1003 Finished deploy [releng/phatality@b1a2a70]: testing (duration: 00m 05s) [17:17:17] rzl: Initial testing on logstash1032 looks good. I'll regroup with you at 11 for full deployment. 
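(Editorial aside.) The arnaudb dbctl entries throughout this log walk each replica back in through a fixed ramp (2 → 4 → 8 → 16 → 25 → 50 → 75 → 100%), committing every step for every host before moving to the next percentage. A sketch of that pattern; the step list is read directly off the log lines, and the actual sre cookbook may compute its schedule differently:

```python
# Percentages observed in the dbctl commit messages above
RAMP = [2, 4, 8, 16, 25, 50, 75, 100]

def repool_plan(hosts, task):
    """One commit message per (step, host), mirroring the ordering in the log."""
    return [f"{h} (re)pooling @ {pct}%: {task}" for pct in RAMP for h in hosts]

plan = repool_plan(["db2221", "db2222"], "T374623")
print(plan[0])
print(len(plan))
```

Ramping all hosts through each step together, rather than one host at a time to 100%, keeps the extra load spread across the whole set at every stage.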
[17:21:40] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@5ad6710]: (no justification provided)
[17:22:25] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@5ad6710]: (no justification provided) (duration: 00m 44s)
[17:26:31] dancy: sgtm, thanks!
[17:26:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69190 and previous config saved to /var/cache/conftool/dbconfig/20240916-172641-arnaudb.json
[17:26:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69191 and previous config saved to /var/cache/conftool/dbconfig/20240916-172646-arnaudb.json
[17:26:49] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623
[17:26:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69192 and previous config saved to /var/cache/conftool/dbconfig/20240916-172651-arnaudb.json
[17:26:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69193 and previous config saved to /var/cache/conftool/dbconfig/20240916-172656-arnaudb.json
[17:27:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69194 and previous config saved to /var/cache/conftool/dbconfig/20240916-172701-arnaudb.json
[17:27:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69195 and previous config saved to /var/cache/conftool/dbconfig/20240916-172706-arnaudb.json
[17:27:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69196 and previous config saved to /var/cache/conftool/dbconfig/20240916-172712-arnaudb.json
[17:37:18] so that host gerrit1004 that was alerting as host down.. that should not be in monitoring at all
[17:37:46] the rename cookbook should have taken care of that PLUS we already manually ran a 'puppet node clean' when it was still in puppetdb
[17:37:55] somehow it's still there.. not sure why
[17:38:12] spooky
[17:38:40] also ran the clean command on both puppetmaster and puppetserver.. so yea..
[17:39:48] maybe I'll go to alert* and manually delete it from Icinga config and run puppet to see if it comes back or not
[17:39:49] mutante: seems like puppetdb still has the node
[17:39:54] https://puppetboard.wikimedia.org/node/gerrit1004.wikimedia.org
[17:40:38] sukhe: but also the dates for catalog run are over a week ago
[17:40:46] on that link
[17:41:02] yeah. but if you try say gerrit1005, it complains about the node not being there at all
[17:41:07] and when I checked for "how to delete from puppetdb" it said "node clean". right?
[17:41:08] so it probably is still somewhere?
[17:41:24] would you know other ways to delete from the db?
[17:42:05] this host was renamed with the rename cookbook so it seems like a bug
[17:42:11] not off-hand. but maybe we can look at what the decommission cookbook does?
[17:42:21] true, let me do that
[17:42:59] I thought I did that.. but not sure now
[17:43:08] puppet node clean
[17:43:11] puppet node deactivate
[17:43:14] https://doc.wikimedia.org/spicerack/v8.8.0/_modules/spicerack/puppet.html#PuppetServer.delete
[17:43:29] also apparently, both on server and master
[17:43:32] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/decommission.py
[17:43:35] puppet_master.delete(fqdn)
[17:43:36] puppet_server.delete(fqdn)
[17:43:39] ah, good call to check the source
[17:43:41] which calls the spicerack function above
[17:43:44] trying
[17:43:54] mutante: is it possible that you ran the rename before https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1071588 was merged?
[17:44:14] I did not run the rename, dcops did. but checking
[17:44:36] (guessing by the dates on puppetboard, that sounds plausible)
[17:44:49] swfrench-wmf: yes, it was before that fix :)
[17:45:05] well, that is great, no need for a new bug report or wondering, yay
[17:45:08] thanks
[17:45:55] Submitted 'deactivate node' for gerrit1004.wikimedia.org with UUID 7fb4f744-f07d-448e-ad5c-37539b1f334c
[17:46:13] no problem - all thanks goes to c.laime for finding and fixing that (all the renames we've been doing for wikikube workers shook out a lot of interesting things)
[17:46:19] mutante: nice!
[17:46:31] so I had done "clean" but not "deactivate" and now done on both master and server
[17:46:37] thanks all
[17:46:48] checking icinga
[17:49:35] yea, I can see puppet removing the icinga config snippets
[17:51:29] (CR) Jgreen: [C:+2] frack: remove fraban2001 from dns for decommissioning [dns] - https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741) (owner: Dwisehaupt)
[17:51:32] ops-eqiad, SRE, collaboration-services, DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10150402 (Dzahn) >>! In T372817#10129107, @MoritzMuehlenhoff wrote: > @Dzahn gerrit1004 is still in puppetdb: https://puppetboard.wikimedia.org/...
[17:52:02] (PS2) Dwisehaupt: frack: remove fraban2001 from dns for decommissioning [dns] - https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741)
[17:52:49] (CR) Dzahn: "thanks for this. also ran into it with a renamed host and was wondering for a bit." [cookbooks] - https://gerrit.wikimedia.org/r/1071588 (https://phabricator.wikimedia.org/T374351) (owner: Clément Goubert)
[17:53:53] (Abandoned) Dzahn: network: introduce a list of friendly networks [puppet] - https://gerrit.wikimedia.org/r/1069387 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[17:55:17] (PS3) Dwisehaupt: frack: remove frban2001 from dns for decommissioning [dns] - https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741)
[17:56:48] (CR) Dwisehaupt: "recheck" [dns] - https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741) (owner: Dwisehaupt)
[18:01:10] dancy: standing by, no rush
[18:02:51] ops-codfw, DC-Ops, decommission-hardware, fundraising-tech-ops, Patch-For-Review: decommission frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374741#10150411 (Dwisehaupt) a: Dwisehaupt→None
[18:05:17] rzl: Retrying
[18:05:45] !log dancy@deploy1003 Started deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836
[18:06:31] !log dancy@deploy1003 Finished deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 (duration: 00m 46s)
[18:06:53] looking good so far
[18:06:59] !log dancy@deploy1003 Started deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836
[18:07:04] !log dancy@deploy1003 Finished deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 (duration: 00m 04s)
[18:07:20] (CR) Volans: [C:+1] "LGTM, nice catch!" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:08:29] https://www.irccloud.com/pastebin/QO2CnBa3/
[18:09:11] (CR) Ssingh: "One question: you should not have been hitting dns2006 when it was unreachable during this period. It was depooled for all services, so sh" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:09:17] Seems like `restart_dashboards` should not run if `install_zip` fails.. I'll file a ticket for that issue.
[18:09:48] oh, makes sense
[18:10:05] I saw the service restart and figured that was good news :P
[18:10:29] (CR) Dzahn: [C:-1] "This is outdated since meanwhile we switched from counting packets to counting connections." [puppet] - https://gerrit.wikimedia.org/r/1059156 (owner: Dzahn)
[18:11:58] (CR) Dzahn: [C:-1] "yep, waiting for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072690" [puppet] - https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[18:12:17] (CR) Dzahn: [C:-1] "needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072690 first" [puppet] - https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[18:12:34] (CR) Ssingh: "Thanks! Is it fine to abandon this then?" [puppet] - https://gerrit.wikimedia.org/r/1059156 (owner: Dzahn)
[18:13:02] (PS13) Dzahn: phabricator: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677)
[18:13:18] (CR) Dzahn: "need to scheduled a downtime for the needed reboot" [puppet] - https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[18:14:31] (CR) Ssingh: "https://sal.toolforge.org/log/16DX5pEBFFSCpsJztaYX the exact time when it was depooled if it helps! (Since I am not sure when the cookbook" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:16:40] (PS1) Volans: mysql_legacy: small fixes [software/spicerack] - https://gerrit.wikimedia.org/r/1073274
[18:16:47] (CR) Volans: [C:+2] mysql_legacy: instance improvements (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[18:17:17] (PS9) Volans: sre.switchdc.databases: new cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351)
[18:18:43] (CR) Volans: "Replies inline, CI failure is because the change in spicerack has not yet been released." [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[18:19:59] (CR) Dzahn: [C:-1] "per IRC chat: I will amend" [puppet] - https://gerrit.wikimedia.org/r/1059156 (owner: Dzahn)
[18:21:12] rzl: Can you `chown -R opensearch-dashboards: /usr/share/opensearch-dashboards/plugins/phatality` on all the logstash hosts?
[18:22:05] yep
[18:22:34] that won't get put back by puppet or anything?
[18:24:14] I think it's the repair operations that were run today that caused them to be owned by root
[18:24:27] ah got it
[18:24:45] !log rzl@cumin1002:~$ sudo cumin O:logging::opensearch::collector 'chown -R opensearch-dashboards: /usr/share/opensearch-dashboards/plugins/phatality'
[18:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:24] (CR) Scott French: "Thanks for the review, Riccardo!" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:27:39] !log dancy@deploy1003 Started deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836
[18:27:43] !log dancy@deploy1003 Finished deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 (duration: 00m 04s)
[18:27:51] !log dancy@deploy1003 Started deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836
[18:27:57] !log dancy@deploy1003 Finished deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 (duration: 00m 06s)
[18:28:06] SRE, SRE-tools, Infrastructure-Foundations: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989#10150478 (CDanis) Today we saw another good use case for `sudo_pair`: while troubleshooting and firefighting a #phatality deploy gone wrong (T374880), several...
[18:28:19] OK.. New stuff should be fully deployed now.. No errors. Thanks a lot rzl and herron!
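Editor's note: the gerrit1004 cleanup discussed above came down to running both `puppet node clean` and `puppet node deactivate` for the FQDN, and doing so on both the puppetmaster and the puppetserver, mirroring what the decommission cookbook does via spicerack's `puppet_master.delete(fqdn)` / `puppet_server.delete(fqdn)`. A minimal sketch of that command set; the backend names and helper function here are illustrative, not cookbook code.

```python
# Sketch of the two-command, two-backend puppet node removal described
# in the conversation above. "clean" drops the cert and stored data;
# "deactivate" tells puppetdb to stop reporting the node. Backend names
# are illustrative placeholders.
def node_delete_commands(fqdn):
    """Return the per-backend command lists needed to fully drop a node."""
    cmds = {}
    for backend in ("puppetmaster", "puppetserver"):
        cmds[backend] = [
            ["puppet", "node", "clean", fqdn],
            ["puppet", "node", "deactivate", fqdn],
        ]
    return cmds

cmds = node_delete_commands("gerrit1004.wikimedia.org")
```

As the log shows, running only "clean" leaves the node visible in puppetdb (and hence in Icinga); the "deactivate" step is what finally removes it.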
[18:28:33] sweet
[18:28:39] thanks for the quick response dancy
[18:29:08] (CR) CI reject: [V:-1] sre.switchdc.databases: new cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[18:30:12] (PS3) Dzahn: durum: include throttling class, enable it on durum2001, accept/log only [puppet] - https://gerrit.wikimedia.org/r/1059156
[18:30:23] (CR) CI reject: [V:-1] durum: include throttling class, enable it on durum2001, accept/log only [puppet] - https://gerrit.wikimedia.org/r/1059156 (owner: Dzahn)
[18:31:07] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10150492 (phaultfinder)
[18:31:50] (PS1) Jdlrobson: Deploy Vector 2022 on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255)
[18:32:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255) (owner: Jdlrobson)
[18:33:31] (PS4) Dzahn: durum: include throttling class, enable it on durum2001, accept/log only [puppet] - https://gerrit.wikimedia.org/r/1059156
[18:36:13] (CR) Ssingh: "Ah thank you. Yeah, I guess we should update this, given:" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:41:38] (CR) Ssingh: "So I think this is what it should look like (volans is in CC and can comment if he thinks this make sense):" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:42:32] (PS2) Jdlrobson: Deploy Vector 2022 on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255)
[18:47:32] rzl: dancy: on -observability channel there were alerts for the dashboards / logstash hosts. some resolved, some not yet
[18:49:06] (PS1) Jdlrobson: Disable quick surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743)
[18:49:22] mutante: they all look resolved to me, which ones do you see still open?
[18:49:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743) (owner: Jdlrobson)
[18:49:53] SRE, SRE-Access-Requests, LDAP-Access-Requests: Vacation coverage for Katie Francis (route NDA requests to Rachel until September 27) - https://phabricator.wikimedia.org/T374673#10150557 (Dzahn)
[18:51:35] rzl: ehm, no you are right, they all resolved now. I was just confused by the order of alerts and the 159 :p other active alerts :)
[18:51:43] 👍
[19:02:46] (CR) Ssingh: sre.dns.admin: add guardrails for depool of sites/resources (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1064042 (owner: Ssingh)
[19:03:00] (PS4) Ssingh: sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - https://gerrit.wikimedia.org/r/1064042
[19:10:15] SRE, SRE-Access-Requests, LDAP-Access-Requests: Vacation coverage for Katie Francis (route NDA requests to Rachel until September 27) - https://phabricator.wikimedia.org/T374673#10150590 (Dzahn)
[19:10:28] SRE, SRE-Access-Requests, LDAP-Access-Requests: Vacation coverage for Katie Francis (route NDA requests to Rachel until September 30) - https://phabricator.wikimedia.org/T374673#10150591 (Dzahn)
[19:14:21] (PS1) JHathaway: k8s::kubelet: fix deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1073281
[19:15:59] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1073281 (owner: JHathaway)
[19:18:15] (PS1) AOkoth: vrts: change primary host [puppet] - https://gerrit.wikimedia.org/r/1073283 (https://phabricator.wikimedia.org/T373420)
[19:29:14] (PS3) Jdlrobson: Deploy Vector 2022 on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255)
[19:29:14] (PS2) Jdlrobson: Disable quick surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743)
[19:32:41] (PS2) JHathaway: k8s::kubelet: fix deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1073281
[19:32:48] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1073281 (owner: JHathaway)
[19:35:05] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10150678 (phaultfinder)
[19:53:13] (PS1) JHathaway: puppet8: enable strict mode [puppet] - https://gerrit.wikimedia.org/r/1073284 (https://phabricator.wikimedia.org/T372664)
[19:53:29] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1073284 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T2000). nyaa~
[20:00:04] Krinkle and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:09] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10150757 (phaultfinder)
[20:02:30] o/
[20:04:48] (PS4) Jdlrobson: Deploy Vector 2022 on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255)
[20:04:48] (PS3) Jdlrobson: Disable quick surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743)
[20:06:49] Go ahead, I might do mine later. need to be afk for a bit
[20:14:39] We're gonna do a quick deploy of Jdlrobson's two patches
[20:15:17] (CR) TrainBranchBot: [C:+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255) (owner: Jdlrobson)
[20:15:18] (CR) TrainBranchBot: [C:+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743) (owner: Jdlrobson)
[20:16:02] (Merged) jenkins-bot: Deploy Vector 2022 on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255) (owner: Jdlrobson)
[20:16:06] (Merged) jenkins-bot: Disable quick surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743) (owner: Jdlrobson)
[20:16:19] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1073277|Deploy Vector 2022 on small wikis (T374255)]], [[gerrit:1073279|Disable quick surveys (T374743)]]
[20:16:24] T374255: Deploy Vector 2022 on small wikis - https://phabricator.wikimedia.org/T374255
[20:16:24] T374743: Disable quick surveys for experiments - https://phabricator.wikimedia.org/T374743
[20:17:06] I'm eating some really good leftover sushi - it's a spicy tuna roll with albacore on top and a bit of garlic butter
[20:17:20] In case anyone was wondering
[20:18:14] haha
[20:18:19] Sounds great
[20:18:34] !log toyofuku@deploy1003 jdlrobson, toyofuku: Backport for [[gerrit:1073277|Deploy Vector 2022 on small wikis (T374255)]], [[gerrit:1073279|Disable quick surveys (T374743)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:18:43] Jdlrobson: ready for you to test!
[20:19:10] on it
[20:20:18] @toyofuku good to sync!
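Editor's note: the backport window above follows a fixed sequence visible in the log: `scap backport` approves the changes, jenkins-bot merges them, the config is synced to the mwdebug test servers, the deployer confirms after manual testing ("good to sync"), and only then does the full sync-world run. A simplified ordering sketch; the step names are illustrative labels, not scap internals.

```python
# Simplified ordering of a scap backport deploy, as seen in the log.
# Step names are illustrative, not scap's actual state machine.
BACKPORT_STEPS = [
    "approve",           # TrainBranchBot leaves C:+2 "Approved by ... using scap backport"
    "ci-merge",          # jenkins-bot merges the change
    "sync-testservers",  # change reaches the mwdebug hosts for verification
    "confirm",           # deployer tests and replies "good to sync"
    "sync-world",        # full production sync
]

def next_step(current):
    """Return the step after `current`, or None once sync-world is done."""
    i = BACKPORT_STEPS.index(current)
    return BACKPORT_STEPS[i + 1] if i + 1 < len(BACKPORT_STEPS) else None
```

The testserver checkpoint is the safety gate: nothing reaches all of production until a human has exercised the change on mwdebug.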
[20:20:23] yeet
[20:20:26] !log toyofuku@deploy1003 jdlrobson, toyofuku: Continuing with sync
[20:21:58] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:24:51] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:25:06] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10150822 (phaultfinder)
[20:25:12] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073277|Deploy Vector 2022 on small wikis (T374255)]], [[gerrit:1073279|Disable quick surveys (T374743)]] (duration: 08m 53s)
[20:25:18] T374255: Deploy Vector 2022 on small wikis - https://phabricator.wikimedia.org/T374255
[20:25:18] T374743: Disable quick surveys for experiments - https://phabricator.wikimedia.org/T374743
[20:25:23] All done!
[20:25:24] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:25:26] Thank you everyone
[20:27:21] toyofuku: thank you for the garlic butter inspiration, that never would have occurred to me
[20:28:38] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add frack new switches - pt1979@cumin2002"
[20:28:53] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10150834 (Papaul)
[20:28:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add frack new switches - pt1979@cumin2002"
[20:28:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:42:02] (PS2) JHathaway: puppet8: enable strict mode [puppet] - https://gerrit.wikimedia.org/r/1073284 (https://phabricator.wikimedia.org/T372664)
[20:42:21] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:42:22] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1073284 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[20:43:07] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10150882 (phaultfinder)
[20:47:30] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add frack new switches - pt1979@cumin2002"
[20:47:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add frack new switches - pt1979@cumin2002"
[20:47:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:50:09] (CR) JHathaway: [C:+2] puppet8: enable strict mode [puppet] - https://gerrit.wikimedia.org/r/1073284 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[20:53:24] !log reloading puppetserver to enable strict mode
[20:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[20:59:58] (PS1) Btullis: Move the misc_crons dumper role from snapshot1017 to snapshot1016 [puppet] - https://gerrit.wikimedia.org/r/1073289 (https://phabricator.wikimedia.org/T366555)
[21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T2100).
[21:01:20] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4000/co" [puppet] - https://gerrit.wikimedia.org/r/1073289 (https://phabricator.wikimedia.org/T366555) (owner: Btullis)
[21:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:05:29] (CR) Btullis: [V:+1] "I haven't tried this technique of role switching before, but I'm hoping it will allow us to reboot snapshot1017 without interrupting the m" [puppet] - https://gerrit.wikimedia.org/r/1073289 (https://phabricator.wikimedia.org/T366555) (owner: Btullis)
[21:09:46] (PS1) CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891)
[21:10:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:19:09] (CR) Scott French: "Thanks for the pointer to where something similar has been done elsewhere, @ssingh@wikimedia.org!" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[21:23:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:24:33] (PS1) Scott French: sre.switchdc.mediawiki: show TTL sleep end time [cookbooks] - https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047)
[21:28:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:34:18] (PS2) Scott French: sre.switchdc.mediawiki: show TTL sleep end time [cookbooks] - https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047)
[21:38:13] (CR) Scott French: "@effie@wikimedia.org, this is follow up from our discussion during the live-test earlier today." [cookbooks] - https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[21:47:19] (CR) CI reject: [V:-1] sre.switchdc.mediawiki: show TTL sleep end time [cookbooks] - https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[21:55:05] (PS1) JHathaway: mydumper: rename metaparam [puppet] - https://gerrit.wikimedia.org/r/1073292
[21:57:06] SRE, Infrastructure-Foundations, Puppet-Infrastructure: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10151015 (jhathaway) Open→Resolved a: jhathaway enabled in production, closing
[21:57:34] (CR) CI reject: [V:-1] mydumper: rename metaparam [puppet] - https://gerrit.wikimedia.org/r/1073292 (owner: JHathaway)
[21:57:57] (PS3) Scott French: sre.switchdc.mediawiki: show TTL sleep end time [cookbooks] - https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047)
[22:30:05] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10151056 (phaultfinder)
[22:49:46] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[22:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:06] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10151170 (phaultfinder)
[23:16:24] (PS1) Stoyofuku-wmf: Deploy new donate link location to pilot wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1073297 (https://phabricator.wikimedia.org/T373585)
[23:16:49] (PS1) Stoyofuku-wmf: Deploy new donate link location to pilot wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1073297 (https://phabricator.wikimedia.org/T373585)
[23:17:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073297 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[23:25:13] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10151188 (phaultfinder)
[23:27:13] (CR) Bartosz Dziewoński: Improve $wgFooterIcons override, remove $wmgWikimediaIcon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1071712 (owner: Bartosz Dziewoński)
[23:27:19] (PS3) Bartosz Dziewoński: Improve $wgFooterIcons override, simplify $wmgWikimediaIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/1071712
[23:38:22] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1073300
[23:38:22] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1073300 (owner: TrainBranchBot)
[23:58:23] ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T374897 (phaultfinder) NEW