[00:04:33] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072928 (owner: TrainBranchBot)
[00:10:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:15:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[00:55:08] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10147257 (phaultfinder)
[01:10:13] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10147258 (phaultfinder)
[02:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:39:13] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:13] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:36:56] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:36:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:37:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.311 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:37:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52630 bytes in 0.420 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1125.eqiad.wmnet with reason: testing node
[06:05:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1125.eqiad.wmnet with reason: testing node
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:24:03] !log installing git security updates
[06:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 217, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:40:52] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:41:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 217, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:42:32] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 39 probes of 770 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:47:34] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 33 probes of 770 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:54:52] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:55:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:59:13] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:43] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10147386 (Vgutierrez) gentle reminder, this is still waiting for @VPuffetMichel approval
[07:15:00] (PS1) Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1073035 (https://phabricator.wikimedia.org/T374804)
[07:16:52] (PS1) Gerrit maintenance bot: mariadb: Promote db2123 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1073037 (https://phabricator.wikimedia.org/T374805)
[07:17:18] (PS1) Gerrit maintenance bot: mariadb: Promote db2129 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1073038 (https://phabricator.wikimedia.org/T374806)
[07:17:20] (CR) Filippo Giunchedi: [C:+2] icinga: remove frban2001 for decommissioning [puppet] - https://gerrit.wikimedia.org/r/1072813 (https://phabricator.wikimedia.org/T374741) (owner: Dwisehaupt)
[07:18:00] (PS1) Gerrit maintenance bot: mariadb: Promote db2220 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1073039 (https://phabricator.wikimedia.org/T374807)
[07:22:45] ops-codfw, SRE, DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374808 (ops-monitoring-bot) NEW
[07:23:02] ops-codfw, SRE, collaboration-services, DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10147444 (ABran-WMF) [] db2129: cm s6 T374806→switchback [] db2140: m s4 T374804 [] db2218: m s7 T374807
[07:23:54] !log elukey@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw
[07:23:57] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10147440 (ABran-WMF) [] db2213: m s5 T374805 [] db2214: m s6 T374806
[07:24:10] ops-codfw, SRE, SRE-swift-storage, collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10147456 (ABran-WMF) [] db2220: cm s7 T374807→switchback
[07:27:51] (PS1) Elukey: Set puppet7 for chartmuseum2001 [puppet] - https://gerrit.wikimedia.org/r/1073107 (https://phabricator.wikimedia.org/T331969)
[07:29:49] (CR) Elukey: [C:+2] services: update thumbor-eqiad to poolcounter1006 [deployment-charts] - https://gerrit.wikimedia.org/r/1072716 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[07:33:04] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync
[07:33:09] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[07:33:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s5 T374805
[07:33:20] T374805: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T374805
[07:33:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T374805
[07:35:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2123 from API/vslow/dump T374805', diff saved to https://phabricator.wikimedia.org/P69126 and previous config saved to /var/cache/conftool/dbconfig/20240916-073521-arnaudb.json
[07:36:10] (CR) Brouberol: flink-app: customize calico label selector (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: Bking)
[07:39:50] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[07:40:28] (CR) Elukey: [C:+2] Set puppet7 for chartmuseum2001 [puppet] - https://gerrit.wikimedia.org/r/1073107 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[07:40:30] (CR) Arnaudb: [C:+2] mariadb: Promote db2123 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1073037 (https://phabricator.wikimedia.org/T374805) (owner: Gerrit maintenance bot)
[07:41:02] go for it elukey
[07:41:06] arnaudb: ack!
[07:42:12] neat
[07:42:19] thanks
[07:42:34] !log Starting s5 codfw failover from db2213 to db2123 - T374805
[07:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:38] T374805: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T374805
[07:43:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2123 to s5 primary T374805', diff saved to https://phabricator.wikimedia.org/P69128 and previous config saved to /var/cache/conftool/dbconfig/20240916-074312-arnaudb.json
[07:43:32] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[07:45:10] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host chartmuseum2001.codfw.wmnet with OS bookworm
[07:45:23] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969#10147504 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm
[07:47:39] (CR) Elukey: [V:+2 C:+2] Update the Debian changelog to build on Bookworm [debs/chartmuseum] - https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[07:48:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s6 T374806
[07:48:44] T374806: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T374806
[07:49:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s6 T374806
[07:51:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2213 T374805', diff saved to https://phabricator.wikimedia.org/P69129 and previous config saved to /var/cache/conftool/dbconfig/20240916-075059-arnaudb.json
[07:51:05] T374805: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T374805
[07:53:50] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147518 (MoritzMuehlenhoff)
[07:54:08] (CR) CI reject: [V:-1] Update the Debian changelog to build on Bookworm [debs/chartmuseum] - https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[07:54:13] FIRING: [2x] JobUnavailable: Reduced availability for job chartmuseum in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:56:49] (CR) Muehlenhoff: [C:+2] CAS: Disable memcached on idp-test [puppet] - https://gerrit.wikimedia.org/r/1070899 (https://phabricator.wikimedia.org/T367487) (owner: Muehlenhoff)
[07:59:37] (CR) Arnaudb: [C:+2] mariadb: Promote db2129 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1073038 (https://phabricator.wikimedia.org/T374806) (owner: Gerrit maintenance bot)
[08:01:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2129 to s6 primary T374806', diff saved to https://phabricator.wikimedia.org/P69130 and previous config saved to /var/cache/conftool/dbconfig/20240916-080132-arnaudb.json
[08:01:37] T374806: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T374806
[08:03:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2214 T374806', diff saved to https://phabricator.wikimedia.org/P69131 and previous config saved to /var/cache/conftool/dbconfig/20240916-080342-arnaudb.json
[08:07:10] (CR) MVernon: [C:-1] "Hi," [puppet] - https://gerrit.wikimedia.org/r/1071609 (owner: Muehlenhoff)
[08:08:41] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:08:50] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147562 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm executed with errors: - chartmuseum2001 (...
[08:09:09] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:09:20] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147563 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:12:37] (PS1) Filippo Giunchedi: thanos: trim 5m retention to 35w [puppet] - https://gerrit.wikimedia.org/r/1073147 (https://phabricator.wikimedia.org/T351927)
[08:13:29] (CR) DCausse: flink-app: customize calico label selector (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: Bking)
[08:13:45] SRE-swift-storage, Commons: 404 error opening a specific file on Commons - https://phabricator.wikimedia.org/T374773#10147579 (KTT-Commons) Update: As of 16:00 UTC+8, I can now access the file without problem in Hong Kong. Will like to hear if anyone elsewhere still has trouble in accessing the file?
[08:14:11] (CR) Filippo Giunchedi: [C:+2] thanos: trim 5m retention to 35w [puppet] - https://gerrit.wikimedia.org/r/1073147 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi)
[08:14:45] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:14:57] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147582 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm executed with errors: - chartmuseum2001 (...
[08:15:07] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:15:18] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147583 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:18:53] (CR) Volans: [C:+2] test-cookbook: read spicerack config with sudo [puppet] - https://gerrit.wikimedia.org/r/1071810 (owner: Volans)
[08:19:13] FIRING: [2x] JobUnavailable: Reduced availability for job chartmuseum in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:19:22] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:19:32] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147629 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm executed with errors: - chartmuseum2001 (...
[08:24:58] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:25:13] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147634 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:33:04] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:33:20] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147647 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm executed with errors: - chartmuseum2001 (...
[08:37:10] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:37:22] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147668 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm
[08:38:16] (Abandoned) Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[08:38:33] (Abandoned) Btullis: Add some test secrets to an-test-master servers [puppet] - https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[08:38:37] !log bump memory allocation of chartmuseum1001/2001 to 2G (Bookworm fails to install with just 1G) T331969
[08:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:40] T331969: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969
[08:39:09] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10147672 (elukey) Due to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1035854, the VM's RAM was bumped to 2G.
[08:45:26] (PS1) Santiago Faci: MPIC: New deployment (v0.1.5) to production [deployment-charts] - https://gerrit.wikimedia.org/r/1073152 (https://phabricator.wikimedia.org/T361346)
[08:53:00] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on chartmuseum2001.codfw.wmnet with reason: host reimage
[08:53:18] (PS1) Gerrit maintenance bot: Add kge to langlist helper [dns] - https://gerrit.wikimedia.org/r/1073154 (https://phabricator.wikimedia.org/T374813)
[08:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:55:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on chartmuseum2001.codfw.wmnet with reason: host reimage
[08:56:04] jouncebot: next
[08:56:04] In 1 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1000)
[08:57:11] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10147815 (ArthurTaylor) I'm happy to use `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5...
[08:57:21] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10147819 (phaultfinder)
[08:57:25] SRE, Infrastructure-Foundations, netops: Enable BFD on 'core' EBGP peerings from L3 switches to CRs - https://phabricator.wikimedia.org/T374452#10147824 (ayounsi) Not sure it's worth it for direct (short) links. The tradeoff is to rely on an extra protocol, extra config, and adding load on the device...
[08:58:05] SRE-swift-storage, Commons: 404 error opening a specific file on Commons - https://phabricator.wikimedia.org/T374773#10147831 (MatthewVernon) Open→Resolved a:MatthewVernon I've confirmed that both eqiad and codfw swift clusters have this object. They arrived at different times, however:...
[09:01:53] (CR) DCausse: [C:+1] Add ORKG triplestore to WDQS federation allowlist [puppet] - https://gerrit.wikimedia.org/r/1072723 (https://phabricator.wikimedia.org/T366485) (owner: Btullis)
[09:02:34] SRE, Infrastructure-Foundations, netops: ToR server-move Netbox script adding ".0" to end of interface names - https://phabricator.wikimedia.org/T374024#10147850 (ayounsi) @cmooney thanks for your patch ! is there something left to do on this ?
[09:03:14] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[09:03:26] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrading mariadb on clouddb1013 T365424
[09:03:30] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[09:03:42] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrading mariadb on clouddb1013 T365424
[09:03:56] (CR) Elukey: [C:+2] services: add new poolcounter nodes to MW configs [deployment-charts] - https://gerrit.wikimedia.org/r/1072717 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[09:04:06] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s1
[09:04:08] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3
[09:09:00] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10147869 (elukey) @Jhancock.wm you are totally right, thanks a lot! I was able to force PXE on a 10G port setting the the first `RSC-W-66G4` option to `Legacy`. I hope...
[09:09:06] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1013.eqiad.wmnet
[09:12:24] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1013.eqiad.wmnet
[09:12:30] PROBLEM - Host clouddb1013 is DOWN: PING CRITICAL - Packet loss = 100%
[09:12:36] RECOVERY - Host clouddb1013 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms
[09:12:42] PROBLEM - mysqld processes on clouddb1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:12:46] PROBLEM - MariaDB Replica IO: s1 on clouddb1013 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:13:14] PROBLEM - MariaDB Replica IO: s3 on clouddb1013 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:13:14] PROBLEM - MariaDB Replica SQL: s1 on clouddb1013 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:13:30] PROBLEM - MariaDB Replica SQL: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:13:32] PROBLEM - MariaDB read only s1 on clouddb1013 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:13:32] PROBLEM - MariaDB read only wikireplica-s1 on clouddb1013 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:13:34] PROBLEM - MariaDB read only s3 on clouddb1013 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:13:34] PROBLEM - MariaDB read only wikireplica-s3 on clouddb1013 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:14:26] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[09:15:14] RECOVERY - MariaDB Replica SQL: s1 on clouddb1013 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:15:34] RECOVERY - MariaDB read only s1 on clouddb1013 is OK: Version 10.6.19-MariaDB, Uptime 51s, read_only: True, event_scheduler: False, 2013.38 QPS, connection latency: 0.017547s, query latency: 0.000476s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:15:34] RECOVERY - MariaDB read only wikireplica-s1 on clouddb1013 is OK: Version 10.6.19-MariaDB, Uptime 51s, read_only: True, event_scheduler: False, 2036.61 QPS, connection latency: 0.028063s, query latency: 0.000533s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:15:46] RECOVERY - MariaDB Replica IO: s1 on clouddb1013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:16:14] RECOVERY - MariaDB Replica IO: s3 on clouddb1013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:16:32] RECOVERY - MariaDB Replica SQL: s3 on clouddb1013 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:16:36] RECOVERY - MariaDB read only wikireplica-s3 on clouddb1013 is OK: Version 10.6.19-MariaDB, Uptime 46s, read_only: True, event_scheduler: False, 202.07 QPS, connection latency: 0.019260s, query latency: 0.000501s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:16:36] RECOVERY - MariaDB read only s3 on clouddb1013 is OK: Version 10.6.19-MariaDB, Uptime 46s, read_only: True, event_scheduler: False, 200.30 QPS, connection latency: 0.025202s, query latency: 0.000572s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:16:48] RECOVERY - mysqld processes on clouddb1013 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:17:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[09:19:00] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s3
[09:19:09] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s1
[09:21:15] (PS3) Hamish: Configure ContactPage and IPBE contact form on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1072876 (https://phabricator.wikimedia.org/T359998)
[09:21:35] !log copy python3-docker-report from bullseye-wikimedia to bookworm-wikimedia
[09:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:47] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s2
[09:21:50] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s7
[09:22:15] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Upgrading mariadb on clouddb1014 T365424
[09:22:19] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[09:22:30] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Upgrading mariadb on clouddb1014 T365424
[09:25:49] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10147910 (Ladsgroup) That ssh key is your production key not WMCS.
[09:26:06] (PS1) Elukey: debian: update the target distribution to bookworm-wikimedia [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969)
[09:26:11] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1014.eqiad.wmnet
[09:26:39] (CR) Volans: "Inline the 3 quick changes needed to test it on test-s4 as it's not part of the CORE_SECTIONS" [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[09:26:47] SRE, Infrastructure-Foundations, netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#10147916 (ayounsi) Awesome, great to see progress here !
[09:28:31] (PS2) Elukey: debian: update the target distribution to bookworm-wikimedia [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969)
[09:29:29] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1014.eqiad.wmnet
[09:29:34] PROBLEM - Host clouddb1014 is DOWN: PING CRITICAL - Packet loss = 100%
[09:29:34] RECOVERY - Host clouddb1014 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms
[09:29:50] PROBLEM - MariaDB Replica SQL: s7 on clouddb1014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:29:50] PROBLEM - MariaDB Replica IO: s2 on clouddb1014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:30:11] PROBLEM - MariaDB Replica IO: s7 on clouddb1014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:30:11] PROBLEM - MariaDB Replica SQL: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:31:12] RECOVERY - MariaDB Replica SQL: s2 on clouddb1014 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:31:18] SRE, Infrastructure-Foundations, netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#10147929 (aborrero) Open→Resolved a:aborrero it seems there is agreement in the addressing plan. Marking as resolved, will work on {T374712} next.
[09:31:34] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10147944 (ArthurTaylor) Yup. I have it noted as my production key. I don't...
[09:31:50] RECOVERY - MariaDB Replica IO: s2 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:31:50] RECOVERY - MariaDB Replica SQL: s7 on clouddb1014 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:32:12] RECOVERY - MariaDB Replica IO: s7 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:32:29] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002"
[09:34:20] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s7
[09:34:25] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s2
[09:35:09] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Upgrading mariadb on clouddb1015 T365424
[09:35:13] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[09:35:24] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Upgrading mariadb on clouddb1015 T365424
[09:35:29] (CR) CI reject: [V:-1] debian: update the target distribution to bookworm-wikimedia [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[09:36:27] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s4
[09:36:30] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s6
[09:38:16] (CR) Ladsgroup: [C:+1] "I'll deploy it eventually. ooo right now." [mediawiki-config] - https://gerrit.wikimedia.org/r/1071712 (owner: Bartosz Dziewoński)
[09:40:22] (CR) JMeybohm: [C:+1] "fine to ignore lintian IMHO" [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[09:42:01] SRE, Infrastructure-Foundations, netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10147994 (ayounsi) Short term I think if you add `[4Gbps]` to the interface description, LibreNMS will [[ https://docs.librenms.org/Extensions/Interface-Descript...
[09:42:44] !log upload helm3 3.11.3-2 to bookworm-wikimedia
[09:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:15] SRE-swift-storage, Commons: 404 error opening a specific file on Commons (due to inconsistent state between two swift clusters) - https://phabricator.wikimedia.org/T374773#10147996 (Aklapper)
[09:44:33] (CR) Ladsgroup: [C:+2] Add kge to langlist helper [dns] - https://gerrit.wikimedia.org/r/1073154 (https://phabricator.wikimedia.org/T374813) (owner: Gerrit maintenance bot)
[09:45:55] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10148001 (hashar) >>! In T373969#10147944, @ArthurTaylor wrote: > Yup. I h...
[09:47:03] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1015.eqiad.wmnet
[09:49:10] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host chartmuseum2001.codfw.wmnet with OS bookworm
[09:49:19] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10148056 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum2001.codfw.wmnet with OS bookworm completed: - chartmuseum2001 (**PASS**)...
[09:50:28] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1015.eqiad.wmnet
[09:50:36] PROBLEM - MariaDB Replica IO: s6 on clouddb1015 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:50:36] PROBLEM - MariaDB Replica SQL: s6 on clouddb1015 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:50:44] PROBLEM - mysqld processes on clouddb1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:50:44] PROBLEM - MariaDB read only wikireplica-s4 on clouddb1015 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:50:44] PROBLEM - MariaDB read only s4 on clouddb1015 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:50:44] PROBLEM - MariaDB read only s6 on clouddb1015 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:50:44] PROBLEM - MariaDB read only wikireplica-s6 on clouddb1015 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:50:50] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:51:16] PROBLEM - MariaDB Replica SQL: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:51:16] PROBLEM - MariaDB Replica Lag: s6 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:51:16] PROBLEM - MariaDB Replica IO: s4 on clouddb1015 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:16] RECOVERY - MariaDB Replica SQL: s4 on clouddb1015 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:16] RECOVERY - MariaDB Replica IO: s4 on clouddb1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:27] !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw
[09:56:36] RECOVERY - MariaDB Replica SQL: s6 on clouddb1015 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:36] RECOVERY - MariaDB Replica IO: s6 on clouddb1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:44] RECOVERY - mysqld processes on clouddb1015 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:56:44] RECOVERY - MariaDB read only s4 on clouddb1015 is OK: Version 10.6.19-MariaDB, Uptime 52s, read_only: True, event_scheduler: False, 409.16 QPS, connection latency: 0.020505s, query latency: 0.000507s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:56:45] RECOVERY - MariaDB read only s6 on clouddb1015 is OK: Version 10.6.19-MariaDB, Uptime 48s, read_only: True, event_scheduler: False, 1426.47 QPS, connection latency: 0.028046s, query latency: 0.000615s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:56:45] RECOVERY - MariaDB read only wikireplica-s6 on clouddb1015 is OK: Version 10.6.19-MariaDB, Uptime 48s, read_only: True, event_scheduler: False, 1433.62 QPS, connection latency: 0.017873s, query latency: 0.000433s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:56:45] RECOVERY - MariaDB read only wikireplica-s4 on clouddb1015 is OK: Version 10.6.19-MariaDB, Uptime 52s, read_only: True, event_scheduler: False, 457.05 QPS, connection latency: 0.030524s, query latency: 0.000555s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:57:16] RECOVERY - MariaDB Replica Lag: s6 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:59:50] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1000)
[10:00:07] !log elukey@deploy1003 Started scap sync-world: Update network policies to allow the new poolcounter vms.
[10:03:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:26] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s6
[10:03:29] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s4
[10:03:32] !log elukey@deploy1003 Finished scap sync-world: Update network policies to allow the new poolcounter vms. (duration: 04m 35s)
[10:05:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:05:36] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s5
[10:05:40] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s8
[10:06:09] (PS1) Slyngshede: Grant permissions: Hookup LDAP permission granting. [software/bitu] - https://gerrit.wikimedia.org/r/1073162
[10:06:16] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1016 T365424
[10:06:19] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[10:06:31] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1016 T365424
[10:07:43] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Cyndywikime - https://phabricator.wikimedia.org/T374595#10148139 (Vgutierrez) Open→Stalled idp configuration states that `wmf` membership is enough to access superset (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/re...
[10:08:47] (CR) CI reject: [V:-1] Grant permissions: Hookup LDAP permission granting. [software/bitu] - https://gerrit.wikimedia.org/r/1073162 (owner: Slyngshede)
[10:09:18] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Cyndywikime - https://phabricator.wikimedia.org/T374595#10148146 (Vgutierrez) a: Vgutierrez
[10:09:36] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Cyndywikime - https://phabricator.wikimedia.org/T374595#10148149 (Vgutierrez) a: Vgutierrez→Cyndymediawiksim
[10:15:56] (PS1) Arturo Borrero Gonzalez: openstack: codfw1dev: use VXLAN network as the new default for instance launch [puppet] - https://gerrit.wikimedia.org/r/1073163 (https://phabricator.wikimedia.org/T374824)
[10:18:48] (PS1) Elukey: services: remove old poolcounter netpolicies for Thumbor [deployment-charts] - https://gerrit.wikimedia.org/r/1073164 (https://phabricator.wikimedia.org/T332015)
[10:19:51] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10148243 (elukey) The reimage of 2001 went fine, I just repooled it. Let's wait for a day before moving to 1001 so if anything weird comes up, we'll have a quick way to fix (depool 2001). N...
[10:20:35] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10148252 (elukey) a: jhathaway→elukey
[10:22:34] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-analytics-test [puppet] - https://gerrit.wikimedia.org/r/1073166 (https://phabricator.wikimedia.org/T374396)
[10:22:35] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-analytics-product [puppet] - https://gerrit.wikimedia.org/r/1073167 (https://phabricator.wikimedia.org/T374396)
[10:22:36] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-platform-eng [puppet] - https://gerrit.wikimedia.org/r/1073168 (https://phabricator.wikimedia.org/T374396)
[10:22:38] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-research [puppet] - https://gerrit.wikimedia.org/r/1073169 (https://phabricator.wikimedia.org/T374396)
[10:22:40] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-search [puppet] - https://gerrit.wikimedia.org/r/1073170 (https://phabricator.wikimedia.org/T374396)
[10:22:41] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-wmde [puppet] - https://gerrit.wikimedia.org/r/1073171 (https://phabricator.wikimedia.org/T374396)
[10:22:43] (PS1) Brouberol: Upgrade the airflow-dags deb in airflow-analytics [puppet] - https://gerrit.wikimedia.org/r/1073172 (https://phabricator.wikimedia.org/T374396)
[10:22:44] (PS1) Brouberol: Install airflow-dags 2.9.3-py3.10-20240916 by default on all instances [puppet] - https://gerrit.wikimedia.org/r/1073173 (https://phabricator.wikimedia.org/T374396)
[10:23:22] (PS2) Arturo Borrero Gonzalez: openstack: codfw1dev: use VXLAN network as the new default for instance launch [puppet] - https://gerrit.wikimedia.org/r/1073163 (https://phabricator.wikimedia.org/T374824)
[10:23:30] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1073163 (https://phabricator.wikimedia.org/T374824) (owner: Arturo Borrero Gonzalez)
[10:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[10:24:34] (CR) Btullis: "> Encryption is something that is ensured at the s3 storage level..." [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[10:24:50] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-analytics-test [puppet] - https://gerrit.wikimedia.org/r/1073166 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:25:22] (CR) Btullis: [C:+1] "Looks good, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1072546 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[10:25:51] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1016.eqiad.wmnet
[10:27:27] (CR) Hnowlan: [C:+1] services: remove old poolcounter netpolicies for Thumbor [deployment-charts] - https://gerrit.wikimedia.org/r/1073164 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[10:27:42] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-analytics-product [puppet] - https://gerrit.wikimedia.org/r/1073167 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:27:52] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-platform-eng [puppet] - https://gerrit.wikimedia.org/r/1073168 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:28:03] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-research [puppet] - https://gerrit.wikimedia.org/r/1073169 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:28:16] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-search [puppet] - https://gerrit.wikimedia.org/r/1073170 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:28:42] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-wmde [puppet] - https://gerrit.wikimedia.org/r/1073171 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:28:55] (CR) Stevemunene: [C:+1] Upgrade the airflow-dags deb in airflow-analytics [puppet] - https://gerrit.wikimedia.org/r/1073172 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:29:05] (PS6) Brouberol: cloudnative-pg-cluster: enable wal upload / backups to s3 by default [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281)
[10:29:05] (PS2) Brouberol: cloudnative-pg-cluster: setup good defaults allowing a cluster to be restored [deployment-charts] - https://gerrit.wikimedia.org/r/1072546 (https://phabricator.wikimedia.org/T372281)
[10:29:09] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1016.eqiad.wmnet
[10:29:20] PROBLEM - Host clouddb1016 is DOWN: PING CRITICAL - Packet loss = 100%
[10:29:25] (CR) Brouberol: cloudnative-pg-cluster: enable wal upload / backups to s3 by default (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[10:29:30] RECOVERY - Host clouddb1016 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms
[10:29:30] PROBLEM - mysqld processes on clouddb1016 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:29:38] PROBLEM - MariaDB Replica Lag: s8 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:29:44] PROBLEM - MariaDB read only s8 on clouddb1016 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:29:44] PROBLEM - MariaDB read only wikireplica-s8 on clouddb1016 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:29:44] PROBLEM - MariaDB read only s5 on clouddb1016 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:29:44] PROBLEM - MariaDB read only wikireplica-s5 on clouddb1016 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:29:51] (CR) Stevemunene: [C:+1] Install airflow-dags 2.9.3-py3.10-20240916 by default on all instances [puppet] - https://gerrit.wikimedia.org/r/1073173 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:29:52] PROBLEM - MariaDB Replica SQL: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:29:52] PROBLEM - MariaDB Replica IO: s5 on clouddb1016 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:30:20] PROBLEM - MariaDB Replica SQL: s8 on clouddb1016 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:30:20] PROBLEM - MariaDB Replica IO: s8 on clouddb1016 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:30:20] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:30:30] RECOVERY - mysqld processes on clouddb1016 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:30:44] RECOVERY - MariaDB read only wikireplica-s5 on clouddb1016 is OK: Version 10.6.19-MariaDB, Uptime 22s, read_only: True, event_scheduler: False, 418.00 QPS, connection latency: 0.011351s, query latency: 0.000283s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:30:44] RECOVERY - MariaDB read only s5 on clouddb1016 is OK: Version 10.6.19-MariaDB, Uptime 22s, read_only: True, event_scheduler: False, 429.81 QPS, connection latency: 0.028775s, query latency: 0.000484s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:30:44] RECOVERY - MariaDB read only s8 on clouddb1016 is OK: Version 10.6.19-MariaDB, Uptime 19s, read_only: True, event_scheduler: False, 28.05 QPS, connection latency: 0.029275s, query latency: 0.000401s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:30:44] RECOVERY - MariaDB read only wikireplica-s8 on clouddb1016 is OK: Version 10.6.19-MariaDB, Uptime 19s, read_only: True, event_scheduler: False, 46.64 QPS, connection latency: 0.018226s, query latency: 0.000621s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:30:52] RECOVERY - MariaDB Replica IO: s5 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:30:52] RECOVERY - MariaDB Replica SQL: s5 on clouddb1016 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:31:20] RECOVERY - MariaDB Replica SQL: s8 on clouddb1016 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:31:20] RECOVERY - MariaDB Replica IO: s8 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:32:20] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:32:38] RECOVERY - MariaDB Replica Lag: s8 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:33:45] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s8
[10:33:48] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s5
[10:35:23] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10148360 (phaultfinder)
[10:36:34] (CR) Brouberol: [C:+2] Upgrade the airflow-dags deb in airflow-analytics-test [puppet] - https://gerrit.wikimedia.org/r/1073166 (https://phabricator.wikimedia.org/T374396) (owner: Brouberol)
[10:39:22] (PS1) Kevin Bazira: ml-services: update rec-api image in staging and prod [deployment-charts] - https://gerrit.wikimedia.org/r/1073176 (https://phabricator.wikimedia.org/T371515)
[10:47:23] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10148384 (elukey) I checked for `RSC` in the dump that I made from Redfish, and I see the following: ` "RSC_WR_6SLOT1PCI_E4_0X16OPROM": "EFI", "RSC_W_66G4SLOT1PCI_E4_...
[10:48:09] SRE, Infrastructure-Foundations, netops, Traffic: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#10148385 (ayounsi) Note that the Bird exporter is already up and running: https://grafana.wikimedia.org/d/dxbfeGDZk/anycast We could in theory correl...
[10:49:38] (CR) Muehlenhoff: [C:+1] "LGTM, indeed okay to ignore the failure" [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[10:50:19] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1349.eqiad.wmnet
[10:50:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM chartmuseum1001.eqiad.wmnet
[10:55:18] I’m seeing “Could not resolve host: gerrit.wikimedia.org” errors in various CI jobs (not always but too often for my comfort), were there any DNS changes recently? T374830
[10:55:19] T374830: Various CI jobs failing with: Could not resolve host: gerrit.wikimedia.org - https://phabricator.wikimedia.org/T374830
[10:56:09] (CR) Btullis: [V:+1 C:+2] Add ORKG triplestore to WDQS federation allowlist [puppet] - https://gerrit.wikimedia.org/r/1072723 (https://phabricator.wikimedia.org/T366485) (owner: Btullis)
[10:56:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1349.eqiad.wmnet
[10:59:13] FIRING: [2x] JobUnavailable: Reduced availability for job chartmuseum in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:59:38] (CR) Btullis: [C:+1] "Great!. Thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[11:00:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM chartmuseum1001.eqiad.wmnet
[11:03:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:05:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:07:18] (PS2) Slyngshede: Grant permissions: Hookup LDAP permission granting. [software/bitu] - https://gerrit.wikimedia.org/r/1073162
[11:08:06] (CR) Arturo Borrero Gonzalez: [C:+2] openstack: codfw1dev: use VXLAN network as the new default for instance launch [puppet] - https://gerrit.wikimedia.org/r/1073163 (https://phabricator.wikimedia.org/T374824) (owner: Arturo Borrero Gonzalez)
[11:21:24] (CR) EoghanGaffney: [C:+1] aphlict: limit envoy srange to CACHES [puppet] - https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[11:21:26] (PS1) Muehlenhoff: Install a NOTICE file [debs/chartmuseum] - https://gerrit.wikimedia.org/r/1073183 (https://phabricator.wikimedia.org/T331969)
[11:22:52] (CR) CI reject: [V:-1] Install a NOTICE file [debs/chartmuseum] - https://gerrit.wikimedia.org/r/1073183 (https://phabricator.wikimedia.org/T331969) (owner: Muehlenhoff)
[11:23:43] SRE, Infrastructure-Foundations, netops: netbox: create IPv6 entries for Cloud VPS - https://phabricator.wikimedia.org/T374712#10148509 (aborrero) p: Triage→Medium
[11:23:56] (CR) Muehlenhoff: [C:-1] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072690 needs to be merged first (and this patch updated to use it)" [puppet] - https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[11:26:46] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10148515 (MoritzMuehlenhoff) >>! In T331969#10148243, @elukey wrote: > The reimage of 2001 went fine, I just repooled it. Let's wait for a day before moving to 1001 so if anything weird come...
[11:29:48] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Cyndywikime - https://phabricator.wikimedia.org/T374595#10148518 (Cyndymediawiksim) Hi @Vgutierrez , yes am having issues accessing superset on https://superset.wikimedia.org. See attached image below : {F57514746}
[11:32:41] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10148524 (ABran-WMF) all hosts are depoolable for this task
[11:33:56] (CR) Muehlenhoff: [C:+2] puppetserver: Pass the value of puppet_merge_server [puppet] - https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: Muehlenhoff)
[11:44:30] (PS1) EoghanGaffney: lists: Roll out nftables on both list hosts [puppet] - https://gerrit.wikimedia.org/r/1073189
[11:46:44] (CR) EoghanGaffney: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1073189 (owner: EoghanGaffney)
[11:47:35] (CR) Muehlenhoff: [C:+1] "Looks good. Note that lists1004 needs to be rebooted to fully effect the change." [puppet] - https://gerrit.wikimedia.org/r/1073189 (owner: EoghanGaffney)
[11:47:54] PROBLEM - Router interfaces on cr1-magru is CRITICAL: CRITICAL: host 195.200.68.128, interfaces up: 47, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:48:10] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:48:48] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:49:27] (CR) Ilias Sarantopoulos: [C:+1] ml-services: update rec-api image in staging and prod [deployment-charts] - https://gerrit.wikimedia.org/r/1073176 (https://phabricator.wikimedia.org/T371515) (owner: Kevin Bazira)
[11:56:10] (PS1) Hnowlan: videoscalers: enable error logging on tls terminator envoy [puppet] - https://gerrit.wikimedia.org/r/1073192 (https://phabricator.wikimedia.org/T373517)
[11:57:48] (PS6) Effie Mouzeli: app.job: update to job 3.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1072502
[11:58:04] (CR) Effie Mouzeli: app.job: update to job 3.0.0 (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1072502 (owner: Effie Mouzeli)
[11:58:46] (CR) Brouberol: [C:+2] cloudnative-pg-cluster: enable wal upload / backups to s3 by default [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[11:58:51] (CR) Brouberol: [C:+2] cloudnative-pg-cluster: setup good defaults allowing a cluster to be restored [deployment-charts] - https://gerrit.wikimedia.org/r/1072546 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[12:01:03] (CR) Effie Mouzeli: [C:+1] videoscalers: enable error logging on tls terminator envoy [puppet] - https://gerrit.wikimedia.org/r/1073192 (https://phabricator.wikimedia.org/T373517) (owner: Hnowlan)
[12:01:22] (CR) Brouberol: [C:+1] hdfs: Add new worker hosts to net_topology [puppet] - https://gerrit.wikimedia.org/r/1072660 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[12:01:48] (CR) Brouberol: [C:+1] hdfs: Assign the worker role to new hadoop workers [puppet] - https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[12:02:22] (Merged) jenkins-bot: cloudnative-pg-cluster: enable wal upload / backups to s3 by default [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[12:02:29] (Merged) jenkins-bot: cloudnative-pg-cluster: setup good defaults allowing a cluster to be restored [deployment-charts] - https://gerrit.wikimedia.org/r/1072546 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[12:02:38] (CR) Hnowlan: [C:+2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1073192 (https://phabricator.wikimedia.org/T373517) (owner: Hnowlan)
[12:02:48] (CR) Brouberol: [C:+1] Update the URL of the WikiPathways SPARQL endpoint to use HTTPS [puppet] - https://gerrit.wikimedia.org/r/1072734 (https://phabricator.wikimedia.org/T364448) (owner: Btullis)
[12:03:13] (CR) Brouberol: [C:+1] MPIC: New deployment (v0.1.5) to production [deployment-charts] - https://gerrit.wikimedia.org/r/1073152 (https://phabricator.wikimedia.org/T361346) (owner: Santiago Faci)
[12:04:48] (CR) Muehlenhoff: "LGTM, one additional comment inline" [puppet] - https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) (owner: CDobbins)
[12:13:31] (CR) Kevin Bazira: [C:+2] "Thanks for the review. :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1073176 (https://phabricator.wikimedia.org/T371515) (owner: Kevin Bazira)
[12:14:23] (Merged) jenkins-bot: ml-services: update rec-api image in staging and prod [deployment-charts] - https://gerrit.wikimedia.org/r/1073176 (https://phabricator.wikimedia.org/T371515) (owner: Kevin Bazira)
[12:17:38] (CR) Muehlenhoff: [C:+2] Only run puppetserver spec tests on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1072505 (owner: Muehlenhoff)
[12:18:38] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:19:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1004.wikimedia.org
[12:23:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1004.wikimedia.org
[12:23:30] !log installing glibc bugfix updates from bookworm 12.7 point release
[12:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:00] (PS1) Arturo Borrero Gonzalez: cloud: codfw1dev: have a new bastion host in bastion-codfw1dev-04 [puppet] - https://gerrit.wikimedia.org/r/1073205 (https://phabricator.wikimedia.org/T374828)
[12:28:50] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:29:13] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:30:19] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-analytics-product [puppet] - 10https://gerrit.wikimedia.org/r/1073167 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [12:30:39] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:31:16] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10148655 (10MoritzMuehlenhoff) [12:31:48] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10148656 (10MoritzMuehlenhoff) [12:33:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072556 (https://phabricator.wikimedia.org/T374621) (owner: 10Hamish) [12:33:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072876 (https://phabricator.wikimedia.org/T359998) (owner: 10Hamish) [12:40:05] jouncebot: refresh [12:40:05] I refreshed my knowledge about deployments. 
[12:40:07] jouncebot: now [12:40:07] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [12:40:10] jouncebot: nowandnext [12:40:10] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [12:40:10] In 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1300) [12:40:12] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-platform-eng [puppet] - 10https://gerrit.wikimedia.org/r/1073168 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [12:43:55] (03CR) 10Muehlenhoff: "These Cumin aliases don't define individual clusters, but a combination of roles and data centers? We do the same for mariadb as well (rol" [puppet] - 10https://gerrit.wikimedia.org/r/1071609 (owner: 10Muehlenhoff) [12:44:53] (03PS6) 10Muehlenhoff: puppetmaster::frontend|backend: Read the puppet-merge server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) [12:45:34] (03CR) 10Btullis: [V:03+1 C:03+2] Update the URL of the WikiPathways SPARQL endpoint to use HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/1072734 (https://phabricator.wikimedia.org/T364448) (owner: 10Btullis) [12:49:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [12:50:03] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-research [puppet] - 10https://gerrit.wikimedia.org/r/1073169 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [12:51:25] !log installing node-undici security updates [12:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:36] <_Gerges> jouncebot: next [12:51:36] In 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1300) [12:52:02] (03CR) 10Hamish: Configure 
ContactPage and IPBE contact form on zhwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072876 (https://phabricator.wikimedia.org/T359998) (owner: 10Hamish) [12:53:16] * hashar grabs a coffee [12:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:55:54] RECOVERY - Router interfaces on cr1-magru is OK: OK: host 195.200.68.128, interfaces up: 48, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:57:44] \o/ [12:57:54] _Gerges: if you are around I will start with your patch [12:58:02] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-search [puppet] - 10https://gerrit.wikimedia.org/r/1073170 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [12:58:38] <_Gerges> Here [12:58:56] PROBLEM - Router interfaces on cr1-magru is CRITICAL: CRITICAL: host 195.200.68.128, interfaces up: 47, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:59:56] there is something about running `extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --namespace` [12:59:57] ;) [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1300). [13:00:05] MatmaRex, _Gerges, and Hamishcz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072848 (https://phabricator.wikimedia.org/T374089) (owner: 10GergesShamon) [13:00:12] hi [13:00:17] o/ [13:00:18] hi! [13:00:32] hi :| [13:00:33] (03PS1) 10Hnowlan: videoscaler: bump idle_timeouts for envoy tls terminator [puppet] - 10https://gerrit.wikimedia.org/r/1073209 (https://phabricator.wikimedia.org/T373517) [13:00:51] (03Merged) 10jenkins-bot: [sewikimedia] Enable signatures in the User-namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072848 (https://phabricator.wikimedia.org/T374089) (owner: 10GergesShamon) [13:00:53] MatmaRex: I have no clue why we have to define MW_ENTRY_POINT='static' since I thought w/static.php was simply reading files from disk :D But clearly it loads the whole of MediaWiki! \o/ [13:01:09] that is the first backport of the day apparently [13:01:12] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] [13:01:14] so it might take a while :/ [13:01:17] (03CR) 10Bartosz Dziewoński: "Yes: https://gerrit.wikimedia.org/g/mediawiki/core/+/3925b14ffbc1bb95808ae2befc633f4c35cc4e6d/includes/skins/components/SkinComponentFoote" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071712 (owner: 10Bartosz Dziewoński) [13:01:17] T374089: Enable signatures in the User-namespace for se.wikimedia.org - https://phabricator.wikimedia.org/T374089 [13:01:17] ah no
i feel like maybe it shouldn't run the mediawiki startup code, but i'm definitely not trying to change that now [13:02:52] MatmaRex: definitely not -:-] [13:03:22] some container image is being pushed [13:04:02] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1073171 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [13:04:15] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=deploy1003&var-datasource=thanos&var-cluster=misc&from=now-1h&to=now&viewPanel=8 [13:04:15] hashar: I have a dev version for one of my patches, but I'm not sure the hide-if logic in it is available or not, would you want me to do a test or just sync the stable version? [13:04:23] 10SRE-swift-storage, 06Commons, 07Wikimedia-production-error: API request failed (backend-fail-internal): An unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T337991#10148790 (10GPSLeo) I do not think that this is the same bug. Here the files are indeed not de... [13:05:23] (03CR) 10MVernon: [C:04-1] "I'm not quite clear on the purpose above just using the role directly, then." [puppet] - 10https://gerrit.wikimedia.org/r/1071609 (owner: 10Muehlenhoff) [13:05:48] Hamishcz: no idea, I haven't looked at your patches :) [13:06:08] (03CR) 10Effie Mouzeli: [C:03+1] videoscaler: bump idle_timeouts for envoy tls terminator [puppet] - 10https://gerrit.wikimedia.org/r/1073209 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:06:23] okay..
then leave it alone [13:06:24] lol [13:07:17] 13:06:31 docker_pull_k8s: 68% (in-flight: 80; ok: 298; fail: 0; left: 54) - [13:07:21] still progressing [13:07:30] (03CR) 10Hnowlan: [C:03+2] videoscaler: bump idle_timeouts for envoy tls terminator [puppet] - 10https://gerrit.wikimedia.org/r/1073209 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:08:12] (03PS1) 10Daimona Eaytoy: beta: Enable CampaignEvents Community List [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073211 (https://phabricator.wikimedia.org/T374617) [13:09:29] Hamishcz: one of your change needs to be rebased if you can do that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1072556 :) [13:09:35] Hello there. Would anyone be willing to merge a beta-only patch when the current window is over? TIA! [13:10:02] sure [13:10:52] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow-dags deb in airflow-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1073172 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [13:11:18] 13:10:54 K8s deployment progress: 62% (ok: 5; fail: 0; left: 3) \ [13:11:19] :/ [13:12:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:53] 13:12:38 Finished sync-testservers-k8s (duration: 04m 01s) [13:12:57] so yeah that takes a bit of time :/ [13:13:08] weird [13:13:37] !log hashar@deploy1003 gergesshamon, hashar: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:13:37] !log hashar@deploy1003 Sync cancelled. 
[13:13:40] T374089: Enable signatures in the User-namespace for se.wikimedia.org - https://phabricator.wikimedia.org/T374089 [13:13:44] oh fuck that [13:13:55] Continue with sync? [y/N]: 13:13:36 Sync cancelled. [13:13:59] cause of course NO is the default [13:14:16] so if by mistake you have pressed enter in the terminal previously, that causes the sync to cancel [13:14:17] ... [13:14:31] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] [13:16:59] yeah. It makes you wish that for a CLI tool that has a potentially-long operation followed by a user input prompt, it might make sense to consume and discard all the pending keyboard input before displaying the prompt. [13:17:15] but then if you do that, someone will complain that they pre-pressed some input key and it didn't accept it when expected :P [13:17:35] sounds like space bar heating problem yeah [13:17:43] (but I still think the former is better, if it's a critical confirmation prompt and you want the user to read the output first) [13:17:47] I too [13:17:51] (03PS1) 10Hamish: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 (https://phabricator.wikimedia.org/T374621) [13:18:12] (03PS2) 10Hamish: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 (https://phabricator.wikimedia.org/T374621) [13:18:13] I guess scap python should do something like sys.stdin.flush() [13:18:19] before asking for input [13:18:44] “you want the user to read the output first” – agree, I think [13:18:47] (03PS1) 10Jgreen: Add payments-a-codfw.wikimedia.org 208.80.152.227 A/PTR records. 
[dns] - 10https://gerrit.wikimedia.org/r/1073213 (https://phabricator.wikimedia.org/T373942) [13:18:52] though I also wonder if it would make sense for this prompt to just not have a default [13:18:52] (03CR) 10CI reject: [V:04-1] eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 (https://phabricator.wikimedia.org/T374621) (owner: 10Hamish) [13:19:06] Continue with sync? [y/n] [13:19:10] and any other input -> repeat the question [13:20:01] !log hashar@deploy1003 hashar, gergesshamon: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:20:01] !log hashar@deploy1003 Sync cancelled. [13:20:05] T374089: Enable signatures in the User-namespace for se.wikimedia.org - https://phabricator.wikimedia.org/T374089 [13:20:08] (03PS3) 10Hamish: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 (https://phabricator.wikimedia.org/T374621) [13:20:08] fuck [13:20:10] really [13:20:13] there is no other word [13:20:18] (03CR) 10Elukey: [C:03+1] Add an explicit Hiera variable to determine the active swift ring server [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [13:20:20] * hashar logs out and tries again [13:21:04] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] [13:21:48] (03Abandoned) 10Hamish: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072556 (https://phabricator.wikimedia.org/T374621) (owner: 10Hamish) [13:22:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
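[editor's note] The prompt behaviour discussed in the log above (discard keyboard input buffered during a long operation, and require an explicit y/n with no default) can be sketched in Python. This is an illustrative sketch only, not scap's actual implementation; note that `sys.stdin.flush()` does not discard unread terminal input — `termios.tcflush` with `TCIFLUSH` is the usual way to do that on Unix.

```python
import sys
import termios


def parse_answer(answer):
    """Map user input to True/False, or None meaning 'ask again'."""
    a = answer.strip().lower()
    if a in ("y", "yes"):
        return True
    if a in ("n", "no"):
        return False
    # Bare Enter or anything else: no default, caller should re-prompt.
    return None


def confirm(prompt="Continue with sync? [y/n] "):
    """Ask for an explicit y/n; a stray Enter pressed earlier never counts."""
    if sys.stdin.isatty():
        # Throw away keystrokes buffered while the long operation ran,
        # so pre-pressed input cannot silently answer the prompt.
        termios.tcflush(sys.stdin.fileno(), termios.TCIFLUSH)
    while True:
        result = parse_answer(input(prompt))
        if result is not None:
            return result
```

With this shape, an accidental Enter during the multi-minute sync neither cancels nor confirms: the buffered input is dropped before the prompt, and an empty answer just repeats the question.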
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:22:56] (03PS2) 10Brouberol: Install airflow-dags 2.9.3-py3.10-20240916 by default on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1073173 (https://phabricator.wikimedia.org/T374396) [13:24:39] hashar: i need to step away for some 20 minutes, brb [13:24:51] MatmaRex: yeah don't worry I will handle your patches :) [13:25:09] I opened a brand new terminal and set it aside [13:25:35] !log hashar@deploy1003 gergesshamon, hashar: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:25:41] T374089: Enable signatures in the User-namespace for se.wikimedia.org - https://phabricator.wikimedia.org/T374089 [13:25:45] !log hashar@deploy1003 gergesshamon, hashar: Continuing with sync [13:25:48] (03CR) 10Brouberol: [C:03+2] Install airflow-dags 2.9.3-py3.10-20240916 by default on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1073173 (https://phabricator.wikimedia.org/T374396) (owner: 10Brouberol) [13:26:21] (03CR) 10Ssingh: [C:03+1] Add payments-a-codfw.wikimedia.org 208.80.152.227 A/PTR records. [dns] - 10https://gerrit.wikimedia.org/r/1073213 (https://phabricator.wikimedia.org/T373942) (owner: 10Jgreen) [13:26:26] and it takes roughly 4 minutes and 30 seconds of overhead before reaching that prompt [13:27:04] <_Gerges> I did the test, the signature button appears. [13:27:09] _Gerges: thank you! 
[13:30:12] hashar, I have to leave at the moment, please forget my patches for this window:) [13:30:23] sorry for any inconvenience [13:30:27] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1 [13:30:29] Hamishcz: sorry everything is so slow today :/ [13:30:31] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s3 [13:30:43] Hamishcz: I will do the throttling ones at least [13:30:47] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10148864 (10elukey) Cross-posting from T365167#10148384, where I am testing a reimage for sretest2001. On sretest2001 we have 10G/25G cap... [13:30:53] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424 [13:30:57] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [13:31:08] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424 [13:31:13] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424 [13:31:28] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424 [13:33:18] (03CR) 10Ssingh: "zone file changes look OK, I leave the hostname to your expertise :)" [dns] - 10https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741) (owner: 10Dwisehaupt) [13:33:21] (03CR) 10Ssingh: [C:03+1] frack: remove fraban2001 from dns for decommissioning [dns] - 10https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741) (owner: 
10Dwisehaupt) [13:34:48] I don't know what is happening with kubernetes today [13:34:58] it looks like everything is slower than usual [13:35:36] that one has been going for 15 minutes already :/ [13:36:39] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072848|[sewikimedia] Enable signatures in the User-namespace (T374089)]] (duration: 15m 35s) [13:36:41] !log sudo cumin "A:cp" 'disable-puppet "merging CR 1072566"' [13:36:43] T374089: Enable signatures in the User-namespace for se.wikimedia.org - https://phabricator.wikimedia.org/T374089 [13:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:52] pff so one done [13:37:03] <_Gerges> :) [13:37:25] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1017.eqiad.wmnet [13:37:25] <_Gerges> Thanks [13:37:47] !log mwmaint: mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki=sewikimedia --current --namespace 2 # T374089 [13:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072624 (https://phabricator.wikimedia.org/T374286) (owner: 10Bartosz Dziewoński) [13:38:58] (03CR) 10Ssingh: [C:03+2] haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [13:39:09] (03Merged) 10jenkins-bot: Define MW_ENTRY_POINT in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072624 (https://phabricator.wikimedia.org/T374286) (owner: 10Bartosz Dziewoński) [13:39:22] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072624|Define MW_ENTRY_POINT in static.php (T374286)]] [13:39:25] T374286: On sso.wikimedia.beta.wmflabs.org login page, the "Powered by MediaWiki" icon does not render - 
https://phabricator.wikimedia.org/T374286 [13:40:31] (03CR) 10Muehlenhoff: [C:03+2] puppetmaster::frontend|backend: Read the puppet-merge server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [13:40:36] 06SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Cyndywikime - https://phabricator.wikimedia.org/T374595#10148900 (10Vgutierrez) 05Stalled→03Declined After double checking that I get the very same errors as @Cyndymediawiksim it looks like it's an issue with that specific superset dashboa... [13:40:42] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1017.eqiad.wmnet [13:40:48] PROBLEM - MariaDB read only wikireplica-s1 on clouddb1017 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:40:48] PROBLEM - MariaDB read only s1 on clouddb1017 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:40:48] PROBLEM - MariaDB read only s3 on clouddb1017 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:40:48] PROBLEM - MariaDB read only wikireplica-s3 on clouddb1017 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:41:04] PROBLEM - mysqld processes on clouddb1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:41:24] PROBLEM - MariaDB Replica SQL: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:41:24] PROBLEM - MariaDB Replica IO: s1 on clouddb1017 is CRITICAL: CRITICAL 
slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:41:42] PROBLEM - MariaDB Replica IO: s3 on clouddb1017 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:41:42] PROBLEM - MariaDB Replica SQL: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:04] RECOVERY - mysqld processes on clouddb1017 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:42:24] RECOVERY - MariaDB Replica SQL: s1 on clouddb1017 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:24] RECOVERY - MariaDB Replica IO: s1 on clouddb1017 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:30] hashar: (back) [13:42:40] (03CR) 10Jforrester: Improve $wgFooterIcons override, remove $wmgWikimediaIcon (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071712 (owner: 10Bartosz Dziewoński) [13:42:42] RECOVERY - MariaDB Replica IO: s3 on clouddb1017 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:42] RECOVERY - MariaDB Replica SQL: s3 on clouddb1017 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:48] RECOVERY - MariaDB read only s1 on clouddb1017 is OK: Version 10.6.19-MariaDB, Uptime 51s, read_only: True, event_scheduler: False, 482.27 QPS, connection latency: 0.029644s, query latency: 0.000587s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:42:48] RECOVERY - MariaDB read 
only wikireplica-s1 on clouddb1017 is OK: Version 10.6.19-MariaDB, Uptime 51s, read_only: True, event_scheduler: False, 477.29 QPS, connection latency: 0.029675s, query latency: 0.000602s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:42:48] RECOVERY - MariaDB read only wikireplica-s3 on clouddb1017 is OK: Version 10.6.19-MariaDB, Uptime 48s, read_only: True, event_scheduler: False, 907.88 QPS, connection latency: 0.019413s, query latency: 0.000481s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:42:49] RECOVERY - MariaDB read only s3 on clouddb1017 is OK: Version 10.6.19-MariaDB, Uptime 48s, read_only: True, event_scheduler: False, 903.31 QPS, connection latency: 0.017440s, query latency: 0.000465s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:42:58] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet [reason: testing NOOP CR but depooling to be extra sure] [13:42:58] MatmaRex: the MW_ENTRY_POINT patch is being deployed [13:43:01] !log hashar@deploy1003 hashar, matmarex: Backport for [[gerrit:1072624|Define MW_ENTRY_POINT in static.php (T374286)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:43:10] (03PS8) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) [13:43:20] MatmaRex: I already ran the beta updater [13:43:23] hashar: thanks. 
currently it only affects the beta cluster [13:43:29] !log hashar@deploy1003 hashar, matmarex: Continuing with sync [13:43:31] (03CR) 10JMeybohm: [C:03+1] app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 (owner: 10Effie Mouzeli) [13:43:32] \o/ [13:43:46] beta is syncing the code https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/172385/console [13:44:03] I will do Hamishcz's throttling patch next [13:44:10] hmm no [13:44:16] let's do the default to log any error [13:45:32] (03PS1) 10Hnowlan: envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) [13:47:39] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet [reason: [done] testing NOOP CR but depooling to be extra sure] [13:48:14] (03CR) 10Effie Mouzeli: [C:03+2] app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 (owner: 10Effie Mouzeli) [13:48:32] !log sudo cumin -b11 "A:cp" 'run-puppet-agent --enable "merging CR 1072566"' [13:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:54] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s3 [13:48:58] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1 [13:49:25] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072624|Define MW_ENTRY_POINT in static.php (T374286)]] (duration: 10m 03s) [13:49:29] T374286: On sso.wikimedia.beta.wmflabs.org login page, the "Powered by MediaWiki" icon does not render - https://phabricator.wikimedia.org/T374286 [13:49:41] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s2 [13:49:45] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s7 
[13:50:04] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Upgrading mariadb on clouddb1018 T365424 [13:50:08] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [13:50:20] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Upgrading mariadb on clouddb1018 T365424 [13:51:18] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3991/console" [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:51:54] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:51:56] RECOVERY - Router interfaces on cr1-magru is OK: OK: host 195.200.68.128, interfaces up: 48, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:52:10] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:54:09] ok next [13:54:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [13:54:55] (03PS3) 10Hashar: logging: Default to log any error (on group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [13:54:55] jouncebot: now [13:54:55] For the next 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1300) [13:55:01] jouncebot: next [13:55:01] In 1 hour(s) and 34 minute(s): Wikimedia 
Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1530) [13:55:01] (03CR) 10TrainBranchBot: "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [13:55:38] effie: I am going to extend the backport window given all the slowness we had earlier [13:55:55] (03Merged) 10jenkins-bot: logging: Default to log any error (on group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [13:56:05] hashar: no problem, I saw that you lot were busy [13:56:07] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072207|logging: Default to log any error (on group0) (T228838)]] [13:56:10] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [13:56:16] tx hashar [13:57:20] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1018.eqiad.wmnet [13:57:43] (03PS9) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) [13:58:05] !log hashar@deploy1003 matmarex, hashar: Backport for [[gerrit:1072207|logging: Default to log any error (on group0) (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:58:20] and I guess I will escalate wikifunctions [13:58:25] cause it is still spamming the logs :D [13:58:30] (03PS2) 10Hnowlan: envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) [13:59:34] !log hashar@deploy1003 matmarex, hashar: Continuing with sync [14:00:18] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1018.eqiad.wmnet [14:00:22] PROBLEM - 
MariaDB Replica SQL: s7 on clouddb1018 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:00:24] PROBLEM - MariaDB Replica SQL: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:00:24] PROBLEM - MariaDB Replica IO: s7 on clouddb1018 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:00:46] PROBLEM - mysqld processes on clouddb1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:00:48] PROBLEM - MariaDB read only wikireplica-s2 on clouddb1018 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:00:48] PROBLEM - MariaDB read only wikireplica-s7 on clouddb1018 is CRITICAL: Could not connect to localhost:3317 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:00:49] PROBLEM - MariaDB read only s2 on clouddb1018 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:00:49] PROBLEM - MariaDB read only s7 on clouddb1018 is CRITICAL: Could not connect to localhost:3317 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:00:52] PROBLEM - MariaDB Replica IO: s2 on clouddb1018 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:00:55] (03CR) 10CI reject: [V:04-1] envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:01:02] (03PS3) 
10Hnowlan: envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) [14:01:44] RECOVERY - mysqld processes on clouddb1018 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:01:50] RECOVERY - MariaDB read only wikireplica-s2 on clouddb1018 is OK: Version 10.6.19-MariaDB, Uptime 43s, read_only: True, event_scheduler: False, 23.13 QPS, connection latency: 0.018652s, query latency: 0.000348s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:01:50] RECOVERY - MariaDB read only wikireplica-s7 on clouddb1018 is OK: Version 10.6.19-MariaDB, Uptime 40s, read_only: True, event_scheduler: False, 22.89 QPS, connection latency: 0.016599s, query latency: 0.000681s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:01:51] RECOVERY - MariaDB read only s2 on clouddb1018 is OK: Version 10.6.19-MariaDB, Uptime 43s, read_only: True, event_scheduler: False, 22.96 QPS, connection latency: 0.027851s, query latency: 0.000655s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:01:51] RECOVERY - MariaDB read only s7 on clouddb1018 is OK: Version 10.6.19-MariaDB, Uptime 40s, read_only: True, event_scheduler: False, 22.82 QPS, connection latency: 0.030815s, query latency: 0.000489s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:02:09] (03CR) 10Santiago Faci: [C:03+2] MPIC: New deployment (v0.1.5) to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073152 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [14:02:26] (03PS1) 10AOkoth: vrts: fix install script [puppet] - 10https://gerrit.wikimedia.org/r/1073224 (https://phabricator.wikimedia.org/T373420) [14:02:42] PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is 
CRITICAL: CRITICAL slave_sql_lag Replication lag: 740.60 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:03:17] (03Merged) 10jenkins-bot: MPIC: New deployment (v0.1.5) to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073152 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [14:03:22] RECOVERY - MariaDB Replica SQL: s7 on clouddb1018 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:03:22] RECOVERY - MariaDB Replica IO: s7 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:03:22] RECOVERY - MariaDB Replica SQL: s2 on clouddb1018 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:03:24] (03CR) 10CI reject: [V:04-1] envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:03:52] RECOVERY - MariaDB Replica IO: s2 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:04:06] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072207|logging: Default to log any error (on group0) (T228838)]] (duration: 07m 59s) [14:04:07] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Create PCC Puppet 8 nodes - https://phabricator.wikimedia.org/T374495#10149014 (10jhathaway) p:05Triage→03Medium [14:04:10] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838 [14:04:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 
(https://phabricator.wikimedia.org/T374621) (owner: 10Hamish) [14:04:55] MatmaRex: I am watching https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&refresh=5m&from=now-1h&to=now [14:05:08] thanks [14:05:21] (03Merged) 10jenkins-bot: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073212 (https://phabricator.wikimedia.org/T374621) (owner: 10Hamish) [14:05:28] hashar: i'm only half-following now because we have a meeting [14:05:33] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1073212|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374621)]] [14:05:36] T374621: Lift IP cap on this dates 27/09 and 28/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T374621 [14:05:42] RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:05:43] MatmaRex: no worries, I am watching the logs :) [14:05:47] (03PS4) 10Hnowlan: envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) [14:06:22] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s7 [14:06:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073211 (https://phabricator.wikimedia.org/T374617) (owner: 10Daimona Eaytoy) [14:06:26] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s2 [14:07:13] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" 
[puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:07:26] !log hashar@deploy1003 hamishz, hashar: Backport for [[gerrit:1073212|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374621)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:07:32] !log hashar@deploy1003 hamishz, hashar: Continuing with sync [14:07:37] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 [14:07:40] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s6 [14:09:50] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Upgrading mariadb on clouddb1019 T365424 [14:09:54] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [14:10:05] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Upgrading mariadb on clouddb1019 T365424 [14:10:54] MatmaRex: I think it is all good. Happy meeting! [14:10:55] (03PS5) 10Hnowlan: envoyproxy: add route-level idle timeout for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) [14:11:45] hashar: is it live? 
i expected to see *some* new errors :o [14:11:58] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10149030 (10joanna_borun) p:05Triage→03Low [14:12:10] I am still digging in logstash [14:12:14] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10149032 (10elukey) [14:12:14] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073212|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374621)]] (duration: 06m 41s) [14:12:17] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1073219 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:12:18] T374621: Lift IP cap on this dates 27/09 and 28/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T374621 [14:12:19] + it is only on group0 so far [14:14:16] MatmaRex: I will add a patch for group1 and deploy it tomorrow [14:15:03] I have deployed everything besides the ContactPage patch https://gerrit.wikimedia.org/r/c/1072876/ [14:15:18] !log Afternoon backport window is complete [14:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:22] (03CR) 10Volans: "Will this clear up the access?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) (owner: 10FNegri) [14:15:24] effie: ^ I am done [14:18:21] (03PS15) 10Hashar: logging: Default to log any error (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [14:18:21] (03PS1) 10Hashar: logging: Default to log any error (on group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073232 (https://phabricator.wikimedia.org/T228838) [14:18:38] thanks hashar [14:19:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073232 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [14:19:39] and it is scheduled! [14:19:47] I have no idea how that tool works but it does work! [14:20:05] (03CR) 10FNegri: "yes because https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/manifests/cumin/target.pp#34" [puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) (owner: 10FNegri) [14:20:13] swfrench-wmf: I am done with the backport window, but please sync with effie who seems to have something pending as well :) [14:20:34] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Per host access control for kerberized SSH - https://phabricator.wikimedia.org/T276790#10149062 (10joanna_borun) dependent on https://phabricator.wikimedia.org/T244840 [14:20:53] hashar: tx! 
[14:21:27] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) (owner: 10FNegri) [14:22:09] (03CR) 10Jcrespo: [C:03+1] R:wmcs::db::wikireplicas remove access from cloudcumin [puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) (owner: 10FNegri) [14:22:17] 06SRE, 06Infrastructure-Foundations, 10Mail, 07Surveys: Qualtrics cannot send email to wikimedia.org addresses - https://phabricator.wikimedia.org/T176666#10149077 (10joanna_borun) 05Open→03Declined [14:22:32] (03CR) 10Ayounsi: [C:03+2] Remove RPKI rsync alerting [alerts] - 10https://gerrit.wikimedia.org/r/1068019 (owner: 10Ayounsi) [14:23:02] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10149067 (10ayounsi) Those won't be in a VC, especially as we didn't pay for the extra VC license :) This means a bit more manual config until... [14:23:17] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Release-Engineering-Team (Seen): Support running puppet Beaker on CI - https://phabricator.wikimedia.org/T253635#10149074 (10joanna_borun) @hashar is this task still valid? 
[14:23:25] (03CR) 10FNegri: [C:03+2] R:wmcs::db::wikireplicas remove access from cloudcumin [puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) (owner: 10FNegri) [14:23:52] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374808#10149088 (10Jhancock.wm) T374422 working on it [14:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:24:08] (03Merged) 10jenkins-bot: Remove RPKI rsync alerting [alerts] - 10https://gerrit.wikimedia.org/r/1068019 (owner: 10Ayounsi) [14:24:38] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1019.eqiad.wmnet [14:25:37] 06SRE, 06Infrastructure-Foundations, 05Goal: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747#10149092 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is an old umbrella task which is no longer useful by itself. 
Closing [14:26:23] 10SRE-tools, 10Icinga, 06Infrastructure-Foundations: get-raid-status-perccli should allow for commands to return non-zero exit code - https://phabricator.wikimedia.org/T320998#10149097 (10SLyngshede-WMF) p:05Medium→03Low a:03SLyngshede-WMF [14:27:17] (03CR) 10Filippo Giunchedi: [C:03+1] ripeatlas: add ping to wmf anchors check [alerts] - 10https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [14:27:34] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, once the prometheus-equivalent alerts are deployed" [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [14:27:37] (03PS4) 10Jelto: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) [14:27:47] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1019.eqiad.wmnet [14:27:48] PROBLEM - MariaDB read only wikireplica-s4 on clouddb1019 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:27:48] PROBLEM - MariaDB read only s6 on clouddb1019 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:27:49] PROBLEM - MariaDB read only s4 on clouddb1019 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:27:49] PROBLEM - MariaDB read only wikireplica-s6 on clouddb1019 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:27:50] PROBLEM - MariaDB Replica SQL: s6 on clouddb1019 is CRITICAL: CRITICAL slave_sql_state could not connect 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:27:51] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:28:24] PROBLEM - MariaDB Replica Lag: s6 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1004.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:28:48] 06SRE, 06Infrastructure-Foundations, 10Mail: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238#10149110 (10jhathaway) 05Open→03Resolved a:03jhathaway Fixed with change in the config, also no longer relative, as we are now running Postfix [14:28:49] RECOVERY - MariaDB read only s6 on clouddb1019 is OK: Version 10.6.19-MariaDB, Uptime 40s, read_only: True, event_scheduler: False, 656.84 QPS, connection latency: 0.025223s, query latency: 0.000513s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:28:49] RECOVERY - MariaDB read only wikireplica-s4 on clouddb1019 is OK: Version 10.6.19-MariaDB, Uptime 44s, read_only: True, event_scheduler: False, 537.16 QPS, connection latency: 0.015571s, query latency: 0.000441s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:28:49] RECOVERY - MariaDB read only wikireplica-s6 on clouddb1019 is OK: Version 10.6.19-MariaDB, Uptime 40s, read_only: True, event_scheduler: False, 663.42 QPS, connection latency: 0.017064s, query latency: 0.000397s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:28:49] RECOVERY - MariaDB read only s4 on clouddb1019 is OK: Version 10.6.19-MariaDB, Uptime 44s, read_only: True, event_scheduler: False, 531.13 QPS, connection latency: 0.025268s, query latency: 0.000450s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:28:50] 06SRE, 06Infrastructure-Foundations: keyholder: continue to arm keys if one fails - https://phabricator.wikimedia.org/T227272#10149113 (10joanna_borun) 05Open→03Resolved [14:28:51] RECOVERY - MariaDB Replica SQL: s6 on clouddb1019 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:30:24] RECOVERY - MariaDB Replica Lag: s6 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:30:50] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:31:26] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s6 [14:31:29] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 [14:31:54] 06SRE, 13Patch-For-Review: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088#10149115 (10CDanis) a:03mark I think this is being handled as part of the Ownership WG [14:33:34] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Improve sre.hosts.decommission (additionally find host yaml files) - https://phabricator.wikimedia.org/T257297#10149234 (10elukey) 05Open→03Declined Probably not needed anymore :) [14:33:37] hashar: meant to say before, thank you :) [14:34:04] 10SRE-tools, 06Infrastructure-Foundations: Clarify 'wipe bootloader' step in sre.hosts.decommission - https://phabricator.wikimedia.org/T283204#10149250 (10joanna_borun) 05Open→03Declined [14:34:42] (03PS1) 10JMeybohm: Don't restart(stop,start) ferm on puppet notify, use reload instead [puppet] - 10https://gerrit.wikimedia.org/r/1073233 (https://phabricator.wikimedia.org/T374366) [14:35:46] 
(03CR) 10JMeybohm: "I tried to make the comments a bit more clear - not sure if I succeeded with that" [puppet] - 10https://gerrit.wikimedia.org/r/1073233 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [14:36:04] 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209#10149255 (10Volans) 05Open→03Resolved a:03Volans The alertmanager support has been in place for a long time. Resolving. Any additional feature wil... [14:36:06] 10SRE-tools, 06Infrastructure-Foundations: Clarify 'wipe bootloader' step in sre.hosts.decommission - https://phabricator.wikimedia.org/T283204#10149270 (10Volans) As there were no agreement here on task and multiple years have passed we decided to close it. Feel free to reopen in case there is more consen... [14:37:48] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet for datacenter switchover from codfw to eqiad [14:37:51] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) for datacenter switchover from codfw to eqiad [14:39:09] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks for datacenter switchover from codfw to eqiad [14:39:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:28] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) for datacenter switchover from codfw to eqiad [14:40:02] 06SRE, 06Infrastructure-Foundations: Simplify hiera lookup model - https://phabricator.wikimedia.org/T106404#10149291 (10joanna_borun) 05Open→03Declined [14:41:04] 06SRE, 06Infrastructure-Foundations: Simplify hiera lookup model 
- https://phabricator.wikimedia.org/T106404#10149290 (10joanna_borun) It has been working fine for now but we're open for specific proposals. [14:41:44] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl for datacenter switchover from codfw to eqiad [14:42:50] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [14:43:06] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [14:47:30] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) for datacenter switchover from codfw to eqiad [14:47:43] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from codfw to eqiad [14:47:59] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) for datacenter switchover from codfw to eqiad [14:50:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:53:56] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:54:20] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from codfw to eqiad [14:54:20] !log swfrench@cumin1002 [DRY-RUN] MediaWiki read-only period starts at: 2024-09-16 14:54:20.136310 [14:54:22] !log installing gdk-pixbuf security updates [14:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:35] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) for datacenter switchover from codfw 
to eqiad [14:54:48] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly for datacenter switchover from codfw to eqiad [14:55:23] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) for datacenter switchover from codfw to eqiad [14:56:03] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki for datacenter switchover from codfw to eqiad [14:56:17] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) for datacenter switchover from codfw to eqiad [14:57:11] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite for datacenter switchover from codfw to eqiad [14:57:15] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) for datacenter switchover from codfw to eqiad [14:57:25] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from codfw to eqiad [14:57:30] !log swfrench@cumin1002 [DRY-RUN] MediaWiki read-only period ends at: 2024-09-16 14:57:30.267664 [14:57:31] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) for datacenter switchover from codfw to eqiad [14:57:48] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner for datacenter switchover from codfw to eqiad [14:57:51] !log root@deploy1003 helmfile [codfw] START helmfile.d/services/mw-jobrunner: sync [14:58:26] !log root@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: sync [14:58:28] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=0) for datacenter switchover from codfw to eqiad [14:59:08] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from codfw to eqiad [15:01:11] !log swfrench@cumin1002 END (PASS) - 
Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) for datacenter switchover from codfw to eqiad [15:01:23] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl for datacenter switchover from codfw to eqiad [15:02:03] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) for datacenter switchover from codfw to eqiad [15:03:17] !log swfrench@cumin1002 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters for datacenter switchover from codfw to eqiad [15:04:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:32] i wondered what generates more logs, group0 wikis or the beta cluster. it looks like group0 is about 2x the volume (50k vs 25k logging messages per hour). 
[15:13:47] !log swfrench@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) for datacenter switchover from codfw to eqiad [15:25:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69134 and previous config saved to /var/cache/conftool/dbconfig/20240916-152556-arnaudb.json [15:26:00] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [15:26:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69135 and previous config saved to /var/cache/conftool/dbconfig/20240916-152601-arnaudb.json [15:26:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69136 and previous config saved to /var/cache/conftool/dbconfig/20240916-152606-arnaudb.json [15:26:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69137 and previous config saved to /var/cache/conftool/dbconfig/20240916-152611-arnaudb.json [15:26:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69138 and previous config saved to /var/cache/conftool/dbconfig/20240916-152616-arnaudb.json [15:26:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69139 and previous config saved to /var/cache/conftool/dbconfig/20240916-152621-arnaudb.json [15:26:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 1%: T374623', diff saved to https://phabricator.wikimedia.org/P69140 and previous config saved to /var/cache/conftool/dbconfig/20240916-152626-arnaudb.json [15:30:05] jan_drewniak: May I have your attention please! Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1530) [15:34:00] !log dancy@deploy1003 Started deploy [releng/phatality@8ddb2fa]: (no justification provided) [15:34:16] !log dancy@deploy1003 Finished deploy [releng/phatality@8ddb2fa]: (no justification provided) (duration: 00m 15s) [15:36:28] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash2032.codfw.wmnet, logstash2030.codfw.wmnet, logstash2024.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1030.eqiad.wmnet, logstash1031.eqiad.wmnet, logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:36] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash2031.codfw.wmnet, logstash2032.codfw.wmnet, logstash2030.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:44] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1032.eqiad.wmnet, logstash1030.eqiad.wmnet, logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:57] FIRING: [2x] ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:21] !incidents [15:37:22] 5171 (UNACKED) [2x] ProbeDown sre (ip4 kibana7:443 probes/service http_kibana7_ip4) [15:37:31] !ack 5171 [15:37:32] 5171 (ACKED) [2x] ProbeDown sre (ip4 kibana7:443 probes/service http_kibana7_ip4) [15:37:36] thanks [15:37:40] you were faster [15:37:50] :D [15:38:29] 
yep it's down all right [15:39:02] any o11y folks around, doing work on ELK right now? [15:39:02] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:39:57] there is an alert on their chan just at the time of the issue: FIRING: [12x] SystemdUnitFailed: opensearch-dashboards.service [15:40:00] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:38] dancy: I think your phatality deploy might be the trigger here, are you looking at that already? [15:40:53] ah the runbook is missing from https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 [15:41:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69141 and previous config saved to /var/cache/conftool/dbconfig/20240916-154101-arnaudb.json [15:41:06] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [15:41:06] That's quite possible. Digging in now. 
[15:41:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69142 and previous config saved to /var/cache/conftool/dbconfig/20240916-154106-arnaudb.json [15:41:09] hey, no not aware of any work no [15:41:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69143 and previous config saved to /var/cache/conftool/dbconfig/20240916-154112-arnaudb.json [15:41:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69144 and previous config saved to /var/cache/conftool/dbconfig/20240916-154116-arnaudb.json [15:41:17] dancy: thanks <3 lmk if you need anything [15:41:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69145 and previous config saved to /var/cache/conftool/dbconfig/20240916-154121-arnaudb.json [15:41:26] thanks herron [15:41:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69146 and previous config saved to /var/cache/conftool/dbconfig/20240916-154127-arnaudb.json [15:41:28] I got sudo warnings during the deployment. [15:41:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 2%: T374623', diff saved to https://phabricator.wikimedia.org/P69147 and previous config saved to /var/cache/conftool/dbconfig/20240916-154132-arnaudb.json [15:41:47] (sorry for the spamlog 😬) [15:41:51] !log dancy@deploy1003 Started deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 [15:41:52] only ELK unavailability, right ATM ? 
[15:42:00] as far as o11y show yep [15:42:13] !log dancy@deploy1003 Finished deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 (duration: 00m 21s) [15:42:16] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:42:18] hrm [15:42:26] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [15:42:39] 👀 [15:42:55] Sep 16 15:42:14 logstash1023 opensearch-dashboards[2609202]: {"type":"log","@timestamp":"2024-09-16T15:42:14Z","tags":["fatal","root"],"pid":2609202,"message":"Error: Plugin with id \"phatality\" is already registered!\n at MergeMapSubscriber.project [15:43:51] Can someone try running `/usr/bin/systemctl restart opensearch-dashboards` on one of the problem hosts? [15:44:02] can do, stand by [15:44:06] thx [15:44:51] !log rzl@logstash1032:~$ sudo systemctl restart opensearch-dashboards [15:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:30] looks like it's still dying with the same error herron pasted [15:45:50] hrmm.. ok.. I will try rolling back... [15:46:33] !log dancy@deploy1003 Started deploy [releng/phatality@b1a2a70]: Attempting to revert [15:46:40] !log dancy@deploy1003 Finished deploy [releng/phatality@b1a2a70]: Attempting to revert (duration: 00m 06s) [15:46:57] want another bounce? [15:47:13] * swfrench-wmf is here as well now, but in a holding pattern for the moment [15:47:16] The revert deployment seemed to be successful (no complaints, as opposed to the original attempt) [15:47:48] swfrench-wmf: do you mind getting a status doc open? 
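(Editorial aside.) The fatal error pasted above from logstash1023's journal is one JSON record per line in the opensearch-dashboards log format. A minimal sketch of pulling the "fatal"-tagged messages out of such a dump; the sample record is abridged from the paste above and the helper name is illustrative:

```python
import json

def fatal_messages(journal_lines):
    """Return the 'message' field of every JSON log record tagged 'fatal'."""
    out = []
    for line in journal_lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # skip non-JSON journal noise (systemd prefixes, tracebacks)
        if "fatal" in rec.get("tags", []):
            out.append(rec["message"])
    return out

# Abridged from the logstash1023 journal paste above
sample = ['{"type":"log","tags":["fatal","root"],"pid":2609202,'
          '"message":"Error: Plugin with id \\"phatality\\" is already registered!"}']
print(fatal_messages(sample))
```

Filtering on the `tags` array rather than grepping free text keeps tracebacks and info-level chatter out of the result.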
[15:47:48] Another bounce couldn't hurt for verification [15:47:55] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s5 [15:48:02] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s8 [15:48:04] rzl: ack, can do [15:48:26] !log rzl@logstash1032:~$ sudo systemctl restart opensearch-dashboards [15:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:41] dancy: nop, same error [15:48:51] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrading mariadb on clouddb1020 T365424 [15:48:54] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [15:49:01] !log logstash1023:/usr/share/opensearch-dashboards/bin# /usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin remove phatality [15:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:07] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrading mariadb on clouddb1020 T365424 [15:49:12] trying this as a stopgap ^^ [15:49:44] FIRING: ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:51:57] FIRING: [2x] ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:22] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1020.eqiad.wmnet [15:52:31] herron, dancy: 
logstash1023 is looking healthy after that, should we do it everywhere? [15:52:41] Yes please [15:52:59] herron: do you want to cumin that out or shall I? [15:53:21] RESOLVED: ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:25] rzl: go for it, fwiw I also just depooled logstash1032 and am able to get to the logstash UI at this point [15:53:38] oh, sweet [15:53:39] that 502 seemed to defy health checks? I'm not sure off hand [15:53:52] actually we can leave 1032 depooled and unfixed if dancy wants it for investigation [15:54:07] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:54:26] That would be convenient. The deployment script clearly needs some work. 
[15:55:23] ok, I'll leave things as-is for the time being [15:55:44] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1020.eqiad.wmnet [15:55:45] PROBLEM - MariaDB Replica SQL: s8 on clouddb1020 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:55:49] PROBLEM - MariaDB read only s8 on clouddb1020 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:55:49] PROBLEM - MariaDB read only wikireplica-s5 on clouddb1020 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:55:49] PROBLEM - MariaDB read only wikireplica-s8 on clouddb1020 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:55:49] PROBLEM - MariaDB read only s5 on clouddb1020 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:56:03] PROBLEM - mysqld processes on clouddb1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:56:08] hashar, sorry i was in the middle of something IRL, thank you for the deployment [15:56:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69148 and previous config saved to /var/cache/conftool/dbconfig/20240916-155607-arnaudb.json [15:56:11] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [15:56:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69149 and previous config saved to 
/var/cache/conftool/dbconfig/20240916-155612-arnaudb.json [15:56:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69150 and previous config saved to /var/cache/conftool/dbconfig/20240916-155617-arnaudb.json [15:56:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69151 and previous config saved to /var/cache/conftool/dbconfig/20240916-155622-arnaudb.json [15:56:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69152 and previous config saved to /var/cache/conftool/dbconfig/20240916-155627-arnaudb.json [15:56:29] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69153 and previous config saved to /var/cache/conftool/dbconfig/20240916-155632-arnaudb.json [15:56:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 4%: T374623', diff saved to https://phabricator.wikimedia.org/P69154 and previous config saved to /var/cache/conftool/dbconfig/20240916-155637-arnaudb.json [15:56:51] RECOVERY - MariaDB read only wikireplica-s5 on clouddb1020 is OK: Version 10.6.19-MariaDB, Uptime 2s, read_only: True, event_scheduler: False, 22.76 QPS, connection latency: 0.023102s, query latency: 0.007230s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:56:51] RECOVERY - MariaDB read only s5 on clouddb1020 is OK: Version 10.6.19-MariaDB, Uptime 2s, read_only: True, event_scheduler: False, 22.91 QPS, connection latency: 0.027916s, query latency: 0.000614s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:57:03] RECOVERY - mysqld processes on clouddb1020 is OK: PROCS 
OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:57:20] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:57:45] RECOVERY - MariaDB Replica SQL: s8 on clouddb1020 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:51] RECOVERY - MariaDB read only wikireplica-s8 on clouddb1020 is OK: Version 10.6.19-MariaDB, Uptime 58s, read_only: True, event_scheduler: False, 1063.60 QPS, connection latency: 0.011424s, query latency: 0.000285s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:57:51] RECOVERY - MariaDB read only s8 on clouddb1020 is OK: Version 10.6.19-MariaDB, Uptime 58s, read_only: True, event_scheduler: False, 1125.21 QPS, connection latency: 0.012967s, query latency: 0.000319s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:57:56] herron: sanity check? I'm about to run `cumin 'O:logging::opensearch::collector and not logstash1032.eqiad.wmnet' '/usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin remove phatality'` [15:58:26] rzl: Please leave the broken one installed. 
[15:58:45] yep, `and not logstash1032` will exclude it [15:58:50] gotcha [15:59:21] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org,service=recdns [15:59:32] rzl: it may need --allow-root as well [15:59:42] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org,service=recdns [15:59:46] I tried to run as opensearch-dashboards myself but got a homedir error so just ran with --allow-root [15:59:46] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s8 [15:59:49] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s5 [15:59:57] ack, thanks [16:00:37] !log rzl@cumin1002:~$ sudo cumin 'O:logging::opensearch::collector and not logstash1032.eqiad.wmnet' '/usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin --allow-root remove phatality' [16:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:03] failed on logstash1023 with `Unable to remove plugin because of error: "Plugin [phatality] is not installed"`, expected -- succeeded everywhere else [16:01:31] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [16:02:12] rzl: nice on thank you [16:02:15] one* [16:06:08] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:07:53] !log testing strict mode on puppetservers [16:07:55] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:53] herron: any reason I shouldn't restart opensearch-dashboards on all those hosts? [16:09:02] puppet will do it, but any reason it needs to be staggered? [16:09:09] rzl: I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073253 deals w/ the root cause. 
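(Editorial aside.) The fleet-wide cumin run above "failed" on logstash1023 only because the plugin had already been removed there by hand, so that exit is really a success. If this stopgap were ever scripted, the "not installed" case should be treated as idempotent. A sketch under that assumption; the function name is hypothetical, while the binary path, `--allow-root` flag, and error string are taken from the log above:

```python
import subprocess

# Path and flag as used in the cumin run above
PLUGIN_BIN = "/usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin"

def remove_plugin(name, runner=subprocess.run):
    """Remove a dashboards plugin, treating 'not installed' as already done."""
    proc = runner([PLUGIN_BIN, "--allow-root", "remove", name],
                  capture_output=True, text=True)
    if proc.returncode == 0:
        return "removed"
    if "is not installed" in (proc.stdout + proc.stderr):
        return "already-absent"  # the logstash1023 case: removed by hand earlier
    raise RuntimeError(proc.stderr.strip() or proc.stdout.strip())
```

The injectable `runner` is only there so the logic can be exercised without the real binary installed.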
[16:09:13] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [16:09:33] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:09:40] rzl: it looks like probes are still failing in codfw, were those hosts in the set your cumin run completed on? [16:10:02] swfrench-wmf: yes, but they'll need the systemd unit restarted too [16:10:09] I'm going to JFDI [16:10:11] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:10:21] rzl: I'd say check status first, I think systemd did the right thing? [16:10:26] rzl: ah, got it - ack [16:10:27] holding [16:10:44] herron: I see it `running` on some hosts, `failed` on others [16:11:02] running on the hosts where we've either restarted it by hand or puppet has run in the meantime [16:11:13] rzl: kk yeah I think if we can limit to the failed hosts that'd be ideal [16:11:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69155 and previous config saved to /var/cache/conftool/dbconfig/20240916-161113-arnaudb.json [16:11:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69156 and previous config saved to /var/cache/conftool/dbconfig/20240916-161117-arnaudb.json [16:11:18] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [16:11:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69157 and previous config saved to /var/cache/conftool/dbconfig/20240916-161123-arnaudb.json [16:11:24] herron: sure, here goes [16:11:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 8%: 
T374623', diff saved to https://phabricator.wikimedia.org/P69158 and previous config saved to /var/cache/conftool/dbconfig/20240916-161128-arnaudb.json [16:11:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69159 and previous config saved to /var/cache/conftool/dbconfig/20240916-161133-arnaudb.json [16:11:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69160 and previous config saved to /var/cache/conftool/dbconfig/20240916-161138-arnaudb.json [16:11:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 8%: T374623', diff saved to https://phabricator.wikimedia.org/P69161 and previous config saved to /var/cache/conftool/dbconfig/20240916-161143-arnaudb.json [16:12:27] !log rzl@cumin1002:~$ sudo cumin logstash[2023,2025,2030-2032].codfw.wmnet,logstash[1025,1030,1032].eqiad.wmnet 'systemctl restart opensearch-dashboards' # only hosts where status is failed [16:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:39] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:12:39] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:12:43] dancy: and seen, thanks, I'll take a proper look in a sec [16:12:45] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:13:18] oops I should have excluded 1032 from that restart, but it's a no-op anyway [16:13:31] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:14:09] I'm now seeing all hosts but 1032 healthy, and we should be fully recovered -- anyone still see impact? 
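(Editorial aside.) The restart above was deliberately limited to hosts whose unit was still `failed`, skipping the ones a manual restart or an intervening puppet run had already fixed. The selection step can be sketched like this; the state map is illustrative, as if collected via `systemctl is-failed opensearch-dashboards`:

```python
def hosts_to_restart(unit_state_by_host):
    """Hosts whose opensearch-dashboards unit is still 'failed'
    (skip ones already fixed by hand or by a puppet run)."""
    return sorted(h for h, state in unit_state_by_host.items() if state == "failed")

# Illustrative states, not the actual fleet snapshot from the incident
states = {
    "logstash1023.eqiad.wmnet": "active",   # restarted by hand earlier
    "logstash1025.eqiad.wmnet": "failed",
    "logstash2030.codfw.wmnet": "failed",
}
print(hosts_to_restart(states))
```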
[16:14:51] awesome thank you rzl [16:15:11] thank you, rzl! [16:15:25] Thanks! That was stressful [16:16:11] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:16:24] I do wonder how `upgrade-phatality.sh` ever worked before. [16:16:34] great question [16:16:57] RESOLVED: ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:17:19] Last change to it was in 2022. Maybe it hasn't worked since then. :-) [16:18:06] hmm.. doesn't look like sudo is even required for the list command. [16:18:40] yeah I just noticed the same [16:18:47] I'll update the script. [16:19:09] cool -- if you do end up wanting to merge the sudoers change let me know [16:21:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073256 [16:22:18] !log ebernhardson@deploy1003 Started deploy [airflow-dags/search@5ad6710]: standardize created file permissions [16:22:41] !log ebernhardson@deploy1003 Finished deploy [airflow-dags/search@5ad6710]: standardize created file permissions (duration: 00m 22s) [16:23:51] dancy: LGTM, ready for me to merge it or do you want any other review first?
[16:24:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:25:00] rzl: Merge please [16:25:24] widespread puppet failures in eqiad [16:25:30] I suspect it is puppetserver1002 acting up again [16:25:41] yeah let's get that figured out first but then I'll go ahead [16:26:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69162 and previous config saved to /var/cache/conftool/dbconfig/20240916-162618-arnaudb.json [16:26:23] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [16:26:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69163 and previous config saved to /var/cache/conftool/dbconfig/20240916-162623-arnaudb.json [16:26:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69164 and previous config saved to /var/cache/conftool/dbconfig/20240916-162629-arnaudb.json [16:26:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69165 and previous config saved to /var/cache/conftool/dbconfig/20240916-162633-arnaudb.json [16:26:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69166 and previous config saved to /var/cache/conftool/dbconfig/20240916-162638-arnaudb.json [16:26:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69167 and previous config saved to /var/cache/conftool/dbconfig/20240916-162644-arnaudb.json [16:26:49] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 16%: T374623', diff saved to https://phabricator.wikimedia.org/P69168 and previous config saved to /var/cache/conftool/dbconfig/20240916-162649-arnaudb.json [16:27:16] herron: Just to make sure I understand what you said on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073253, all calls to opensearch-dashboards-plugin should pass `--allow-root` ? [16:27:41] sukhe: I'm just picking up context, is that https://phabricator.wikimedia.org/T373527? [16:27:47] looking at puppet failures, seeing a lot of connection issues to puppetserver1003 [16:27:58] dancy: yes afaik, based on the removes we ran just today. remove errored out at first without the flag [16:28:09] ok will do [16:28:20] dancy: ty! [16:28:39] rzl: swfrench-wmf: seeing both 1002 and 1003 and also noticed that puppet is disabled on these hosts [16:28:42] jhathaway: ^ [16:28:52] dancy: wait no I'm wrong [16:28:58] rzl: I don't think it is thrashing in this case though but I can be wrong [16:29:03] * dancy waits. [16:29:16] sukhe: ah, interesting - ack [16:29:22] just restarted the puppetservers, to test strict variables, errors should recover, but if they don't I will revert [16:29:37] thanks jhathaway [16:29:41] thanks sukhe for the ping [16:29:44] <3 [16:29:48] awesome [16:30:35] dancy: yeah sorry about that the sudo isn't running as root so nevermind me! [16:30:40] ah, makes sense. [16:31:12] herron: Please add your +1 if you're cool w/ the change [16:32:36] dancy: +1'd! [16:33:57] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [16:34:11] Thanks! [16:35:33] RECOVERY - Host gerrit1004 is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms [16:36:03] jhathaway: I'm interested and a little unsettled that it was 1002 and 1003 at the same time this time [16:37:17] rzl: sorry haven't fully grokked the back log, what happened at the same time? 
just that both puppetserver1002 and 1003 started thrashing at around the same time [16:38:25] where previously we'd seen it for individual hosts AIUI [16:38:36] not sure if that makes it a coincidence or something cascadey [16:39:09] (no need to dig into the phatality stuff in the backlog, it's causally unrelated) [16:39:47] I restarted puppetserver on 1002 & 1003 at around the same time, perhaps a diff of 20secs [16:40:23] *oh* I misunderstood, I thought you restarted them because of the errors [16:40:36] but no, we were just getting transient errors because they were mid-restart [16:40:42] never mind that comment then :) [16:41:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69169 and previous config saved to /var/cache/conftool/dbconfig/20240916-164124-arnaudb.json [16:41:29] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [16:41:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69170 and previous config saved to /var/cache/conftool/dbconfig/20240916-164129-arnaudb.json [16:41:32] no prob, I shouldn't have announced more widely, I didn't know a restart would generate that many failures, seems like there should be a more graceful method [16:41:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69171 and previous config saved to /var/cache/conftool/dbconfig/20240916-164134-arnaudb.json [16:41:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69172 and previous config saved to /var/cache/conftool/dbconfig/20240916-164139-arnaudb.json [16:41:40] okay in that case I'm going to go ahead and merge dancy's patch wrt the previous outage [16:41:45] !log arnaudb@cumin1002 dbctl commit (dc=all):
'db2227 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69173 and previous config saved to /var/cache/conftool/dbconfig/20240916-164144-arnaudb.json [16:41:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69174 and previous config saved to /var/cache/conftool/dbconfig/20240916-164149-arnaudb.json [16:41:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 25%: T374623', diff saved to https://phabricator.wikimedia.org/P69175 and previous config saved to /var/cache/conftool/dbconfig/20240916-164154-arnaudb.json [16:41:57] PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:43:40] I am not sure yet why it's down but that is NOT the production gerrit, it's in setup [16:44:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:45:58] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [16:48:26] rzl: I'm going to take a break for a bit. Can I schedule a time with you to retry the prior deployment? 11am pacific? 
[16:48:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [16:49:02] dancy: sure, works for me -- I also just ran puppet on logstash1032, still depooled, so you can test there at will [16:49:13] ah good I'll try that right now [16:54:26] !log dancy@deploy1003 Installing scap version "4.102.0" for 211 hosts [16:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:56:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69176 and previous config saved to /var/cache/conftool/dbconfig/20240916-165630-arnaudb.json [16:56:35] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [16:56:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69177 and previous config saved to /var/cache/conftool/dbconfig/20240916-165635-arnaudb.json [16:56:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69178 and previous config saved to /var/cache/conftool/dbconfig/20240916-165640-arnaudb.json [16:56:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69179 and previous config saved to /var/cache/conftool/dbconfig/20240916-165645-arnaudb.json [16:56:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69180 and previous config saved to /var/cache/conftool/dbconfig/20240916-165650-arnaudb.json [16:56:56] !log arnaudb@cumin1002 
dbctl commit (dc=all): 'db2237 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69181 and previous config saved to /var/cache/conftool/dbconfig/20240916-165655-arnaudb.json [16:57:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 50%: T374623', diff saved to https://phabricator.wikimedia.org/P69182 and previous config saved to /var/cache/conftool/dbconfig/20240916-165700-arnaudb.json [16:58:40] !log dancy@deploy1003 Installation of scap version "4.102.0" completed for 211 hosts [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1700) [17:00:04] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T1700). [17:03:10] !log dancy@deploy1003 Installing scap version "4.101.3" for 211 hosts [17:03:29] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [17:05:13] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [17:06:03] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.529 second response time https://wikitech.wikimedia.org/wiki/Docker [17:08:03] !log dancy@deploy1003 Installing scap version "4.101.3" for 1 hosts [17:11:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69183 and previous config saved to /var/cache/conftool/dbconfig/20240916-171136-arnaudb.json [17:11:40] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623 [17:11:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 
75%: T374623', diff saved to https://phabricator.wikimedia.org/P69184 and previous config saved to /var/cache/conftool/dbconfig/20240916-171140-arnaudb.json [17:11:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69185 and previous config saved to /var/cache/conftool/dbconfig/20240916-171146-arnaudb.json [17:11:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69186 and previous config saved to /var/cache/conftool/dbconfig/20240916-171150-arnaudb.json [17:11:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69187 and previous config saved to /var/cache/conftool/dbconfig/20240916-171155-arnaudb.json [17:12:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69188 and previous config saved to /var/cache/conftool/dbconfig/20240916-171201-arnaudb.json [17:12:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 75%: T374623', diff saved to https://phabricator.wikimedia.org/P69189 and previous config saved to /var/cache/conftool/dbconfig/20240916-171206-arnaudb.json [17:12:55] !log dancy@deploy1003 Installing scap version "4.101.3" for 2 hosts [17:14:32] !log dancy@deploy1003 Installation of scap version "4.101.3" completed for 2 hosts [17:16:22] !log dancy@deploy1003 Started deploy [releng/phatality@b1a2a70]: testing [17:16:27] !log dancy@deploy1003 Finished deploy [releng/phatality@b1a2a70]: testing (duration: 00m 05s) [17:17:17] rzl: Initial testing on logstash1032 looks good. I'll regroup with you at 11 for full deployment. 
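(Editorial aside.) The arnaudb dbctl entries throughout this log walk each replica back in through a fixed ramp (2 → 4 → 8 → 16 → 25 → 50 → 75 → 100%), committing every step for every host before moving to the next percentage. A sketch of that pattern; the step list is read directly off the log lines, and the actual sre cookbook may compute its schedule differently:

```python
# Percentages observed in the dbctl commit messages above
RAMP = [2, 4, 8, 16, 25, 50, 75, 100]

def repool_plan(hosts, task):
    """One commit message per (step, host), mirroring the ordering in the log."""
    return [f"{h} (re)pooling @ {pct}%: {task}" for pct in RAMP for h in hosts]

plan = repool_plan(["db2221", "db2222"], "T374623")
print(plan[0])
print(len(plan))
```

Ramping all hosts through each step together, rather than one host at a time to 100%, keeps the extra load spread across the whole set at every stage.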
[17:21:40] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@5ad6710]: (no justification provided)
[17:22:25] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@5ad6710]: (no justification provided) (duration: 00m 44s)
[17:26:31] dancy: sgtm, thanks!
[17:26:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69190 and previous config saved to /var/cache/conftool/dbconfig/20240916-172641-arnaudb.json
[17:26:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69191 and previous config saved to /var/cache/conftool/dbconfig/20240916-172646-arnaudb.json
[17:26:49] T374623: Decommission db21[21-40] - https://phabricator.wikimedia.org/T374623
[17:26:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69192 and previous config saved to /var/cache/conftool/dbconfig/20240916-172651-arnaudb.json
[17:26:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69193 and previous config saved to /var/cache/conftool/dbconfig/20240916-172656-arnaudb.json
[17:27:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69194 and previous config saved to /var/cache/conftool/dbconfig/20240916-172701-arnaudb.json
[17:27:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2237 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69195 and previous config saved to /var/cache/conftool/dbconfig/20240916-172706-arnaudb.json
[17:27:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 100%: T374623', diff saved to https://phabricator.wikimedia.org/P69196 and previous config saved to /var/cache/conftool/dbconfig/20240916-172712-arnaudb.json
[17:37:18] so that host gerrit1004 that was alerting as host down.. that should not be in monitoring at all
[17:37:46] the rename cookbook should have taken care of that PLUS we already manually ran a 'puppet node clean' when it was still in puppetdb
[17:37:55] somehow it's still there.. not sure why
[17:38:12] spooky
[17:38:40] also ran the clean command on both puppetmaster and puppetserver.. so yea..
[17:39:48] maybe I'll go to alert* and manually delete it from Icinga config and run puppet to see if it comes back or not
[17:39:49] mutante: seems like puppetdb still has the node
[17:39:54] https://puppetboard.wikimedia.org/node/gerrit1004.wikimedia.org
[17:40:38] sukhe: but also the dates for catalog run are over a week ago
[17:40:46] on that link
[17:41:02] yeah. but if you try say gerrit1005, it complains about the node not being there at all
[17:41:07] and when I checked for "how to delete from puppetdb" it said "node clean". right?
[17:41:08] so it probably is still somewhere?
[17:41:24] would you know other ways to delete from the db?
[17:42:05] this host was renamed with the rename cookbook so it seems like a bug
[17:42:11] not off-hand. but maybe we can look at what the decommission cookbook does?
[17:42:21] true, let me do that
[17:42:59] I thought I did that.. but not sure now
[17:43:08] puppet node clean
[17:43:11] puppet node deactivate
[17:43:14] https://doc.wikimedia.org/spicerack/v8.8.0/_modules/spicerack/puppet.html#PuppetServer.delete
[17:43:29] also apparently, both on server and master
[17:43:32] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/decommission.py
[17:43:35] puppet_master.delete(fqdn)
[17:43:36] puppet_server.delete(fqdn)
[17:43:39] ah, good call to check the source
[17:43:41] which calls the spicerack function above
[17:43:44] trying
[17:43:54] mutante: is it possible that you ran the rename before https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1071588 was merged?
[17:44:14] I did not run the rename, dcops did. but checking
[17:44:36] (guessing by the dates on puppetboard, that sounds plausible)
[17:44:49] swfrench-wmf: yes, it was before that fix :)
[17:45:05] well, that is great, no need for a new bug report or wondering, yay
[17:45:08] thanks
[17:45:55] Submitted 'deactivate node' for gerrit1004.wikimedia.org with UUID 7fb4f744-f07d-448e-ad5c-37539b1f334c
[17:46:13] no problem - all thanks goes to c.laime for finding and fixing that (all the renames we've been doing for wikikube workers shook out a lot of interesting things)
[17:46:19] mutante: nice!
[17:46:31] so I had done "clean" but not "deactivate" and now done on both master and server
[17:46:37] thanks all
[17:46:48] checking icinga
[17:49:35] yea, I can see puppet removing the icinga config snippets
[17:51:29] (CR) Jgreen: [C:+2] frack: remove fraban2001 from dns for decommissioning [dns] - https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741) (owner: Dwisehaupt)
[17:51:32] ops-eqiad, SRE, collaboration-services, DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10150402 (Dzahn) >>! In T372817#10129107, @MoritzMuehlenhoff wrote: > @Dzahn gerrit1004 is still in puppetdb: https://puppetboard.wikimedia.org/...
[17:52:02] (PS2) Dwisehaupt: frack: remove fraban2001 from dns for decommissioning [dns] - https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741)
[17:52:49] (CR) Dzahn: "thanks for this. also ran into it with a renamed host and was wondering for a bit." [cookbooks] - https://gerrit.wikimedia.org/r/1071588 (https://phabricator.wikimedia.org/T374351) (owner: Clément Goubert)
[17:53:53] (Abandoned) Dzahn: network: introduce a list of friendly networks [puppet] - https://gerrit.wikimedia.org/r/1069387 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[17:55:17] (PS3) Dwisehaupt: frack: remove frban2001 from dns for decommissioning [dns] - https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741)
[17:56:48] (CR) Dwisehaupt: "recheck" [dns] - https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741) (owner: Dwisehaupt)
[18:01:10] dancy: standing by, no rush
[18:02:51] ops-codfw, DC-Ops, decommission-hardware, fundraising-tech-ops, Patch-For-Review: decommission frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374741#10150411 (Dwisehaupt) a: Dwisehaupt→None
[18:05:17] rzl: Retrying
[18:05:45] !log dancy@deploy1003 Started deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836
[18:06:31] !log dancy@deploy1003 Finished deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 (duration: 00m 46s)
[18:06:53] looking good so far
[18:06:59] !log dancy@deploy1003 Started deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836
[18:07:04] !log dancy@deploy1003 Finished deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 (duration: 00m 04s)
[18:07:20] (CR) Volans: [C:+1] "LGTM, nice catch!" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:08:29] https://www.irccloud.com/pastebin/QO2CnBa3/
[18:09:11] (CR) Ssingh: "One question: you should not have been hitting dns2006 when it was unreachable during this period. It was depooled for all services, so sh" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:09:17] Seems like `restart_dashboards` should not run if `install_zip` fails.. I'll file a ticket for that issue.
[18:09:48] oh, makes sense
[18:10:05] I saw the service restart and figured that was good news :P
[18:10:29] (CR) Dzahn: [C:-1] "This is outdated since meanwhile we switched from counting packets to counting connections." [puppet] - https://gerrit.wikimedia.org/r/1059156 (owner: Dzahn)
[18:11:58] (CR) Dzahn: [C:-1] "yep, waiting for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072690" [puppet] - https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[18:12:17] (CR) Dzahn: [C:-1] "needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072690 first" [puppet] - https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[18:12:34] (CR) Ssingh: "Thanks! Is it fine to abandon this then?" [puppet] - https://gerrit.wikimedia.org/r/1059156 (owner: Dzahn)
[18:13:02] (PS13) Dzahn: phabricator: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677)
[18:13:18] (CR) Dzahn: "need to scheduled a downtime for the needed reboot" [puppet] - https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[18:14:31] (CR) Ssingh: "https://sal.toolforge.org/log/16DX5pEBFFSCpsJztaYX the exact time when it was depooled if it helps! (Since I am not sure when the cookbook" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:16:40] (PS1) Volans: mysql_legacy: small fixes [software/spicerack] - https://gerrit.wikimedia.org/r/1073274
[18:16:47] (CR) Volans: [C:+2] mysql_legacy: instance improvements (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[18:17:17] (PS9) Volans: sre.switchdc.databases: new cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351)
[18:18:43] (CR) Volans: "Replies inline, CI failure is because the change in spicerack has not yet been released." [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[18:19:59] (CR) Dzahn: [C:-1] "per IRC chat: I will amend" [puppet] - https://gerrit.wikimedia.org/r/1059156 (owner: Dzahn)
[18:21:12] rzl: Can you `chown -R opensearch-dashboards: /usr/share/opensearch-dashboards/plugins/phatality` on all the logstash hosts?
[18:22:05] yep
[18:22:34] that won't get put back by puppet or anything?
[18:24:14] I think it's the repair operations that were run today that caused them to be owned by root
[18:24:27] ah got it
[18:24:45] !log rzl@cumin1002:~$ sudo cumin O:logging::opensearch::collector 'chown -R opensearch-dashboards: /usr/share/opensearch-dashboards/plugins/phatality'
[18:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:24] (CR) Scott French: "Thanks for the review, Riccardo!" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:27:39] !log dancy@deploy1003 Started deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836
[18:27:43] !log dancy@deploy1003 Finished deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 (duration: 00m 04s)
[18:27:51] !log dancy@deploy1003 Started deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836
[18:27:57] !log dancy@deploy1003 Finished deploy [releng/phatality@c2cb594]: Deploying https://gerrit.wikimedia.org/r/c/releng/phatality/+/1071836 (duration: 00m 06s)
[18:28:06] SRE, SRE-tools, Infrastructure-Foundations: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989#10150478 (CDanis) Today we saw another good use case for `sudo_pair`: while troubleshooting and firefighting a #phatality deploy gone wrong (T374880), several...
[18:28:19] OK.. New stuff should be fully deployed now.. No errors. Thanks a lot rzl and herron!
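Editor's note: the gerrit1004 cleanup discussed above came down to running both `puppet node clean` and `puppet node deactivate` for the FQDN, and doing so on both the puppetmaster and the puppetserver, mirroring what the decommission cookbook does via spicerack's `puppet_master.delete(fqdn)` / `puppet_server.delete(fqdn)`. A minimal sketch of that command set; the backend names and helper function here are illustrative, not cookbook code.

```python
# Sketch of the two-command, two-backend puppet node removal described
# in the conversation above. "clean" drops the cert and stored data;
# "deactivate" tells puppetdb to stop reporting the node. Backend names
# are illustrative placeholders.
def node_delete_commands(fqdn):
    """Return the per-backend command lists needed to fully drop a node."""
    cmds = {}
    for backend in ("puppetmaster", "puppetserver"):
        cmds[backend] = [
            ["puppet", "node", "clean", fqdn],
            ["puppet", "node", "deactivate", fqdn],
        ]
    return cmds

cmds = node_delete_commands("gerrit1004.wikimedia.org")
```

As the log shows, running only "clean" leaves the node visible in puppetdb (and hence in Icinga); the "deactivate" step is what finally removes it.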
[18:28:33] sweet
[18:28:39] thanks for the quick response dancy
[18:29:08] (CR) CI reject: [V:-1] sre.switchdc.databases: new cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[18:30:12] (PS3) Dzahn: durum: include throttling class, enable it on durum2001, accept/log only [puppet] - https://gerrit.wikimedia.org/r/1059156
[18:30:23] (CR) CI reject: [V:-1] durum: include throttling class, enable it on durum2001, accept/log only [puppet] - https://gerrit.wikimedia.org/r/1059156 (owner: Dzahn)
[18:31:07] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10150492 (phaultfinder)
[18:31:50] (PS1) Jdlrobson: Deploy Vector 2022 on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255)
[18:32:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255) (owner: Jdlrobson)
[18:33:31] (PS4) Dzahn: durum: include throttling class, enable it on durum2001, accept/log only [puppet] - https://gerrit.wikimedia.org/r/1059156
[18:36:13] (CR) Ssingh: "Ah thank you. Yeah, I guess we should update this, given:" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:41:38] (CR) Ssingh: "So I think this is what it should look like (volans is in CC and can comment if he thinks this make sense):" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[18:42:32] (PS2) Jdlrobson: Deploy Vector 2022 on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255)
[18:47:32] rzl: dancy: on -observability channel there were alerts for the dashboards / logstash hosts. some resolved, some not yet
[18:49:06] (PS1) Jdlrobson: Disable quick surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743)
[18:49:22] mutante: they all look resolved to me, which ones do you see still open?
[18:49:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743) (owner: Jdlrobson)
[18:49:53] SRE, SRE-Access-Requests, LDAP-Access-Requests: Vacation coverage for Katie Francis (route NDA requests to Rachel until September 27) - https://phabricator.wikimedia.org/T374673#10150557 (Dzahn)
[18:51:35] rzl: ehm, no you are right, they all resolved now. I was just confused by the order of alerts and the 159 :p other active alerts :)
[18:51:43] 👍
[19:02:46] (CR) Ssingh: sre.dns.admin: add guardrails for depool of sites/resources (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1064042 (owner: Ssingh)
[19:03:00] (PS4) Ssingh: sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - https://gerrit.wikimedia.org/r/1064042
[19:10:15] SRE, SRE-Access-Requests, LDAP-Access-Requests: Vacation coverage for Katie Francis (route NDA requests to Rachel until September 27) - https://phabricator.wikimedia.org/T374673#10150590 (Dzahn)
[19:10:28] SRE, SRE-Access-Requests, LDAP-Access-Requests: Vacation coverage for Katie Francis (route NDA requests to Rachel until September 30) - https://phabricator.wikimedia.org/T374673#10150591 (Dzahn)
[19:14:21] (PS1) JHathaway: k8s::kubelet: fix deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1073281
[19:15:59] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1073281 (owner: JHathaway)
[19:18:15] (PS1) AOkoth: vrts: change primary host [puppet] - https://gerrit.wikimedia.org/r/1073283 (https://phabricator.wikimedia.org/T373420)
[19:29:14] (PS3) Jdlrobson: Deploy Vector 2022 on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255)
[19:29:14] (PS2) Jdlrobson: Disable quick surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743)
[19:32:41] (PS2) JHathaway: k8s::kubelet: fix deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1073281
[19:32:48] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1073281 (owner: JHathaway)
[19:35:05] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10150678 (phaultfinder)
[19:53:13] (PS1) JHathaway: puppet8: enable strict mode [puppet] - https://gerrit.wikimedia.org/r/1073284 (https://phabricator.wikimedia.org/T372664)
[19:53:29] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1073284 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T2000). nyaa~
[20:00:04] Krinkle and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:09] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10150757 (phaultfinder)
[20:02:30] o/
[20:04:48] (PS4) Jdlrobson: Deploy Vector 2022 on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255)
[20:04:48] (PS3) Jdlrobson: Disable quick surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743)
[20:06:49] Go ahead, I might do mine later. need to be afk for a bit
[20:14:39] We're gonna do a quick deploy of Jdlrobson's two patches
[20:15:17] (CR) TrainBranchBot: [C:+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255) (owner: Jdlrobson)
[20:15:18] (CR) TrainBranchBot: [C:+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743) (owner: Jdlrobson)
[20:16:02] (Merged) jenkins-bot: Deploy Vector 2022 on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1073277 (https://phabricator.wikimedia.org/T374255) (owner: Jdlrobson)
[20:16:06] (Merged) jenkins-bot: Disable quick surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/1073279 (https://phabricator.wikimedia.org/T374743) (owner: Jdlrobson)
[20:16:19] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1073277|Deploy Vector 2022 on small wikis (T374255)]], [[gerrit:1073279|Disable quick surveys (T374743)]]
[20:16:24] T374255: Deploy Vector 2022 on small wikis - https://phabricator.wikimedia.org/T374255
[20:16:24] T374743: Disable quick surveys for experiments - https://phabricator.wikimedia.org/T374743
[20:17:06] I'm eating some really good leftover sushi - it's a spicy tuna roll with albacore on top and a bit of garlic butter
[20:17:20] In case anyone was wondering
[20:18:14] haha
[20:18:19] Sounds great
[20:18:34] !log toyofuku@deploy1003 jdlrobson, toyofuku: Backport for [[gerrit:1073277|Deploy Vector 2022 on small wikis (T374255)]], [[gerrit:1073279|Disable quick surveys (T374743)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:18:43] Jdlrobson: ready for you to test!
[20:19:10] on it
[20:20:18] @toyofuku good to sync!
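Editor's note: the backport window above follows a fixed sequence visible in the log: `scap backport` approves the changes, jenkins-bot merges them, the config is synced to the mwdebug test servers, the deployer confirms after manual testing ("good to sync"), and only then does the full sync-world run. A simplified ordering sketch; the step names are illustrative labels, not scap internals.

```python
# Simplified ordering of a scap backport deploy, as seen in the log.
# Step names are illustrative, not scap's actual state machine.
BACKPORT_STEPS = [
    "approve",           # TrainBranchBot leaves C:+2 "Approved by ... using scap backport"
    "ci-merge",          # jenkins-bot merges the change
    "sync-testservers",  # change reaches the mwdebug hosts for verification
    "confirm",           # deployer tests and replies "good to sync"
    "sync-world",        # full production sync
]

def next_step(current):
    """Return the step after `current`, or None once sync-world is done."""
    i = BACKPORT_STEPS.index(current)
    return BACKPORT_STEPS[i + 1] if i + 1 < len(BACKPORT_STEPS) else None
```

The testserver checkpoint is the safety gate: nothing reaches all of production until a human has exercised the change on mwdebug.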
[20:20:23] yeet
[20:20:26] !log toyofuku@deploy1003 jdlrobson, toyofuku: Continuing with sync
[20:21:58] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:24:51] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:25:06] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10150822 (phaultfinder)
[20:25:12] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073277|Deploy Vector 2022 on small wikis (T374255)]], [[gerrit:1073279|Disable quick surveys (T374743)]] (duration: 08m 53s)
[20:25:18] T374255: Deploy Vector 2022 on small wikis - https://phabricator.wikimedia.org/T374255
[20:25:18] T374743: Disable quick surveys for experiments - https://phabricator.wikimedia.org/T374743
[20:25:23] All done!
[20:25:24] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:25:26] Thank you everyone
[20:27:21] toyofuku: thank you for the garlic butter inspiration, that never would have occurred to me
[20:28:38] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add frack new switches - pt1979@cumin2002"
[20:28:53] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10150834 (Papaul)
[20:28:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add frack new switches - pt1979@cumin2002"
[20:28:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:42:02] (PS2) JHathaway: puppet8: enable strict mode [puppet] - https://gerrit.wikimedia.org/r/1073284 (https://phabricator.wikimedia.org/T372664)
[20:42:21] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:42:22] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1073284 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[20:43:07] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10150882 (phaultfinder)
[20:47:30] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add frack new switches - pt1979@cumin2002"
[20:47:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add frack new switches - pt1979@cumin2002"
[20:47:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:50:09] (CR) JHathaway: [C:+2] puppet8: enable strict mode [puppet] - https://gerrit.wikimedia.org/r/1073284 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[20:53:24] !log reloading puppetserver to enable strict mode
[20:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[20:59:58] (PS1) Btullis: Move the misc_crons dumper role from snapshot1017 to snapshot1016 [puppet] - https://gerrit.wikimedia.org/r/1073289 (https://phabricator.wikimedia.org/T366555)
[21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240916T2100).
[21:01:20] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4000/co" [puppet] - https://gerrit.wikimedia.org/r/1073289 (https://phabricator.wikimedia.org/T366555) (owner: Btullis)
[21:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:05:29] (CR) Btullis: [V:+1] "I haven't tried this technique of role switching before, but I'm hoping it will allow us to reboot snapshot1017 without interrupting the m" [puppet] - https://gerrit.wikimedia.org/r/1073289 (https://phabricator.wikimedia.org/T366555) (owner: Btullis)
[21:09:46] (PS1) CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891)
[21:10:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:19:09] (CR) Scott French: "Thanks for the pointer to where something similar has been done elsewhere, @ssingh@wikimedia.org!" [cookbooks] - https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[21:23:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:24:33] (PS1) Scott French: sre.switchdc.mediawiki: show TTL sleep end time [cookbooks] - https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047)
[21:28:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:34:18] (PS2) Scott French: sre.switchdc.mediawiki: show TTL sleep end time [cookbooks] - https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047)
[21:38:13] (CR) Scott French: "@effie@wikimedia.org, this is follow up from our discussion during the live-test earlier today." [cookbooks] - https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[21:47:19] (CR) CI reject: [V:-1] sre.switchdc.mediawiki: show TTL sleep end time [cookbooks] - https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047) (owner: Scott French)
[21:55:05] (PS1) JHathaway: mydumper: rename metaparam [puppet] - https://gerrit.wikimedia.org/r/1073292
[21:57:06] SRE, Infrastructure-Foundations, Puppet-Infrastructure: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10151015 (jhathaway) Open→Resolved a: jhathaway enabled in production, closing
[21:57:34] (CR) CI reject: [V:-1] mydumper: rename metaparam [puppet] - https://gerrit.wikimedia.org/r/1073292 (owner: JHathaway)
[21:57:57] (PS3) Scott French: sre.switchdc.mediawiki: show TTL sleep end time [cookbooks] - https://gerrit.wikimedia.org/r/1073291 (https://phabricator.wikimedia.org/T374047)
[22:30:05] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10151056 (phaultfinder)
[22:49:46] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[22:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:06] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10151170 (phaultfinder)
[23:16:24] (PS1) Stoyofuku-wmf: Deploy new donate link location to pilot wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1073297 (https://phabricator.wikimedia.org/T373585)
[23:16:49] (PS1) Stoyofuku-wmf: Deploy new donate link location to pilot wikis (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1073297 (https://phabricator.wikimedia.org/T373585)
[23:17:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073297 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[23:25:13] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10151188 (phaultfinder)
[23:27:13] (CR) Bartosz Dziewoński: Improve $wgFooterIcons override, remove $wmgWikimediaIcon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1071712 (owner: Bartosz Dziewoński)
[23:27:19] (PS3) Bartosz Dziewoński: Improve $wgFooterIcons override, simplify $wmgWikimediaIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/1071712
[23:38:22] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1073300
[23:38:22] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1073300 (owner: TrainBranchBot)
[23:58:23] ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T374897 (phaultfinder) NEW