[00:10:21] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1073570 (owner: 10TrainBranchBot)
[00:11:29] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on gitlab2002.wikimedia.org with reason: version upgrade
[00:11:42] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on gitlab2002.wikimedia.org with reason: version upgrade
[00:54:51] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:54:57] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[00:55:09] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:24:23] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1003 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[01:25:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10155686 (10phaultfinder)
[01:53:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10155692 (10Papaul) I think replacing the pfw first will be a good idea since we are not changing any configuration on them but just the name and less...
[01:56:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10155693 (10Papaul)
[01:57:16] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10155694 (10Papaul)
[02:13:40] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10155696 (10Papaul) I update the diagram again since we will not be using VC.   {F57520229}
[02:16:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10155698 (10Papaul) While working on setting up the new fasw2-c8-codfw I realized that  fpc0 has interface ge-0/0/47 connected to fmsw-c8-codfw...
[02:43:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:13:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:20:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:54:57] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:12:09] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1073587 (https://phabricator.wikimedia.org/T375047)
[05:13:45] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:13:51] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:58] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s6 T375047
[05:24:03] <stashbot>	 T375047: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T375047
[05:24:39] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s6 T375047
[05:24:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2214 with weight 0 T375047', diff saved to https://phabricator.wikimedia.org/P69240 and previous config saved to /var/cache/conftool/dbconfig/20240918-052446-arnaudb.json
[05:29:21] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1073587 (https://phabricator.wikimedia.org/T375047) (owner: 10Gerrit maintenance bot)
[05:30:34] <arnaudb>	 !log Starting s6 codfw failover from db2129 to db2214 - T375047
[05:30:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:38] <stashbot>	 T375047: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T375047
[05:31:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2214 to s6 primary T375047', diff saved to https://phabricator.wikimedia.org/P69241 and previous config saved to /var/cache/conftool/dbconfig/20240918-053115-arnaudb.json
[05:33:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T375047', diff saved to https://phabricator.wikimedia.org/P69242 and previous config saved to /var/cache/conftool/dbconfig/20240918-053357-arnaudb.json
[05:36:21] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s4 T374804
[05:36:25] <stashbot>	 T374804: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T374804
[05:36:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2179 with weight 0 T374804', diff saved to https://phabricator.wikimedia.org/P69243 and previous config saved to /var/cache/conftool/dbconfig/20240918-053633-arnaudb.json
[05:36:37] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:37:13] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T374804
[05:38:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2179 from API/vslow/dump T374804', diff saved to https://phabricator.wikimedia.org/P69244 and previous config saved to /var/cache/conftool/dbconfig/20240918-053807-arnaudb.json
[05:39:45] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:39:49] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:42:45] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52631 bytes in 9.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:42:45] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 6.731 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:43:13] <arnaudb>	 !log Starting s4 codfw failover from db2140 to db2179 - T374804
[05:43:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:17] <stashbot>	 T374804: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T374804
[05:43:31] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1073035 (https://phabricator.wikimedia.org/T374804) (owner: 10Gerrit maintenance bot)
[05:45:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2179 to s4 primary T374804', diff saved to https://phabricator.wikimedia.org/P69245 and previous config saved to /var/cache/conftool/dbconfig/20240918-054515-arnaudb.json
[05:47:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T374804', diff saved to https://phabricator.wikimedia.org/P69246 and previous config saved to /var/cache/conftool/dbconfig/20240918-054729-arnaudb.json
[05:48:38] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T374807
[05:48:42] <stashbot>	 T374807: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T374807
[05:49:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2220 with weight 0 T374807', diff saved to https://phabricator.wikimedia.org/P69247 and previous config saved to /var/cache/conftool/dbconfig/20240918-054909-arnaudb.json
[05:49:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2220 from API/vslow/dump T374807', diff saved to https://phabricator.wikimedia.org/P69248 and previous config saved to /var/cache/conftool/dbconfig/20240918-054921-arnaudb.json
[05:49:22] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T374807
[06:01:38] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2220 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1073039 (https://phabricator.wikimedia.org/T374807) (owner: 10Gerrit maintenance bot)
[06:02:37] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:02:53] <arnaudb>	 !log Starting s7 codfw failover from db2218 to db2220 - T374807
[06:02:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:57] <stashbot>	 T374807: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T374807
[06:03:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2220 to s7 primary T374807', diff saved to https://phabricator.wikimedia.org/P69249 and previous config saved to /var/cache/conftool/dbconfig/20240918-060332-arnaudb.json
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T374807', diff saved to https://phabricator.wikimedia.org/P69250 and previous config saved to /var/cache/conftool/dbconfig/20240918-060549-arnaudb.json
[06:07:10] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1073699 (https://phabricator.wikimedia.org/T375050)
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:12:39] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10155887 (10ABran-WMF) all needed switchover prior to tonight have been done. I'll run  T375050 as soon as this is done because circular r...
[06:39:14] <moritzm>	 !log installing curl security updates
[06:39:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:03] <wikibugs>	 (03PS6) 10Gmodena: ds8-k8s-service: add values for dumps2 job. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787)
[06:44:47] <wikibugs>	 (03CR) 10Gmodena: ds8-k8s-service: add values for dumps2 job. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena)
[06:50:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch the deployment role to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073704 (https://phabricator.wikimedia.org/T349619)
[06:53:33] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS7195/IPv4: Connect - EdgeUno, AS7195/IPv6: Connect - EdgeUno https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:07:27] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 213, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:27] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "See my comment at T359795#10148316  , the manual `update-alternatives` would be overridden by the next Puppet run.  But overall I think it" [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[07:12:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[07:12:43] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10155908 (10Vgutierrez) Answering here @RobH question: >Hey I made some assumptions on the cp hosts troubleshooting but should check with you: Those hosts are under the same weight conditions as al...
[07:13:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:17:33] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10155921 (10MoritzMuehlenhoff)
[07:29:41] <icinga-wm>	 RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 645, down: 83, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:30:21] <wikibugs>	 (03CR) 10Muehlenhoff: "profile::java takes care of setting the alternative as well, since L32 in "class java", the default JRE/JDK is the first Java version defi" [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[07:31:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10155956 (10MoritzMuehlenhoff)
[07:32:41] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS7195/IPv4: Connect - EdgeUno, AS7195/IPv6: Connect - EdgeUno https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:33:04] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.dns.netbox
[07:35:39] <wikibugs>	 (03CR) 10Hashar: [C:04-1] "There is `class { 'httpd': }` defined above which does an `ensure_packages('apache2')` and should thus install the `apache2` package befor" [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn)
[07:35:48] <wikibugs>	 (03CR) 10DCausse: flink-app: customize calico label selector (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[07:37:29] <icinga-wm>	 RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 680, down: 48, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:37:45] <elukey>	 jouncebot: next
[07:37:45] <jouncebot>	 In 0 hour(s) and 22 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T0800)
[07:38:16] <wikibugs>	 (03CR) 10DCausse: [C:03+1] Add a private variant of the cirrus update stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073565 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson)
[07:39:38] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10155971 (10cmooney) >>! In T373104#10147494, @Jelto wrote: > `gitlab-runner2004` is a special purpose runner, so if we depool the runner...
[07:40:04] <wikibugs>	 (03CR) 10Brouberol: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: 10Brouberol)
[07:40:48] <wikibugs>	 (03CR) 10Elukey: "recheck" [debs/helm3] - 10https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey)
[07:40:57] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] debian: update the target distribution to bookworm-wikimedia [debs/helm3] - 10https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey)
[07:42:01] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10155980 (10elukey) 05Open→03Resolved
[07:43:35] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fixed asset tag for db1179 - volans@cumin1002"
[07:44:03] <wikibugs>	 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10155983 (10elukey) I had a chat with Filippo, the keyholder-proxy is not the daemon that needs re-arming when restarted, so it can be done anytime withou...
[07:45:09] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fixed asset tag for db1179 - volans@cumin1002"
[07:45:09] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:48:35] <wikibugs>	 (03PS1) 10Volans: netbox: notify dcops for uncommitted DNS changes [puppet] - 10https://gerrit.wikimedia.org/r/1073732
[07:49:23] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1003 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[07:53:50] <wikibugs>	 (03PS2) 10Brouberol: cloudnative-pg-cluster: set sane defaults values for PG clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073392 (https://phabricator.wikimedia.org/T372278)
[07:55:36] <wikibugs>	 (03CR) 10DCausse: [C:03+1] "we might need to change the wikidata maxlag propagation bits as well (https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikidata.org/+/0" [puppet] - 10https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper)
[07:56:37] <wikibugs>	 (03CR) 10DCausse: [C:03+1] wdqs max lag: target specific port [alerts] - 10https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper)
[07:56:40] <wikibugs>	 (03PS1) 10Elukey: role::puppetserver: add admin groups config [puppet] - 10https://gerrit.wikimedia.org/r/1073733 (https://phabricator.wikimedia.org/T368023)
[07:58:25] <wikibugs>	 (03CR) 10Elukey: [C:04-1] "sigh https://puppet-compiler.wmflabs.org/output/1073733/4009/puppetserver1001.eqiad.wmnet/change.puppetserver1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1073733 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[08:00:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10155999 (10cmooney) 05Open→03Resolved a:03cmooney
[08:00:49] <jnuche>	 good morning, train will rollout in a few minutes
[08:06:45] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073734 (https://phabricator.wikimedia.org/T373642)
[08:06:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073734 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot)
[08:07:33] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073734 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot)
[08:08:23] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] netbox: notify dcops for uncommitted DNS changes [puppet] - 10https://gerrit.wikimedia.org/r/1073732 (owner: 10Volans)
[08:09:42] <wikibugs>	 (03PS2) 10Stevemunene: hdfs: Assign the worker role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788)
[08:09:48] <wikibugs>	 (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene)
[08:11:04] <wikibugs>	 (03CR) 10Muehlenhoff: role::puppetserver: add admin groups config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073733 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[08:11:50] <wikibugs>	 (03CR) 10Elukey: [C:04-1] "So profile::puppetserver::git defines sudo::user, that in turn creates /etc/sudoers.d. The same file is created by profile::admins -> sudo" [puppet] - 10https://gerrit.wikimedia.org/r/1073733 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[08:12:34] <wikibugs>	 (03CR) 10Muehlenhoff: role::puppetserver: add admin groups config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073733 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[08:13:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] corto: force directory removal [puppet] - 10https://gerrit.wikimedia.org/r/1073412 (owner: 10Filippo Giunchedi)
[08:14:49] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: remove old poolcounter netpolicies for Thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073164 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[08:14:55] <wikibugs>	 (03CR) 10Brouberol: ds8-k8s-service: add values for dumps2 job. (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena)
[08:15:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:55] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.23  refs T373642
[08:15:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2017.codfw.wmnet
[08:16:02] <stashbot>	 T373642: 1.43.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T373642
[08:16:15] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10156095 (10ops-monitoring-bot) Draining ganeti2017.codfw.wmnet of running VMs
[08:17:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073503 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[08:18:01] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073392 (https://phabricator.wikimedia.org/T372278) (owner: 10Brouberol)
[08:18:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073502 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[08:18:57] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073464 (owner: 10Brouberol)
[08:20:43] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:20:44] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:20:46] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:20:46] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:21:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2017.codfw.wmnet
[08:21:44] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:21:44] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:22:02] <jnuche>	 train needs to be rolled back
[08:22:11] <elukey>	 :(
[08:22:38] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073736 (https://phabricator.wikimedia.org/T373642)
[08:22:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073736 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot)
[08:22:47] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:23:24] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073736 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot)
[08:23:31] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073445 (https://phabricator.wikimedia.org/T372281) (owner: 10Brouberol)
[08:23:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:23:47] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:23:56] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: ensure each airflow release store logs to a unique s3 bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073445 (https://phabricator.wikimedia.org/T372281) (owner: 10Brouberol)
[08:24:15] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] cloudnative-pg: grant the deploy user the ability to create manual backups [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073464 (owner: 10Brouberol)
[08:24:43] <wikibugs>	 (03CR) 10Hashar: [C:04-1] "And in Puppet state files, the `apache2` install is ordered after `/var/www/robots.txt`:" [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn)
[08:24:43] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:24:44] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:24:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:24:47] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:25:18] <wikibugs>	 (03CR) 10Hashar: [C:03+1] Switch the deployment role to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073704 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[08:25:21] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:25:30] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:25:43] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:26:43] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:26:44] <wikibugs>	 (03CR) 10Hashar: [C:03+1] deployment servers: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1072744 (owner: 10Muehlenhoff)
[08:28:04] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "Once applied, I can do a dummy deployment on a simple repository such as `integration/docroot` to validate everything still works :)" [puppet] - 10https://gerrit.wikimedia.org/r/1072744 (owner: 10Muehlenhoff)
[08:28:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch the deployment role to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073704 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[08:28:37] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:29:43] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:30:23] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.23  refs T373642
[08:30:27] <stashbot>	 T373642: 1.43.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T373642
[08:30:44] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:31:43] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1039 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:31:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:32:13] <elukey>	 !log install openjdk-17-jdk on puppetserver1002 to get some useful tools like jmap - T373527
[08:32:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:17] <stashbot>	 T373527: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527
[08:32:54] <wikibugs>	 (03CR) 10Btullis: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: 10Brouberol)
[08:33:24] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:33:24] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073392 (https://phabricator.wikimedia.org/T372278) (owner: 10Brouberol)
[08:33:44] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:33:48] <wikibugs>	 (03CR) 10Btullis: [C:03+1] cloudnative-pg: grant the deploy user the ability to create manual backups [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073464 (owner: 10Brouberol)
[08:33:51] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10156154 (10MoritzMuehlenhoff)
[08:34:03] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: set sane defaults values for PG clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073392 (https://phabricator.wikimedia.org/T372278) (owner: 10Brouberol)
[08:35:38] <wikibugs>	 (03CR) 10Brouberol: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: 10Brouberol)
[08:35:43] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:35:44] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1031 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:36:34] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060 (10Cyndymediawiksim) 03NEW
[08:38:47] <wikibugs>	 (03PS4) 10Brouberol: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938)
[08:39:00] <wikibugs>	 (03CR) 10Brouberol: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: 10Brouberol)
[08:41:40] <tappof>	 !log centrallog2002 upgrade to bookworm in progress https://phabricator.wikimedia.org/T353912
[08:41:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:44] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1043 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:43:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:43:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:43:45] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:45:44] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1048 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:45:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:46:43] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:46:47] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:46:47] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:46:48] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:47:44] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:47:45] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:47:47] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:48:43] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:48:44] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:48:44] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:48:45] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:49:47] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:50:29] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10156237 (10DMburugu) I approve this
[08:50:37] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:50:38] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:50:45] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:50:47] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:50:48] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:50:48] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:44] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:44] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:44] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:45] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:46] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:47] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:48] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:49] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:50] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:51] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:52] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:53] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:54] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:55] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:56] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:57] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:58] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:51:59] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:52:00] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:52:44] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:52:45] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:52:45] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:52:50] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1065 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:52:51] <wikibugs>	 (03PS1) 10Sergio Gimeno: GrowthExperiments: enable Community Updates module in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073739 (https://phabricator.wikimedia.org/T374577)
[08:53:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:53:45] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:53:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:53:50] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1050 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:53:50] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:53:51] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1053 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:53:52] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:54:26] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:54:29] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:54:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:54:50] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:54:51] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1049 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:54:57] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:55:12] <wikibugs>	 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10156264 (10Jelto) In `wikimedia-gitlab`, there have been some reports of failing jobs (cc...
[08:55:44] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:45] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:45] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:46] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:47] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:48] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:49] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:50] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:51] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:52] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:53] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:54] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:55:55] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:27] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:37] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:45] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1059 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1059 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:56] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "There are some outstanding issues but no blocker. The wait time, however, should be much smaller than 50 seconds. The original 10 seconds " [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[08:57:15] <wikibugs>	 (03Abandoned) 10Hashar: Read closed-labs as closed tag on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332940 (https://phabricator.wikimedia.org/T115584) (owner: 10Alex Monk)
[08:59:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:59:46] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1056 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:44] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1035 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:45] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1033 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:46] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:47] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1037 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:49] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:50] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1051 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:44] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1039 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:46] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:47] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1042 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:48] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:49] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:52] <moritzm>	 !log drain ganeti2026 T373104
[09:01:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:56] <stashbot>	 T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104
[09:02:43] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:02:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:02:46] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:02:46] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:03:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:03:45] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1044 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:03:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:03:46] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1041 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:04:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mtail in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:05:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:05:46] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1048 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:05:49] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:05:50] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1052 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:06:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:06:45] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1058 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:06:46] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1054 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:06:49] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:07:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:07:46] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1061 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:13] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job mtail in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:09:26] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:29] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:37] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:38] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:10:48] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4010/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073740 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[09:10:50] <wikibugs>	 (03Abandoned) 10Hashar: Increase the url shortener url size limit from 2k to 5k [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617843 (https://phabricator.wikimedia.org/T220703) (owner: 10Ladsgroup)
[09:11:16] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
[09:11:37] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye
[09:13:37] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:13:57] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:13:58] <wikibugs>	 (03Abandoned) 10Hashar: Enable ORES on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489936 (https://phabricator.wikimedia.org/T215354) (owner: 10Catrope)
[09:14:01] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:14:01] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:15:06] <wikibugs>	 (03Merged) 10jenkins-bot: sre.switchdc.databases: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[09:18:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 42s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:19:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:20:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10156310 (10phaultfinder)
[09:20:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:23:16] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 26s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:24:13] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job mtail in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:25:57] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 376, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:26:01] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:26:01] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:26:27] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
[09:26:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10156328 (10elukey) I tried to generate a heap dump with jmap but it is very large and I'd need to copy it to my local laptop to inspect it via VisualVM. There is...
[09:26:36] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 295, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:28:16] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 20s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:30:07] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: 10Brouberol)
[09:30:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 16.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:30:32] <wikibugs>	 (03CR) 10Gmodena: changeprop: Enable PCS pregeneration without restbase (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos)
[09:32:02] <wikibugs>	 (03PS1) 10David Caro: prometheus::cloud: increase ceph scrape timeout [puppet] - 10https://gerrit.wikimedia.org/r/1073744
[09:32:21] <elukey>	 jouncebot: next
[09:32:21] <jouncebot>	 In 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1000)
[09:33:22] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1073744 (owner: 10David Caro)
[09:34:47] <wikibugs>	 (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4011/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073744 (owner: 10David Caro)
[09:35:28] <wikibugs>	 (03CR) 10David Caro: [V:03+1 C:03+2] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1073744 (owner: 10David Caro)
[09:35:49] <Dreamy_Jazz>	 dhinus: Sorry for the slow reply. That is the associated puppet patch.
[09:37:18] <wikibugs>	 (03Abandoned) 10Hashar: Demo: how group permissions could look like [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738992 (owner: 10Ppchelko)
[09:37:30] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#10156356 (10Volans) p:05Triage→03Medium Thanks for the task. I think the main decision to make is how fresh the data needs to be. If we opt f...
[09:38:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:13] <wikibugs>	 (03CR) 10DCausse: [C:03+1] wdqs max lag: target specific port (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper)
[09:41:51] <wikibugs>	 (03PS7) 10Gmodena: ds8-k8s-service: add values for dumps2 job. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787)
[09:42:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 37.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:42:43] <wikibugs>	 (03CR) 10Gmodena: ds8-k8s-service: add values for dumps2 job. (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena)
[09:43:09] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: 10Brouberol)
[09:43:43] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:44:10] <wikibugs>	 (03Merged) 10jenkins-bot: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: 10Brouberol)
[09:44:15] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:44:19] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:46:01] <wikibugs>	 06SRE, 10iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10156378 (10jijiki) @Dreamy_Jazz I see these are SQL connection timeouts. While I dig into it, could you please let us know if that is impacting the iPoid (eg error rates, latency, or the sch...
[09:46:15] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:46:19] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:46:31] <dhinus>	 Dreamy_Jazz: the patch looks good to me and I can deploy it after it gets merged, but I'd like to have a review from Data Engineering & Data Persistence as well. leave it with me, I'll ping some people
[09:46:43] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:46:51] <Dreamy_Jazz>	 Thanks.
[09:47:07] <Dreamy_Jazz>	 dhinus: Is there any way to test it before it gets merged? I couldn't see a way to do that easily.
[09:47:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 35s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:47:47] <Dreamy_Jazz>	 I can use the components from the view to make an SQL query, but I'm not sure that is properly testing the change.
[09:48:12] <wikibugs>	 (03CR) 10Elukey: [C:03+1] netbox: notify dcops for uncommitted DNS changes [puppet] - 10https://gerrit.wikimedia.org/r/1073732 (owner: 10Volans)
[09:48:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [software/bitu] - 10https://gerrit.wikimedia.org/r/1073238 (owner: 10Slyngshede)
[09:48:55] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073732 (owner: 10Volans)
[09:50:45] <wikibugs>	 (03PS5) 10Slyngshede: Notify managers via email when new permission requests are made. [software/bitu] - 10https://gerrit.wikimedia.org/r/1073238
[09:50:55] <wikibugs>	 (03CR) 10Slyngshede: Notify managers via email when new permission requests are made. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1073238 (owner: 10Slyngshede)
[09:50:59] <wikibugs>	 (03CR) 10Elukey: [C:03+1] icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438 (owner: 10Tiziano Fogli)
[09:51:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1073532 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway)
[09:52:25] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2124.codfw.wmnet - https://phabricator.wikimedia.org/T374847#10156389 (10ABran-WMF) a:05ABran-WMF→03None
[09:52:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1073238 (owner: 10Slyngshede)
[09:52:29] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Notify managers via email when new permission requests are made. [software/bitu] - 10https://gerrit.wikimedia.org/r/1073238 (owner: 10Slyngshede)
[09:52:40] <wikibugs>	 (03PS2) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486)
[09:52:47] <dhinus>	 Dreamy_Jazz: you can run the view query manually on a wikireplica, e.g. in quarry. not a complete test but it will catch any obvious errors in the view definition
[09:53:15] <dhinus>	 Dreamy_Jazz: yes what you wrote basically, I started typing before reading your message :)
[09:53:32] <elukey>	 jnuche: o/ I am going to scap backport a mw-config change, just wanted to double check with you if it is ok 
[09:53:49] <Dreamy_Jazz>	 Cool. I did that on production and saw that I missed something
[09:53:56] <Dreamy_Jazz>	 Updated the patch to fix that.
[09:54:55] <wikibugs>	 (03Merged) 10jenkins-bot: Notify managers via email when new permission requests are made. [software/bitu] - 10https://gerrit.wikimedia.org/r/1073238 (owner: 10Slyngshede)
[09:55:24] <Dreamy_Jazz>	 Tested the query again and it seems to be working now. Thanks for the advice.
[09:55:27] <dhinus>	 Dreamy_Jazz: great. pcc/test experimental is not useful here, so I think that test is all we can do, plus checking with some other db experts
[09:55:46] <Dreamy_Jazz>	 Sure. Would it be helpful to get a review from someone else on my team?
[09:56:15] <dhinus>	 one more pair of eyes won't hurt :)
[09:56:50] <wikibugs>	 (03CR) 10Volans: [C:03+2] netbox: notify dcops for uncommitted DNS changes [puppet] - 10https://gerrit.wikimedia.org/r/1073732 (owner: 10Volans)
[09:57:55] <wikibugs>	 06SRE: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066 (10fgiunchedi) 03NEW
[09:57:58] <wikibugs>	 (03PS1) 10Effie Mouzeli: ipoid: Set activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073748 (https://phabricator.wikimedia.org/T374414)
[09:58:10] <wikibugs>	 (03PS1) 10Tiziano Fogli: grafana: cluster name misc to grafana [puppet] - 10https://gerrit.wikimedia.org/r/1073749 (https://phabricator.wikimedia.org/T375066)
[09:58:26] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: ipoid: Set activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan)
[09:59:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 40s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:59:24] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "LGTM, but I'd like a +1 from Data Persistence and Data Engineering too. According to https://wikitech.wikimedia.org/wiki/Portal:Data_Servi" [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1000)
[10:02:05] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] [WikiReplicas] Hide autoblock targets in the globalblocks table [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz)
[10:02:23] <wikibugs>	 (03PS1) 10Volans: re.switchdc.databases.prepare: reduce wait time [cookbooks] - 10https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351)
[10:02:30] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Check that throttling exceptions use valid public IP addresses (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE))
[10:02:44] <wikibugs>	 (03PS6) 10Lucas Werkmeister (WMDE): Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980)
[10:03:12] <jnuche>	 elukey: yeah, ok from my side
[10:03:32] <elukey>	 super thanks!
[10:04:53] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Check that throttling exceptions use valid public IP addresses (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE))
[10:05:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 10.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:05:44] <wikibugs>	 (03PS1) 10Elukey: role::puppetserver: set the maximum number of instances [puppet] - 10https://gerrit.wikimedia.org/r/1073751 (https://phabricator.wikimedia.org/T373527)
[10:05:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by elukey@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[10:06:39] <wikibugs>	 (03Merged) 10jenkins-bot: Swap poolcounter2004 with poolcounter2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[10:06:41] <wikibugs>	 (03PS2) 10Effie Mouzeli: ipoid: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073443 (https://phabricator.wikimedia.org/T356885)
[10:06:58] <wikibugs>	 (03PS2) 10Effie Mouzeli: ipoid: Set activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073748 (https://phabricator.wikimedia.org/T374414)
[10:07:14] <wikibugs>	 (03CR) 10Arnaudb: "totally optional comments, lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[10:07:16] <logmsgbot>	 !log elukey@deploy1003 Started scap sync-world: Backport for [[gerrit:1073427|Swap poolcounter2004 with poolcounter2006 (T332015)]]
[10:07:20] <stashbot>	 T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015
[10:08:28] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE))
[10:08:42] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE))
[10:08:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/Wikibase] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073479 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE))
[10:09:33] <wikibugs>	 (03CR) 10Hnowlan: changeprop: Enable PCS pregeneration without restbase (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos)
[10:09:34] <logmsgbot>	 !log elukey@deploy1003 elukey: Backport for [[gerrit:1073427|Swap poolcounter2004 with poolcounter2006 (T332015)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:09:40] <logmsgbot>	 !log elukey@deploy1003 elukey: Continuing with sync
[10:09:58] <wikibugs>	 06SRE, 10iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10156425 (10Dreamy_Jazz) >>! In T375006#10156378, @jijiki wrote: > @Dreamy_Jazz I see these are SQL connection timeouts. While I dig into it, could you please let us know if that is impacting...
[10:10:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:13:21] <wikibugs>	 (03PS2) 10Volans: re.switchdc.databases.prepare: reduce wait time [cookbooks] - 10https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351)
[10:14:01] <wikibugs>	 06SRE, 13Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10156434 (10fgiunchedi)
[10:14:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] grafana: cluster name misc to grafana [puppet] - 10https://gerrit.wikimedia.org/r/1073749 (https://phabricator.wikimedia.org/T375066) (owner: 10Tiziano Fogli)
[10:14:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:14:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 0s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:14:24] <logmsgbot>	 !log elukey@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073427|Swap poolcounter2004 with poolcounter2006 (T332015)]] (duration: 07m 08s)
[10:14:29] <stashbot>	 T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015
[10:14:47] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] grafana: cluster name misc to grafana [puppet] - 10https://gerrit.wikimedia.org/r/1073749 (https://phabricator.wikimedia.org/T375066) (owner: 10Tiziano Fogli)
[10:18:16] <wikibugs>	 06SRE, 13Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10156444 (10fgiunchedi)
[10:18:17] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 41.25s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:18:33] <icinga-wm>	 PROBLEM - poolcounter on poolcounter2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd https://www.mediawiki.org/wiki/PoolCounter
[10:18:49] <elukey>	 this is me -^
[10:19:02] <elukey>	 the host is not serving anything atm, but restarting poolcounterd failed
[10:19:05] <icinga-wm>	 PROBLEM - Poolcounter connection on poolcounter2003 is CRITICAL: connect to address 10.192.0.132 and port 7531: Connection refused https://www.mediawiki.org/wiki/PoolCounter
[10:19:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 8.333% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:20:08] <icinga-wm>	 RECOVERY - Poolcounter connection on poolcounter2003 is OK: TCP OK - 0.001 second response time on 10.192.0.132 port 7531 https://www.mediawiki.org/wiki/PoolCounter
[10:20:34] <icinga-wm>	 RECOVERY - poolcounter on poolcounter2003 is OK: PROCS OK: 1 process with command name poolcounterd https://www.mediawiki.org/wiki/PoolCounter
[10:20:34] <elukey>	 !log restart poolcounterd on poolcounter2003 (not serving any traffic atm, tried to clear old/stale conns)
[10:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:10] <wikibugs>	 06SRE, 10observability, 13Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10156447 (10fgiunchedi)
[10:21:14] <wikibugs>	 06SRE, 10observability, 13Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10156448 (10fgiunchedi)
[10:22:43] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] re.switchdc.databases.prepare: reduce wait time [cookbooks] - 10https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[10:25:46] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye
[10:27:06] <wikibugs>	 (03PS1) 10Slyngshede: Context Processor: Check for signed in users before running processor. [software/bitu] - 10https://gerrit.wikimedia.org/r/1073752
[10:27:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Sounds good, let's give it a shot. We'll refresh puppetserver2003 in the forthcoming quarter and we'll buy it with 128G instead of 64, so " [puppet] - 10https://gerrit.wikimedia.org/r/1073751 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey)
[10:27:56] <wikibugs>	 (03PS2) 10Slyngshede: Context Processor: Check for signed in users before running processor. [software/bitu] - 10https://gerrit.wikimedia.org/r/1073752
[10:28:17] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 45s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:28:22] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:45] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:30:37] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:31:34] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Context Processor: Check for signed in users before running processor. [software/bitu] - 10https://gerrit.wikimedia.org/r/1073752 (owner: 10Slyngshede)
[10:34:11] <wikibugs>	 (03Merged) 10jenkins-bot: Context Processor: Check for signed in users before running processor. [software/bitu] - 10https://gerrit.wikimedia.org/r/1073752 (owner: 10Slyngshede)
[10:41:26] <wikibugs>	 (03PS3) 10Slyngshede: Audit log for permission requests validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1071849
[10:44:48] <Dreamy_Jazz>	 jouncebot: nowandnext
[10:44:48] <jouncebot>	 For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1000)
[10:44:48] <jouncebot>	 In 0 hour(s) and 15 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1100)
[10:46:58] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:47:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:47:48] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:47:50] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] re.switchdc.databases.prepare: reduce wait time [cookbooks] - 10https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[10:50:13] <wikibugs>	 (03PS1) 10Dreamy Jazz: Hooks: Re-order checks to verify that request user is same as Special:Contributions user [extensions/ContentTranslation] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073755 (https://phabricator.wikimedia.org/T375061)
[10:50:15] <wikibugs>	 (03CR) 10Volans: [C:03+2] re.switchdc.databases.prepare: reduce wait time (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[10:52:24] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/ContentTranslation] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073755 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz)
[10:52:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:53:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:54:52] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye
[10:55:09] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:56:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 53.03s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:58:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:00:05] <jouncebot>	 mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1100).
[11:01:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#10156519 (10MoritzMuehlenhoff)
[11:01:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 7.812s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:04:08] <wikibugs>	 (03Merged) 10jenkins-bot: re.switchdc.databases.prepare: reduce wait time [cookbooks] - 10https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[11:07:39] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:08:17] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:09:23] <wikibugs>	 (03PS1) 10Volans: sre.switchdc.databases: fix Phabricator message [cookbooks] - 10https://gerrit.wikimedia.org/r/1073757 (https://phabricator.wikimedia.org/T371351)
[11:09:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10156547 (10MoritzMuehlenhoff) Great, many thanks! I'll rebuild the RAID and then I'll add the server back to active duty. Hopefully it works now for longer than a week :-)
[11:12:00] <wikibugs>	 (03PS2) 10Btullis: Add a cephosd cluster and assign it to the appropriate hosts [puppet] - 10https://gerrit.wikimedia.org/r/1073434 (https://phabricator.wikimedia.org/T374932)
[11:12:46] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4012/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073434 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis)
[11:14:57] <Dreamy_Jazz>	 Anyone using this window?
[11:15:14] <Dreamy_Jazz>	 Would like to see if I can backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1073755 which should resolve a train blocker
[11:15:37] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[11:16:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073755 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz)
[11:16:15] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[11:16:29] <wikibugs>	 (03PS1) 10JMeybohm: Fix ferm_status to actually compare rules [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366)
[11:16:38] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "https://phabricator.wikimedia.org/T374972#10156561" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073757 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[11:16:40] <wikibugs>	 (03PS1) 10Muehlenhoff: Disable memcached ticket registry [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1073761 (https://phabricator.wikimedia.org/T367487)
[11:18:11] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1073761 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[11:18:54] <wikibugs>	 (03PS2) 10JMeybohm: Fix ferm_status to actually compare rules [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366)
[11:23:14] <tchin>	 !log Deploying refinery
[11:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:40] <logmsgbot>	 !log tchin@deploy1003 Started deploy [analytics/refinery@bc0be94]: Regular analytics weekly train [analytics/refinery@bc0be94a]
[11:23:54] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Disable memcached ticket registry [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1073761 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[11:24:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:24:16] <wikibugs>	 (03CR) 10Volans: [C:03+2] sre.switchdc.databases: fix Phabricator message [cookbooks] - 10https://gerrit.wikimedia.org/r/1073757 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[11:27:03] <wikibugs>	 (03PS1) 10Volans: sre.switchdc.databses: test on test-s4 section [cookbooks] - 10https://gerrit.wikimedia.org/r/1073762
[11:27:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:27:57] <wikibugs>	 (03CR) 10Volans: [C:04-2] "DO NOT MERGE, just for testing purposed with test-cookbook for testing on test-s4 section" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073762 (owner: 10Volans)
[11:30:45] <wikibugs>	 (03PS2) 10Volans: sre.switchdc.databses: test on test-s4 section [cookbooks] - 10https://gerrit.wikimedia.org/r/1073762
[11:32:47] <logmsgbot>	 !log tchin@deploy1003 Finished deploy [analytics/refinery@bc0be94]: Regular analytics weekly train [analytics/refinery@bc0be94a] (duration: 09m 06s)
[11:33:18] <logmsgbot>	 !log tchin@deploy1003 Started deploy [analytics/refinery@bc0be94] (thin): Regular analytics weekly train THIN [analytics/refinery@bc0be94a]
[11:34:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:37:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:39:08] <logmsgbot>	 !log tchin@deploy1003 Finished deploy [analytics/refinery@bc0be94] (thin): Regular analytics weekly train THIN [analytics/refinery@bc0be94a] (duration: 05m 50s)
[11:39:32] <wikibugs>	 (03Merged) 10jenkins-bot: sre.switchdc.databases: fix Phabricator message [cookbooks] - 10https://gerrit.wikimedia.org/r/1073757 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[11:39:34] <logmsgbot>	 !log tchin@deploy1003 Started deploy [analytics/refinery@bc0be94] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bc0be94a]
[11:43:31] <logmsgbot>	 !log tchin@deploy1003 Finished deploy [analytics/refinery@bc0be94] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bc0be94a] (duration: 03m 57s)
[11:43:51] <XioNoX>	 !log update pfw3-codfw dhcp-relay target 0 T375011
[11:43:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:12] <wikibugs>	 (03PS3) 10Anzx: Lift IP cap on 2024-10-07/08 for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073586 (https://phabricator.wikimedia.org/T374964)
[11:45:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.switchdc.databses: test on test-s4 section [cookbooks] - 10https://gerrit.wikimedia.org/r/1073762 (owner: 10Volans)
[11:45:59] <wikibugs>	 (03Merged) 10jenkins-bot: Hooks: Re-order checks to verify that request user is same as Special:Contributions user [extensions/ContentTranslation] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073755 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz)
[11:46:18] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1073755|Hooks: Re-order checks to verify that request user is same as Special:Contributions user (T375061)]]
[11:46:22] <stashbot>	 T375061: InvalidArgumentException: Invalid username: <IP range> - https://phabricator.wikimedia.org/T375061
[11:47:16] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073586 (https://phabricator.wikimedia.org/T374964) (owner: 10Anzx)
[11:48:28] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1073755|Hooks: Re-order checks to verify that request user is same as Special:Contributions user (T375061)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:50:48] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[11:52:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:53:35] <wikibugs>	 (03PS1) 10Hnowlan: shellbox-video: bypass mesh temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073770 (https://phabricator.wikimedia.org/T373517)
[11:53:39] <wikibugs>	 (03PS1) 10Dreamy Jazz: Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073771 (https://phabricator.wikimedia.org/T375061)
[11:53:58] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073771 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz)
[11:54:07] <wikibugs>	 (03PS1) 10Dreamy Jazz: Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073772 (https://phabricator.wikimedia.org/T375061)
[11:54:14] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073772 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz)
[11:54:14] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073770 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[11:55:09] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:55:21] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073755|Hooks: Re-order checks to verify that request user is same as Special:Contributions user (T375061)]] (duration: 09m 03s)
[11:55:25] <wikibugs>	 (03PS1) 10Brouberol: airflow: define an internal service name for the scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073773 (https://phabricator.wikimedia.org/T375072)
[11:55:25] <stashbot>	 T375061: InvalidArgumentException: Invalid username: <IP range> - https://phabricator.wikimedia.org/T375061
[11:55:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073772 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz)
[11:55:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073771 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz)
[12:02:22] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::idp::build: Readd rsync service [puppet] - 10https://gerrit.wikimedia.org/r/1073775 (https://phabricator.wikimedia.org/T367487)
[12:04:54] <wikibugs>	 (03Merged) 10jenkins-bot: Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073771 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz)
[12:04:56] <wikibugs>	 (03Merged) 10jenkins-bot: Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073772 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz)
[12:05:18] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1073772|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]], [[gerrit:1073771|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]]
[12:05:22] <stashbot>	 T375061: InvalidArgumentException: Invalid username: <IP range> - https://phabricator.wikimedia.org/T375061
[12:07:35] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1073772|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]], [[gerrit:1073771|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:07:42] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[12:08:59] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1177.eqiad.wmnet with OS bullseye
[12:09:19] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add a cephosd cluster and assign it to the appropriate hosts [puppet] - 10https://gerrit.wikimedia.org/r/1073434 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis)
[12:09:32] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: 10Brouberol)
[12:10:04] <tchin>	 !log Deployed refinery using scap, then deployed onto hdfs
[12:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:37] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hiera: set cluster for insetup roles [puppet] - 10https://gerrit.wikimedia.org/r/1073776 (https://phabricator.wikimedia.org/T375066)
[12:12:19] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073772|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]], [[gerrit:1073771|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]] (duration: 07m 00s)
[12:12:24] <stashbot>	 T375061: InvalidArgumentException: Invalid username: <IP range> - https://phabricator.wikimedia.org/T375061
[12:12:31] <Dreamy_Jazz>	 Done my deploys for the train blocker
[12:12:58] <wikibugs>	 (03Abandoned) 10Cathal Mooney: Validate port block speed combo in server provision script for QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930264 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[12:13:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:12] <wikibugs>	 (03CR) 10Brouberol: ds8-k8s-service: add values for dumps2 job. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena)
[12:14:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:14:50] <jnuche>	 Dreamy_Jazz: thank for deploying the fix!
[12:15:01] <Dreamy_Jazz>	 Np!
[12:15:05] <jnuche>	 I'll roll forward the train in ~5 mins
[12:15:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 1m 52s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:17:14] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529#10156770 (10cmooney) 05Open→03Resolved Validator is working well to prevent any mis-match, and automation is configuring things correc...
[12:18:13] <logmsgbot>	 !log tchin@deploy1003 Started deploy [airflow-dags/analytics@e6cc31a]: Regular analytics weekly train
[12:18:56] <logmsgbot>	 !log tchin@deploy1003 Finished deploy [airflow-dags/analytics@e6cc31a]: Regular analytics weekly train (duration: 01m 18s)
[12:19:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:19:57] <wikibugs>	 (03CR) 10Muehlenhoff: "Good catch! One comment inline, looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm)
[12:20:16] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 1m 52s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:21:05] <wikibugs>	 (03CR) 10Gmodena: ds8-k8s-service: add values for dumps2 job. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena)
[12:21:09] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073777 (https://phabricator.wikimedia.org/T373642)
[12:21:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073777 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot)
[12:21:22] <logmsgbot>	 !log tchin@deploy1003 Started deploy [airflow-dags/analytics_test@e6cc31a]: Regular analytics weekly train
[12:21:39] <logmsgbot>	 !log tchin@deploy1003 Finished deploy [airflow-dags/analytics_test@e6cc31a]: Regular analytics weekly train (duration: 00m 20s)
[12:21:58] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073777 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot)
[12:23:27] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073541 (https://phabricator.wikimedia.org/T374372) (owner: 10C. Scott Ananian)
[12:23:37] <wikibugs>	 (03PS3) 10C. Scott Ananian: Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073541 (https://phabricator.wikimedia.org/T374372)
[12:28:55] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.23  refs T373642
[12:29:00] <stashbot>	 T373642: 1.43.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T373642
[12:30:04] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1073775 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[12:33:19] <moritzm>	 !log uploaded cas 7.0.4.1+wmf12u3 T367487
[12:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:25] <stashbot>	 T367487: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487
[12:34:42] <wikibugs>	 (03PS3) 10DCausse: wdqs categories: ship lastUpdated metric [puppet] - 10https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper)
[12:35:16] <wikibugs>	 (03CR) 10DCausse: "uploaded I828464daf76c9384545f2071963751effd5247cf and marked it as dependency" [puppet] - 10https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper)
[12:36:41] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10156844 (10MoritzMuehlenhoff) ganeti2017 and ganeti2026 are drained
[12:41:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:43:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] profile::idp::build: Readd rsync service [puppet] - 10https://gerrit.wikimedia.org/r/1073775 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[12:44:00] <vgutierrez>	 !log uploaded purged 0.23 to bullseye-wikimedia (apt.wm.o) - T334078
[12:44:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:05] <stashbot>	 T334078: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078
[12:44:22] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloud: codfw1dev: have a new bastion host in bastion-codfw1dev-04 [puppet] - 10https://gerrit.wikimedia.org/r/1073205 (https://phabricator.wikimedia.org/T374828)
[12:46:11] <vgutierrez>	 !log rolling upgrade to purged 0.23 in A:cp-ulsfo - T334078
[12:46:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 49.68s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:50:37] <wikibugs>	 (03PS4) 10Slyngshede: Audit log for permission requests validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1071849
[12:51:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 10% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:52:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:53:32] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: define an internal service name for the scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073773 (https://phabricator.wikimedia.org/T375072) (owner: 10Brouberol)
[12:53:49] <wikibugs>	 (03PS1) 10Muehlenhoff: idp::build: Remove duplicate repository config [puppet] - 10https://gerrit.wikimedia.org/r/1073788
[12:54:00] <wikibugs>	 (03PS1) 10KartikMistry: Updated cxserver to 2024-09-18-104433-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073789 (https://phabricator.wikimedia.org/T375017)
[12:54:09] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Add a cephosd cluster and assign it to the appropriate hosts [puppet] - 10https://gerrit.wikimedia.org/r/1073434 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis)
[12:54:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 16.25s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:54:55] <wikibugs>	 (03PS1) 10Dreamy Jazz: Hide temp account IP address viewing right from non-temp account wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073790 (https://phabricator.wikimedia.org/T369187)
[12:54:57] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[12:55:16] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover idp-test [dns] - 10https://gerrit.wikimedia.org/r/1073791
[12:55:21] <wikibugs>	 (03PS2) 10Dreamy Jazz: Hide temp account IP address viewing right from non-temp account wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073790 (https://phabricator.wikimedia.org/T369187)
[12:55:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073790 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz)
[12:56:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] "Sigh, good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm)
[12:56:09] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:56:44] <wikibugs>	 (03PS5) 10Slyngshede: Audit log for permission requests validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1071849
[12:58:53] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Looks good, forgot about that." [puppet] - 10https://gerrit.wikimedia.org/r/1073788 (owner: 10Muehlenhoff)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1300).
[13:00:05] <jouncebot>	 sergi0, Lucas_WMDE, Dreamy_Jazz, anzx, hnowlan, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <hnowlan>	 o/
[13:00:12] <Dreamy_Jazz>	 \o
[13:00:19] <sergi0>	 hi
[13:00:55] <wikibugs>	 (03CR) 10Elukey: [C:03+2] role::puppetserver: set the maximum number of instances [puppet] - 10https://gerrit.wikimedia.org/r/1073751 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey)
[13:01:37] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Failover idp-test [dns] - 10https://gerrit.wikimedia.org/r/1073791 (owner: 10Muehlenhoff)
[13:02:07] <hashar>	 hi, Lucas patch for "Check that throttling exceptions use valid public IP addresses" can be merged out of the window ( https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073487 )
[13:02:32] <elukey>	 jouncebot: next
[13:02:32] <jouncebot>	 In 0 hour(s) and 57 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1400)
[13:02:37] <wikibugs>	 (03CR) 10Cyndywikime: [C:03+1] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073739 (https://phabricator.wikimedia.org/T374577) (owner: 10Sergio Gimeno)
[13:02:45] <hashar>	 and there are too many patches for this one hour window, so that is definitely going to be extended
[13:03:04] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] GrowthExperiments: enable Community Updates module in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073739 (https://phabricator.wikimedia.org/T374577) (owner: 10Sergio Gimeno)
[13:03:10] <Dreamy_Jazz>	 I can deploy
[13:03:25] <hnowlan>	 my change can't be tested on testservers and so can just go straight to prod
[13:03:44] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Hide temp account IP address viewing right from non-temp account wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073790 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz)
[13:03:45] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: enable Community Updates module in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073739 (https://phabricator.wikimedia.org/T374577) (owner: 10Sergio Gimeno)
[13:04:05] <Dreamy_Jazz>	 anzx: Are you here for the window?
[13:04:20] <Dreamy_Jazz>	 Lucas_WMDE: Do you want me to deploy your changes?
[13:04:30] <wikibugs>	 (03Merged) 10jenkins-bot: Hide temp account IP address viewing right from non-temp account wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073790 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz)
[13:04:42] <hashar>	 anzx change can be deployed as is, there is not much testing we can do for throttling :)
[13:05:03] <wikibugs>	 (03CR) 10Hashar: [C:03+1] Lift IP cap on 2024-10-07/08 for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073586 (https://phabricator.wikimedia.org/T374964) (owner: 10Anzx)
[13:05:25] <Dreamy_Jazz>	 We could also merge that test to make sure the new patch works :)
[13:05:47] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE))
[13:05:48] <hashar>	 well the test simply cover there are no private IP used 
[13:05:55] <Dreamy_Jazz>	 Sure.
[13:05:55] <anzx>	 Dreamy_Jazz: o/
[13:06:06] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Lift IP cap on 2024-10-07/08 for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073586 (https://phabricator.wikimedia.org/T374964) (owner: 10Anzx)
[13:06:07] <hashar>	 what I wonder is whether `scap backport` can deploy both changes at the same time
[13:06:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] shellbox-video: bypass mesh temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073770 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[13:06:28] <Dreamy_Jazz>	 I would have thought so, but I can test that.
[13:06:31] <wikibugs>	 (03Merged) 10jenkins-bot: Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE))
[13:06:45] <wikibugs>	 (03Merged) 10jenkins-bot: Lift IP cap on 2024-10-07/08 for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073586 (https://phabricator.wikimedia.org/T374964) (owner: 10Anzx)
[13:06:57] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] shellbox-video: bypass mesh temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073770 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[13:07:06] <cscott>	 i'm here
[13:07:23] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073541 (https://phabricator.wikimedia.org/T374372) (owner: 10C. Scott Ananian)
[13:07:38] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-video: bypass mesh temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073770 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[13:07:44] <Dreamy_Jazz>	 Lucas_WMDE: Are you around for your wmf backports?
[13:08:05] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073541 (https://phabricator.wikimedia.org/T374372) (owner: 10C. Scott Ananian)
[13:08:22] <wikibugs>	 (03CR) 10Ssingh: "This is ready for review from Traffic." [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh)
[13:08:37] <wikibugs>	 (03PS1) 10Klausman: aptrepo: Add ROCm61 component for ML-Labs machines [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076)
[13:08:43] <wikibugs>	 (03CR) 10Ssingh: "I mean from our perspective this is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh)
[13:09:21] <icinga-wm>	 PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:09:24] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1073739|GrowthExperiments: enable Community Updates module in testwiki (T374577)]], [[gerrit:1073487|Check that throttling exceptions use valid public IP addresses (T374980)]], [[gerrit:1073790|Hide temp account IP address viewing right from non-temp account wikis (T369187)]], [[gerrit:1073586|Lift IP cap on 2024-10-07/08 for edit-a-thon (T374964)]]
[13:09:25] <logmsgbot>	 , [[gerrit:1073770|shellbox-video: bypass mesh temporarily (T373517)]], [[gerrit:1073541|Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage (T374372)]]
[13:09:29] <Dreamy_Jazz>	 As lucas hasn't said they are around, I'm going to proceed with all but the wmf backports
[13:09:33] <stashbot>	 T374577: Community Updates module: Release to Test Wikipedia - https://phabricator.wikimedia.org/T374577
[13:09:33] <stashbot>	 T374980: Enforce exclusion of private IP addresses from $wmgThrottlingExceptions in CI - https://phabricator.wikimedia.org/T374980
[13:09:33] <stashbot>	 T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187
[13:09:34] <stashbot>	 T374964: Lift IP cap on this dates 2024-10-07/08 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T374964
[13:09:34] <stashbot>	 T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517
[13:09:34] <stashbot>	 T374372: Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage (week of Sep 16) - https://phabricator.wikimedia.org/T374372
[13:09:49] <Dreamy_Jazz>	 They would take longer to merge, so we can always come back to them later
[13:11:53] <logmsgbot>	 !log dreamyjazz@deploy1003 sgimeno, anzx, lucaswerkmeister-wmde, cscott, hnowlan, dreamyjazz: Backport for [[gerrit:1073739|GrowthExperiments: enable Community Updates module in testwiki (T374577)]], [[gerrit:1073487|Check that throttling exceptions use valid public IP addresses (T374980)]], [[gerrit:1073790|Hide temp account IP address viewing right from non-temp account wikis (T369187)]], [[gerrit:1073586|Lift IP cap on
[13:11:53] <logmsgbot>	 2024-10-07/08 for edit-a-thon (T374964)]], [[gerrit:1073770|shellbox-video: bypass mesh temporarily (T373517)]], [[gerrit:1073541|Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage (T374372)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:11:56] <Dreamy_Jazz>	 cscott: sergi0: Please test your changes (if any testing is required).
[13:12:08] <Dreamy_Jazz>	 Let me know if you don't need to test it.
[13:12:12] <sergi0>	 Nothing testable, I'll check in testwiki
[13:12:19] <cscott>	 i can check that the defaults changed, hang on
[13:13:07] <Lucas_WMDE>	 dammit, I forgot I scheduled patches for this window
[13:13:20] <Lucas_WMDE>	 unfortunately I also have a meeting with some WMF folks in a few minutes
[13:13:26] <Lucas_WMDE>	 so I think I’ll just pass and try to deploy my patches another time, sorry
[13:13:50] <Dreamy_Jazz>	 No problem. I've merged the patch to test the IP addresses, but left the others.
[13:14:01] <Lucas_WMDE>	 thanks!
[13:14:50] <cscott>	 Dreamy_Jazz: ok, checked & verified. looks good!
[13:14:59] <Dreamy_Jazz>	 Thanks.
[13:15:11] <icinga-wm>	 RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:15:23] <Dreamy_Jazz>	 sergi0: Okay to proceed on your patch given that nothing is testable?
[13:15:32] <sergi0>	 yes
[13:15:48] <Dreamy_Jazz>	 My change is a no-op and I've tested that it doesn't break anything, so proceeding.
[13:15:50] <logmsgbot>	 !log dreamyjazz@deploy1003 sgimeno, anzx, lucaswerkmeister-wmde, cscott, hnowlan, dreamyjazz: Continuing with sync
[13:18:16] <elukey>	 !log restart puppetserver on puppetserver1002 - trashing - T373527
[13:18:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:21] <stashbot>	 T373527: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527
[13:19:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:19:50] <cscott>	 wikitech.wikimedia.org seems to redirect to foundation.wikimedia.org for me.  is that a known thing?
[13:20:08] <Dreamy_Jazz>	 Yes, because you have the debug extension set to enabled
[13:20:09] <wikibugs>	 (03PS3) 10Tiziano Fogli: icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438
[13:20:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:20:23] <cscott>	 Dreamy_Jazz: oh, that's a "feature"?
[13:20:33] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073739|GrowthExperiments: enable Community Updates module in testwiki (T374577)]], [[gerrit:1073487|Check that throttling exceptions use valid public IP addresses (T374980)]], [[gerrit:1073790|Hide temp account IP address viewing right from non-temp account wikis (T369187)]], [[gerrit:1073586|Lift IP cap on 2024-10-07/08 for edit-a-thon (T374964)]
[13:20:33] <logmsgbot>	 ], [[gerrit:1073770|shellbox-video: bypass mesh temporarily (T373517)]], [[gerrit:1073541|Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage (T374372)]] (duration: 11m 08s)
[13:20:40] <stashbot>	 T374577: Community Updates module: Release to Test Wikipedia - https://phabricator.wikimedia.org/T374577
[13:20:40] <hnowlan>	 thanks Dreamy_Jazz! 
[13:20:40] <stashbot>	 T374980: Enforce exclusion of private IP addresses from $wmgThrottlingExceptions in CI - https://phabricator.wikimedia.org/T374980
[13:20:40] <stashbot>	 T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187
[13:20:41] <stashbot>	 T374964: Lift IP cap on this dates 2024-10-07/08 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T374964
[13:20:41] <stashbot>	 T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517
[13:20:41] <stashbot>	 T374372: Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage (week of Sep 16) - https://phabricator.wikimedia.org/T374372
[13:21:40] <wikibugs>	 (03CR) 10Bking: flink-app: customize calico label selector (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[13:22:00] <Dreamy_Jazz>	 If you disable the debug extension and clear your cache it should fix it
[13:22:31] <wikibugs>	 (03PS4) 10Tiziano Fogli: icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438
[13:23:16] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:24:02] <wikibugs>	 (03PS5) 10Tiziano Fogli: icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438
[13:25:08] <Dreamy_Jazz>	 !log Afternoon UTC backport window done
[13:25:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10157009 (10elukey) puppetserver1002 is now running with 35 JRuby workers instead of 48, let's see how it goes at steady state. If everything...
[13:25:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438 (owner: 10Tiziano Fogli)
[13:25:40] <Lucas_WMDE>	 \o/
[13:25:48] <cscott>	 Dreamy_Jazz: thanks!
[13:25:48] <Lucas_WMDE>	 thanks for deploying Dreamy_Jazz!
[13:26:04] <sergi0>	 Thank you @Dreamy_Jazz 
[13:26:04] <Dreamy_Jazz>	 :D
[13:26:44] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438 (owner: 10Tiziano Fogli)
[13:26:47] <Dreamy_Jazz>	 The issue with the incorrect redirect should be fixed in a few weeks once wikitech.wikimedia.org is part of the production cluster.
[13:27:55] <wikibugs>	 (03PS2) 10Elukey: Swap poolcounter1004 with poolcounter1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073502 (https://phabricator.wikimedia.org/T332015)
[13:28:16] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 37.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:28:25] <hashar>	 oh wikifunctions again
[13:28:28] <elukey>	 hey folks, since the UTC backport is done I am going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073502 if nobody opposes
[13:28:33] <hashar>	 elukey: please hold
[13:28:39] <elukey>	 sure
[13:28:41] <hashar>	 there are some more patches :)
[13:29:17] <hashar>	 Lucas_WMDE: you aren't pushing the termbox updates?
[13:29:23] <Lucas_WMDE>	 I’m in a meeting
[13:29:25] <hashar>	 elukey: my bad go ahead, all the config patches go tpushed
[13:29:27] <Lucas_WMDE>	 maybe I’ll do them later
[13:29:32] <elukey>	 super thanks!
[13:29:33] <hashar>	 Lucas_WMDE: we can do it together after your meeting :]
[13:29:44] <Lucas_WMDE>	 ok ^^
[13:29:51] <Lucas_WMDE>	 I should be free in 30 minutes from now
[13:29:53] <hashar>	 just poke me when you are done
[13:29:58] <Lucas_WMDE>	 ok, thanks!
[13:29:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by elukey@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073502 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[13:30:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:31:00] <wikibugs>	 (03Merged) 10jenkins-bot: Swap poolcounter1004 with poolcounter1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073502 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[13:31:21] <logmsgbot>	 !log elukey@deploy1003 Started scap sync-world: Backport for [[gerrit:1073502|Swap poolcounter1004 with poolcounter1006 (T332015)]]
[13:31:25] <stashbot>	 T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015
[13:31:53] <cscott>	 I've got a general question about prometheus stats, is this a good place to ask it?
[13:32:23] <cscott>	 q is: how do i test a new metric locally?  i'd like something which just dumped the metric to a log somewhere so I could verify it was being generated correctly.
[13:32:54] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: define an internal service name for the scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073773 (https://phabricator.wikimedia.org/T375072) (owner: 10Brouberol)
[13:33:01] <cscott>	 i think i'd set up a statsd server locally at one point, but my new metrics don't have "backward-compatible" statsd names.
[13:33:33] <cscott>	 $wgStatsdServer is documented, but no mention of prometheus in MainConfigSchema.php ?
[13:33:40] <Lucas_WMDE>	 I’ve used `nc -ukl 8125` before (listen on the statsd port, dump to stdout)
[13:33:42] <logmsgbot>	 !log elukey@deploy1003 elukey: Backport for [[gerrit:1073502|Swap poolcounter1004 with poolcounter1006 (T332015)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:34:00] <logmsgbot>	 !log elukey@deploy1003 elukey: Continuing with sync
[13:34:27] <cscott>	 yeah, but i'm not calling ::copyToStatsdAt() for these, so I don't think they are going to show up on statsd
[13:35:15] <Lucas_WMDE>	 I see :/
[13:35:45] <wikibugs>	 06SRE, 10iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10157069 (10jijiki) @Dreamy_Jazz I have update the [[ https://grafana-rw.wikimedia.org/d/6C9Bm6uVz/ipoid?orgId=1 | Grafana dashboard ]], to include any metrics emitted by envoy. Do you have a...
[13:36:55] <Lucas_WMDE>	 FWIW, I think #wikimedia-observability is the channel where I got some pretty good help on statslib-related questions before
[13:37:03] <cscott>	 irc or slack/
[13:37:10] <Lucas_WMDE>	 IRC
[13:37:14] <cscott>	 ok, thanks!
[13:37:19] <Lucas_WMDE>	 right, slack reuses the #, I forgot ^^
[13:37:31] <wikibugs>	 06SRE, 10observability, 13Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10157088 (10fgiunchedi)
[13:38:36] <logmsgbot>	 !log elukey@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073502|Swap poolcounter1004 with poolcounter1006 (T332015)]] (duration: 07m 15s)
[13:38:41] <stashbot>	 T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015
[13:40:35] <wikibugs>	 (03CR) 10Elukey: "Looks good! Since this is a big jump, have you tried to install the packages on a Debian Bookworm container (or similar)? I am wondering i" [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman)
[13:42:22] <wikibugs>	 (03PS2) 10Elukey: Swap poolcounter1005 with poolcounter1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073503 (https://phabricator.wikimedia.org/T332015)
[13:43:09] <wikibugs>	 06SRE, 10iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10157120 (10Dreamy_Jazz) >>! In T375006#10157069, @jijiki wrote: > @Dreamy_Jazz I have updated the [[ https://grafana-rw.wikimedia.org/d/6C9Bm6uVz/ipoid?orgId=1 | Grafana dashboard ]], to inc...
[13:44:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by elukey@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073503 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[13:44:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:46:06] <wikibugs>	 (03Merged) 10jenkins-bot: Swap poolcounter1005 with poolcounter1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073503 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[13:46:26] <wikibugs>	 (03CR) 10Klausman: "My private workstation has a 7900XTX (gfx1100 from generation pov) GPU, and is running Trixie (-> kernel version). I created a chroot usin" [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman)
[13:46:27] <logmsgbot>	 !log elukey@deploy1003 Started scap sync-world: Backport for [[gerrit:1073503|Swap poolcounter1005 with poolcounter1007 (T332015)]]
[13:46:32] <wikibugs>	 06SRE, 10iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10157135 (10Dreamy_Jazz) Looking at the data that is now in Grafana (thanks for doing that btw :D ), it seems that the server is responding with 500 errors when these connection timeouts occu...
[13:46:33] <stashbot>	 T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015
[13:48:35] <logmsgbot>	 !log elukey@deploy1003 elukey: Backport for [[gerrit:1073503|Swap poolcounter1005 with poolcounter1007 (T332015)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:49:06] <wikibugs>	 (03CR) 10Elukey: "Super, seems perfect! I noticed that you added the new component under bullseye-wikimedia, should it be bookworm-wikimedia?" [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman)
[13:49:16] <logmsgbot>	 !log elukey@deploy1003 elukey: Continuing with sync
[13:50:27] <wikibugs>	 (03PS3) 10Brouberol: airflow: allow the webserver and scheduler to be deployed or not [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936)
[13:52:07] <wikibugs>	 (03PS4) 10Brouberol: airflow: allow the webserver and scheduler to be selectively deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936)
[13:52:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:53:21] <wikibugs>	 (03CR) 10Herron: "nice one thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1073749 (https://phabricator.wikimedia.org/T375066) (owner: 10Tiziano Fogli)
[13:53:30] <wikibugs>	 (03PS10) 10Bking: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195)
[13:53:50] <wikibugs>	 (03PS5) 10Brouberol: airflow: allow the webserver and scheduler to be selectively deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936)
[13:53:51] <logmsgbot>	 !log elukey@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073503|Swap poolcounter1005 with poolcounter1007 (T332015)]] (duration: 07m 23s)
[13:53:55] <stashbot>	 T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015
[13:55:16] <wikibugs>	 (03PS1) 10Ssingh: haproxy: switch order of TLS1.3 ciphers [puppet] - 10https://gerrit.wikimedia.org/r/1073798 (https://phabricator.wikimedia.org/T365327)
[13:56:10] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:56:13] <wikibugs>	 (03PS2) 10Klausman: aptrepo: Add ROCm61 component for ML-Labs machines [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076)
[13:56:31] <wikibugs>	 (03CR) 10Klausman: "Ah, the Bullseye bit was my bad. Fixed!" [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman)
[13:56:52] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4013/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073798 (https://phabricator.wikimedia.org/T365327) (owner: 10Ssingh)
[13:57:18] <wikibugs>	 (03PS11) 10Bking: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195)
[13:57:35] <wikibugs>	 (03CR) 10Bking: flink-app: customize calico label selector (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1400)
[14:00:46] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Left a nit, once fixed feel free to merge!" [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman)
[14:03:08] <wikibugs>	 (03PS1) 10Elukey: services: remove old poolcounter nodes from MW's net policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073802 (https://phabricator.wikimedia.org/T332015)
[14:06:48] <wikibugs>	 (03PS3) 10Klausman: aptrepo: Add ROCm61 component for ML-Labs machines [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076)
[14:07:00] <wikibugs>	 (03CR) 10Klausman: aptrepo: Add ROCm61 component for ML-Labs machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman)
[14:07:29] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:07:46] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:07:58] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: cloudgw: add support and enable IPv6 - https://phabricator.wikimedia.org/T374716#10157199 (10aborrero) p:05Triage→03Medium
[14:08:00] <wikibugs>	 06SRE, 10Observability-Metrics, 13Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10157200 (10lmata)
[14:08:06] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10157201 (10aborrero) p:05Triage→03Medium
[14:08:12] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: verify security groups settings for IPv6 - https://phabricator.wikimedia.org/T374714#10157202 (10aborrero) p:05Triage→03Medium
[14:08:26] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10157203 (10aborrero) p:05Triage→03Medium
[14:13:38] <wikibugs>	 10ops-eqiad, 06cloud-services-team, 06DC-Ops: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10157238 (10aborrero) p:05Triage→03Medium
[14:13:43] <wikibugs>	 10ops-eqiad, 06cloud-services-team, 06DC-Ops: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10157235 (10aborrero) hey @VRiley-WMF could you please advice what should we do with the memory error in this server?
[14:14:29] <wikibugs>	 (03PS12) 10Bking: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195)
[14:14:34] <wikibugs>	 (03CR) 10Bking: flink-app: customize calico label selector (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[14:16:38] <wikibugs>	 (03PS1) 10Ssingh: wikidough: change order of TLS1.3 cipher suites [puppet] - 10https://gerrit.wikimedia.org/r/1073803 (https://phabricator.wikimedia.org/T365327)
[14:17:34] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4014/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073803 (https://phabricator.wikimedia.org/T365327) (owner: 10Ssingh)
[14:18:48] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] wikidough: change order of TLS1.3 cipher suites [puppet] - 10https://gerrit.wikimedia.org/r/1073803 (https://phabricator.wikimedia.org/T365327) (owner: 10Ssingh)
[14:19:14] <wikibugs>	 (03PS4) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891)
[14:19:24] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:19:46] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:21:05] <wikibugs>	 (03PS5) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284)
[14:22:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol)
[14:23:47] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:24:00] <sukhe>	 !log run puppet agent on A:wikidough
[14:24:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:33] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:25:59] <wikibugs>	 (03PS6) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284)
[14:26:45] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough
[14:32:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins)
[14:33:08] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:33:20] <sukhe>	 ^ expected, rolling restarts of Wikimedia DNS
[14:33:41] <wikibugs>	 (03PS3) 10EoghanGaffney: contint: switch java_home from jdk-11 to jdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[14:33:43] <wikibugs>	 (03PS1) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908)
[14:34:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[14:34:10] <wikibugs>	 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 10Sustainability (Incident Followup): openstack: create a cookbook to inject commands to VMs via console at scale - https://phabricator.wikimedia.org/T347683#10157384 (10aborrero) p:05Triage→03Low
[14:35:44] <wikibugs>	 (03PS2) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908)
[14:36:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[14:36:46] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:38:00] <wikibugs>	 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 10Observability-Alerting, 10Sustainability (Incident Followup): monitoring: find out how we could have been paged for outage "Multiple CloudVPS instances lost their IPs" - https://phabricator.wikimedia.org/T347694#10157385 (10dcaro) 05Open→03Resolv...
[14:38:08] <icinga-wm>	 RECOVERY - MD RAID on puppetmaster1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:38:34] <wikibugs>	 (03PS3) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908)
[14:38:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[14:39:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:17] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough
[14:42:46] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:44:23] <wikibugs>	 (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[14:44:54] <wikibugs>	 (03PS5) 10Andrea Denisse: alert: Resolve alerts DNS queries to alert1002 [dns] - 10https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418)
[14:45:02] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:45:07] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:46:30] <wikibugs>	 (03PS4) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908)
[14:46:32] <Lucas_WMDE>	 jouncebot: nowandnext
[14:46:32] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1400)
[14:46:32] <jouncebot>	 In 0 hour(s) and 13 minute(s): Alert hosts failover to alert1002 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1500)
[14:46:44] <Lucas_WMDE>	 hashar: I totally forgot to ping you, sorry
[14:46:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[14:46:59] <Lucas_WMDE>	 probably not a good time right now, don’t think we want to cut into the alert hosts failover window
[14:46:59] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:47:04] <Lucas_WMDE>	 (and CI is definitely going to take more than 13 minutes)
[14:47:09] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:48:24] <wikibugs>	 (03PS5) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908)
[14:48:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[14:49:00] <hashar>	 Lucas_WMDE: no worries, we can do both at the same time?
[14:49:38] <hashar>	 ah alert host grr
[14:49:54] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] haproxy: switch order of TLS1.3 ciphers [puppet] - 10https://gerrit.wikimedia.org/r/1073798 (https://phabricator.wikimedia.org/T365327) (owner: 10Ssingh)
[14:49:55] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:50:22] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:50:34] <wikibugs>	 (03PS6) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908)
[14:50:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] alert: Resolve alerts DNS queries to alert1002 [dns] - 10https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[14:50:47] <hashar>	 Lucas_WMDE: then that termbox patch is a frontend fix isn't it? My guess is we can merge both and deploy after alert has been switched
[14:50:53] <Lucas_WMDE>	 hashar: actually, now that group1 is on wmf.23, I guess the wmf.22 backport can already be discarded anyway
[14:50:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[14:51:03] <Lucas_WMDE>	 it should only affect the frontend yeah
[14:51:09] <Lucas_WMDE>	 I don’t think we load any PHP code from that submodule
[14:51:17] <wikibugs>	 06SRE, 06cloud-services-team, 06Traffic, 13Patch-For-Review: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463#10157462 (10joanna_borun) p:05Triage→03Low
[14:52:01] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 55655
[14:52:07] <hashar>	 hmm and somehow the gate pipeline worked yesterday but the change did not merge bah
[14:52:19] <Lucas_WMDE>	 I removed the +2s because the scap backport had already died
[14:52:25] <Lucas_WMDE>	 (due to the failed test builds I think)
[14:52:35] <hashar>	 lets +2 the wmf.23 one
[14:52:35] <wikibugs>	 (03PS7) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908)
[14:52:39] <Lucas_WMDE>	 (though at the time I thought one of the failed builds was a gate-and-submit build. I didn’t know scap backport also died on failed test builds)
[14:52:40] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE))
[14:52:43] <Lucas_WMDE>	 ok!
[14:52:50] <icinga-wm>	 PROBLEM - poolcounter on poolcounter1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd https://www.mediawiki.org/wiki/PoolCounter
[14:52:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[14:53:00] <wikibugs>	 06SRE, 06cloud-services-team, 06Traffic, 13Patch-For-Review: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463#10157468 (10dcaro) Still some stuff to be changed: https://codesearch.wmcloud.org/search/?q=labweb
[14:53:27] <wikibugs>	 (03Abandoned) 10Lucas Werkmeister (WMDE): Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073479 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE))
[14:53:32] <Lucas_WMDE>	 I abandoned the wmf.22 one
[14:53:36] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 55655
[14:53:37] <Lucas_WMDE>	 (can still be restored if needed ^^)
[14:53:50] <icinga-wm>	 RECOVERY - poolcounter on poolcounter1004 is OK: PROCS OK: 1 process with command name poolcounterd https://www.mediawiki.org/wiki/PoolCounter
[14:54:01] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:54:19] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:54:28] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:54:33] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:55:52] <wikibugs>	 (03PS4) 10Hashar: contint: switch Jenkins to Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[14:55:53] <elukey>	 !log restart poolcounter on poolcounter100[4,5] (depooled nodes) to clear old/stale TCP conns for port 7531
[14:55:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:11] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "Eoghan and I will deploy it on Thursday 19 Sep at 8:30 UTC." [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[14:58:31] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#10157485 (10Scott_French) Thanks for taking a look, Riccardo. I should mention, this isn't blocking anything on our end, as I can always do somet...
[15:00:05] <jouncebot>	 denisse and godog: It is that lovely time of the day again! You are hereby commanded to deploy Alert hosts failover to alert1002. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1500).
[15:00:13] <denisse>	 godog: Ready!
[15:00:22] <godog>	 denisse: sweet, same
[15:00:28] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:50] <denisse>	 !log Disable meta-monitoring for the alert hosts - T372418
[15:00:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:54] <stashbot>	 T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418
[15:01:40] <wikibugs>	 (03PS8) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908)
[15:01:44] <denisse>	 !log Make alert1002 the active host - T372418
[15:01:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:02] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] alert: Failover from alert2002 to alert1002 [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[15:02:05] <wikibugs>	 (03CR) 10Bking: [C:03+2] flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[15:02:12] <wikibugs>	 (03PS7) 10Andrea Denisse: alert: Failover from alert2002 to alert1002 [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418)
[15:02:23] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4022/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[15:02:27] <wikibugs>	 (03CR) 10Bking: [C:03+2] "self-merging based on verbal approval during pairing session" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[15:02:27] <sukhe>	 !log sudo cumin "A:cp" 'disable-puppet "merging CR 1073798"': T365327
[15:02:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:36] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+2 C:03+2] alert: Failover from alert2002 to alert1002 [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[15:03:26] <wikibugs>	 (03Merged) 10jenkins-bot: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[15:03:40] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] haproxy: switch order of TLS1.3 ciphers [puppet] - 10https://gerrit.wikimedia.org/r/1073798 (https://phabricator.wikimedia.org/T365327) (owner: 10Ssingh)
[15:03:50] <wikibugs>	 (03PS1) 10Elukey: services: update Tegola's Docker image to pick up package upgrades [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073818 (https://phabricator.wikimedia.org/T373976)
[15:03:58] <_joe_>	 !log uploading conftool 3.2.4 to apt T375059
[15:04:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:03] <stashbot>	 T375059: Requestctl sync writes unchanged objects - https://phabricator.wikimedia.org/T375059
[15:06:18] <hashar>	 denisse: can you let me know once alert has been switched over?  I will deploy a MediaWiki update once you are done :)
[15:06:48] <denisse>	 hashar: For sure, I'll let you know, thank you.
[15:07:15] <hashar>	 Lucas_WMDE: of course something unrelated exploded :/
[15:07:33] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] tftpboot: purge old files [puppet] - 10https://gerrit.wikimedia.org/r/1073532 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway)
[15:07:53] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10157535 (10VRiley-WMF) a:03VRiley-WMF
[15:08:22] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] alert: Resolve alerts DNS queries to alert1002 [dns] - 10https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[15:08:44] <denisse>	 !log Resolve alerts DNS queries to alert1002 - T372418
[15:08:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:51] <stashbot>	 T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418
[15:11:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE))
[15:11:58] <Lucas_WMDE>	 :(
[15:12:13] <hashar>	 maybe cause tests are running in parallel
[15:12:25] <Lucas_WMDE>	 this one looks familiar
[15:12:46] <Lucas_WMDE>	 aha, https://phabricator.wikimedia.org/T374912
[15:12:54] <Lucas_WMDE>	 that was indeed related to the parallel tests (IIUC)
[15:13:12] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:13:18] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service puppetmaster2001:8140 has failed probes (http_puppetmaster2001_codfw_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:13:22] <Lucas_WMDE>	 I wonder if it’s flaky or if that CheckUser fix needs to be backported for the backport to pass
[15:13:35] <hashar>	 oh great
[15:13:45] <hashar>	 well I guess I can backport the fix :)
[15:13:53] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service puppetmaster2001:8141 has failed probes (http_puppetmaster2001_codfw_wmnet_backend_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:14:06] <_joe_>	 uh
[15:14:07] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2013.codfw.wmnet
[15:14:22] <mutante>	 maybe it would be good to pause all activity for a while
[15:14:27] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:14:27] <mutante>	 and ignore the alerts
[15:14:39] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2013.codfw.wmnet
[15:14:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloud: codfw1dev: have a new bastion host in bastion-codfw1dev-04 [puppet] - 10https://gerrit.wikimedia.org/r/1073205 (https://phabricator.wikimedia.org/T374828) (owner: 10Arturo Borrero Gonzalez)
[15:14:46] <_joe_>	 well it's hard to ignroe alerts
[15:14:49] <mutante>	 the switching of the alerting server seems a big deal to me
[15:14:49] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2014.codfw.wmnet
[15:14:50] <wikibugs>	 (03PS1) 10Hashar: Add scope to temporary users created by populate tables test [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073823 (https://phabricator.wikimedia.org/T374912)
[15:14:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:15:25] <wikibugs>	 (03PS2) 10Hashar: Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE))
[15:15:26] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:15:26] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2014.codfw.wmnet
[15:15:36] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2024.codfw.wmnet
[15:15:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job icinga-am in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:09] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2024.codfw.wmnet
[15:16:19] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2048.codfw.wmnet
[15:16:41] <wikibugs>	 (03CR) 10Hashar: [C:03+2] "Retrying due to `SpecialCentralAuthTest::testViewForExistingGlobalTemporaryAccount` failing to find `centralauth-admin-info-expired` / T37" [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE))
[15:16:52] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2048.codfw.wmnet
[15:17:03] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2049.codfw.wmnet
[15:17:19] <wikibugs>	 (03CR) 10Hashar: [C:03+2] "Cherry picked to let us backport the Wikibase change https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1073478" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073823 (https://phabricator.wikimedia.org/T374912) (owner: 10Hashar)
[15:17:35] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2049.codfw.wmnet
[15:17:46] <_joe_>	 is it expected that so many probes would fail?
[15:17:46] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2050.codfw.wmnet
[15:18:03] <hashar>	 Lucas_WMDE: thanks for having found the CheckUser fix. I have backported it / +2ed it for wmf.23 and made your Wikibase change depends on it and +2ed it as well. We will see!
[15:18:12] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:18:13] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:18:18] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2050.codfw.wmnet
[15:18:29] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2051.codfw.wmnet
[15:18:47] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on vrts2002.codfw.wmnet with reason: Migration
[15:18:58] <Lucas_WMDE>	 hashar: thanks!
[15:19:01] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on vrts2002.codfw.wmnet with reason: Migration
[15:19:05] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2051.codfw.wmnet
[15:19:14] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[15:19:15] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2444.codfw.wmnet
[15:19:37] <Dreamy_Jazz>	 Thanks for backporting that fix.
[15:19:39] <sukhe>	 !log rolling out TLS1.3 cipher suite priority order change CR 1073798 to all cp hosts
[15:19:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:43] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[15:19:48] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2444.codfw.wmnet
[15:19:52] <Dreamy_Jazz>	 jouncebot: nowandnext
[15:19:52] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): Alert hosts failover to alert1002 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1500)
[15:19:52] <jouncebot>	 In 1 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700)
[15:19:58] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2445.codfw.wmnet
[15:20:20] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.remove-downtime for vrts2002.codfw.wmnet
[15:20:20] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for vrts2002.codfw.wmnet
[15:20:20] <Dreamy_Jazz>	 Just checking the calendar for any free spots in a 30 mins or so
[15:20:31] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2445.codfw.wmnet
[15:20:42] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Migration
[15:20:46] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Migration
[15:20:56] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: Add nr to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1073827 (https://phabricator.wikimedia.org/T375087)
[15:21:07] <mutante>	 Dreamy_Jazz: this is probably not the best time to deploy
[15:21:08] <denisse>	 !log Enable metamonitoring for the alert1002, and alert2002 hosts - T372418
[15:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:12] <stashbot>	 T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418
[15:21:33] <Dreamy_Jazz>	 Sure. Would be it be better once the current window is over?
[15:21:46] <denisse>	 godog: I think we're done, everything looks good to me. What do you think?
[15:21:59] <denisse>	 Dreamy_Jazz: Please give me a couple of minutes and I'll let you know once we're donee.
[15:22:09] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[15:22:16] <Dreamy_Jazz>	 That's fine. I'm definitely going to wait until the Wikibase change is backported.
[15:22:26] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[15:22:46] <godog>	 denisse: sgtm
[15:23:34] <denisse>	 Yes, I think we're done.
[15:23:47] <denisse>	 hashar Dreamy_Jazz You can deploy now, thanks for your patience.
[15:24:23] <hashar>	 thx!
[15:24:35] <wikibugs>	 (03PS9) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908)
[15:25:12] <denisse>	 godog: I think we can proceed with the decommission of the old hosts now, I already have patches for that. https://phabricator.wikimedia.org/T372607
[15:25:16] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4024/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[15:25:25] <wikibugs>	 (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4023/console" [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman)
[15:25:30] <denisse>	 Or we could wait for next week if that's more appropriate.
[15:25:36] <logmsgbot>	 !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[15:25:50] <godog>	 denisse: yeah let's wait a few days, I'll be reviewing your patches
[15:26:04] <logmsgbot>	 !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[15:26:05] <wikibugs>	 (03CR) 10Klausman: [V:03+1 C:03+2] aptrepo: Add ROCm61 component for ML-Labs machines [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman)
[15:26:23] <denisse>	 godog: Thank you, I'll also double check them to ensure everything is correct.
[15:26:28] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2446.codfw.wmnet
[15:26:35] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10157756 (10Vgutierrez) p:05Triage→03Medium
[15:26:45] <wikibugs>	 (03CR) 10Volans: "nit/question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh)
[15:27:01] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2446.codfw.wmnet
[15:27:02] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] Add scope to temporary users created by populate tables test [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073823 (https://phabricator.wikimedia.org/T374912) (owner: 10Hashar)
[15:27:11] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2447.codfw.wmnet
[15:27:43] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2447.codfw.wmnet
[15:27:54] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2448.codfw.wmnet
[15:28:30] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2448.codfw.wmnet
[15:28:40] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2449.codfw.wmnet
[15:28:41] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10157766 (10Vgutierrez) 05Open→03Stalled a:03Vgutierrez per data.yaml we need approval from @odimitrijevic / @Milimetric / @WDoranWMF / @Ahoelzl / @Ottomata (one of them is enough)
[15:28:48] <wikibugs>	 (03Abandoned) 10Ladsgroup: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1040887 (https://phabricator.wikimedia.org/T367020) (owner: 10Gerrit maintenance bot)
[15:29:01] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Add nr to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1073827 (https://phabricator.wikimedia.org/T375087) (owner: 10Gerrit maintenance bot)
[15:29:16] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2449.codfw.wmnet
[15:29:27] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2450.codfw.wmnet
[15:30:00] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2450.codfw.wmnet
[15:30:10] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2451.codfw.wmnet
[15:30:12] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "you beat me in the race, patch already open :p" [dns] - 10https://gerrit.wikimedia.org/r/1073827 (https://phabricator.wikimedia.org/T375087) (owner: 10Gerrit maintenance bot)
[15:30:17] <logmsgbot>	 !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[15:30:29] <logmsgbot>	 !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[15:30:43] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2451.codfw.wmnet
[15:46:01] <wikibugs>	 (03PS3) 10Vgutierrez: admin: Grant cyndywikime shell and analytics_privatedata_users access [puppet] - 10https://gerrit.wikimedia.org/r/1073834 (https://phabricator.wikimedia.org/T375060)
[15:46:36] <wikibugs>	 (03PS1) 10Jdlrobson: Limit quick surveys to wikis with messages defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073839 (https://phabricator.wikimedia.org/T374654)
[15:46:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073839 (https://phabricator.wikimedia.org/T374654) (owner: 10Jdlrobson)
[15:46:55] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add rclone to db1208 for testing s3 -> local backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[15:47:38] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "thanks for adding that comment. and user details look all good to me. just needs the approval." [puppet] - 10https://gerrit.wikimedia.org/r/1073834 (https://phabricator.wikimedia.org/T375060) (owner: 10Vgutierrez)
[15:48:10] <wikibugs>	 (03PS1) 10Hnowlan: Apply videoscaler request limits and wall clock time limits to shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517)
[15:48:25] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[15:48:26] <wikibugs>	 (03CR) 10Ssingh: sre.cdn.pdns-recursor: add rolling restart script (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins)
[15:48:42] <hashar>	 Lucas_WMDE: patches are almost merged
[15:48:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Apply videoscaler request limits and wall clock time limits to shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[15:48:53] * hashar grabs chocolate
[15:49:55] <wikibugs>	 (03Merged) 10jenkins-bot: Add scope to temporary users created by populate tables test [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073823 (https://phabricator.wikimedia.org/T374912) (owner: 10Hashar)
[15:49:58] <wikibugs>	 (03Merged) 10jenkins-bot: Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE))
[15:50:33] <Lucas_WMDE>	 yay
[15:51:07] <hashar>	 so hmm
[15:51:30] <hashar>	 oh my god
[15:51:37] <hashar>	 I do a git remote update ont he deployment server and...
[15:51:40] <hashar>	 fatal: exec 'rev-list': cd to 'view/lib/wikibase-termbox' failed: No such file or directory
[15:52:31] <hashar>	 oh wrong branch
[15:53:25] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10157879 (10Vgutierrez)
[15:54:09] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10157890 (10Vgutierrez) SSH key has been confirmed out of band
[15:54:14] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: Update termbox (mul support) - T373088
[15:54:19] <stashbot>	 T373088: [MUL] placeholder labels not appearing on mobile - https://phabricator.wikimedia.org/T373088
[15:54:34] <wikibugs>	 (03PS2) 10Hnowlan: Apply videoscaler request limits and wall clock time limits to shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517)
[15:55:02] <wikibugs>	 (03PS3) 10Hnowlan: Apply videoscaler request limits and wall clock time limits to shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517)
[15:55:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Wording nitpick, but the rest LGTM. Feel free to merge" [puppet] - 10https://gerrit.wikimedia.org/r/1073838 (https://phabricator.wikimedia.org/T348876) (owner: 10Elukey)
[15:55:38] <elukey>	 !log deploy python3-setuptools upgrades fleetwide
[15:55:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:14] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Permit db1208 to access the Ceph/S3 endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1073837 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis)
[15:58:09] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: Move servers in codfw rack D5
[15:58:12] <wikibugs>	 (03PS2) 10Elukey: profile::docker::reporter: fix k8s_rules.ini [puppet] - 10https://gerrit.wikimedia.org/r/1073838 (https://phabricator.wikimedia.org/T348876)
[15:58:32] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: Move servers in codfw rack D5
[15:58:43] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::docker::reporter: fix k8s_rules.ini (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1073838 (https://phabricator.wikimedia.org/T348876) (owner: 10Elukey)
[15:58:43] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10157915 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7e878ed4-7126-4f45-87aa-d1087aacf81a) set by cmooney@cumin100...
[16:00:27] <topranks>	 !log moving servers in codfw rack D5 from asw-d5-codfw to lsw1-d5-codfw T373104
[16:00:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:42] <stashbot>	 T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104
[16:01:03] <logmsgbot>	 !log hashar@deploy1003 Finished scap sync-world: Update termbox (mul support) - T373088 (duration: 06m 48s)
[16:01:16] <stashbot>	 T373088: [MUL] placeholder labels not appearing on mobile - https://phabricator.wikimedia.org/T373088
[16:01:24] <hashar>	 Lucas_WMDE: I think I have deployed it
[16:01:44] <Lucas_WMDE>	 no test servers?
[16:02:07] <hashar>	 since there were two patches I went to do a submodule update and `scap sync-world`
[16:02:15] <hashar>	 which well yeah, deploys straight to everything
[16:02:16] <hashar>	 :/
[16:02:23] <Lucas_WMDE>	 `scap backport` would’ve supported that AFAIK
[16:02:30] <Lucas_WMDE>	 you can specify more than one URL (or patch number)
[16:02:38] <Lucas_WMDE>	 but anyway, with ?debug=2 the new JS code seems to work \o/
[16:02:51] * hashar blames cache
[16:02:55] <Lucas_WMDE>	 (ah, and with ?action=purge too)
[16:02:56] <hashar>	 great! thank you for the verification
[16:03:04] <Lucas_WMDE>	 thanks for deploying!
[16:03:22] <hashar>	 with https://m.wikidata.org/wiki/Q42?q=SELECT%20*;  it works as well
[16:03:55] <hashar>	 so that is cached in the frontend cache
[16:04:38] <wikibugs>	 (03CR) 10Dzahn: "I am a bit conflicted here. We actually did not see matching throttle events in the dashboards after all. It seems like it could have also" [puppet] - 10https://gerrit.wikimedia.org/r/1073740 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[16:05:26] <wikibugs>	 (03PS1) 10Bking: rdf-streaming-updater: remove references to old-style network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073842 (https://phabricator.wikimedia.org/T373195)
[16:06:13] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:25:00 on 24 hosts with reason: Move servers in codfw rack D6
[16:06:27] <wikibugs>	 (03PS15) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[16:06:36] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on 24 hosts with reason: Move servers in codfw rack D6
[16:06:45] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10157936 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9cef1cb8-6d99-4d39-b2db-e242da2fe3f6) set by cmooney@cumin100...
[16:07:10] <topranks>	 !log moving servers in codfw rack D6 from asw-d6-codfw to lsw1-d6-codfw T373104
[16:07:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:14] <stashbot>	 T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104
[16:07:24] <wikibugs>	 (03CR) 10Bking: [C:03+2] rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[16:08:21] <wikibugs>	 (03CR) 10Dzahn: "If we merge this now we might end up in situation where it doesn't happen again but we never know why and if it was an unrelated glitch or" [puppet] - 10https://gerrit.wikimedia.org/r/1073740 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[16:10:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10157954 (10phaultfinder)
[16:12:45] <wikibugs>	 (03CR) 10Ssingh: sre.dns.admin: add guardrails for depool of sites/resources (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh)
[16:14:15] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:15:03] <wikibugs>	 (03PS1) 10Dreamy Jazz: Autopromote users into checkuser-temporary-account-viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073844 (https://phabricator.wikimedia.org/T369187)
[16:15:21] <Dreamy_Jazz>	 jouncebot: nowandnext
[16:15:21] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 44 minute(s)
[16:15:21] <jouncebot>	 In 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700)
[16:21:28] <wikibugs>	 (03PS2) 10Dreamy Jazz: Autopromote users into checkuser-temporary-account-viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073844 (https://phabricator.wikimedia.org/T369187)
[16:21:48] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10157981 (10cmooney) All hosts moved and responding to ping again.  Thanks all for the help!
[16:23:14] <wikibugs>	 (03CR) 10Volans: "clarifications inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh)
[16:23:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69262 and previous config saved to /var/cache/conftool/dbconfig/20240918-162316-arnaudb.json
[16:23:21] <stashbot>	 T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104
[16:23:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69263 and previous config saved to /var/cache/conftool/dbconfig/20240918-162321-arnaudb.json
[16:23:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69264 and previous config saved to /var/cache/conftool/dbconfig/20240918-162326-arnaudb.json
[16:23:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69265 and previous config saved to /var/cache/conftool/dbconfig/20240918-162331-arnaudb.json
[16:23:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69266 and previous config saved to /var/cache/conftool/dbconfig/20240918-162341-arnaudb.json
[16:23:47] <Dreamy_Jazz>	 jouncebot: nowandnext
[16:23:47] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 36 minute(s)
[16:23:47] <jouncebot>	 In 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700)
[16:23:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69267 and previous config saved to /var/cache/conftool/dbconfig/20240918-162346-arnaudb.json
[16:23:51] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2013.codfw.wmnet
[16:23:51] <Dreamy_Jazz>	 Going to deploy now
[16:23:52] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69268 and previous config saved to /var/cache/conftool/dbconfig/20240918-162351-arnaudb.json
[16:23:53] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2013.codfw.wmnet
[16:23:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69269 and previous config saved to /var/cache/conftool/dbconfig/20240918-162357-arnaudb.json
[16:24:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69270 and previous config saved to /var/cache/conftool/dbconfig/20240918-162401-arnaudb.json
[16:24:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69271 and previous config saved to /var/cache/conftool/dbconfig/20240918-162406-arnaudb.json
[16:24:22] <wikibugs>	 (03CR) 10Ssingh: sre.dns.admin: add guardrails for depool of sites/resources (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh)
[16:25:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073844 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz)
[16:25:36] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2014.codfw.wmnet
[16:25:38] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2014.codfw.wmnet
[16:25:52] <wikibugs>	 (03Merged) 10jenkins-bot: Autopromote users into checkuser-temporary-account-viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073844 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz)
[16:25:53] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10158024 (10ABran-WMF) nodes repooling, haproxy reloaded, thanks for the update @cmooney   @Ladsgroup I'll get to T375050
[16:25:54] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2024.codfw.wmnet
[16:25:56] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2024.codfw.wmnet
[16:26:12] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2048.codfw.wmnet
[16:26:14] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1073844|Autopromote users into checkuser-temporary-account-viewer (T369187 T327913)]]
[16:26:14] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2048.codfw.wmnet
[16:26:23] <stashbot>	 T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187
[16:26:23] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T375050
[16:26:24] <stashbot>	 T327913: Assign checkuser-temporary-account right to various groups - https://phabricator.wikimedia.org/T327913
[16:26:30] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2049.codfw.wmnet
[16:26:31] <stashbot>	 T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050
[16:26:32] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2049.codfw.wmnet
[16:26:47] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2050.codfw.wmnet
[16:26:49] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T375050
[16:26:49] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2050.codfw.wmnet
[16:27:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2218 with weight 0 T375050', diff saved to https://phabricator.wikimedia.org/P69272 and previous config saved to /var/cache/conftool/dbconfig/20240918-162703-arnaudb.json
[16:27:05] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2051.codfw.wmnet
[16:27:07] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2051.codfw.wmnet
[16:27:23] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2444.codfw.wmnet
[16:27:25] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2444.codfw.wmnet
[16:27:40] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2445.codfw.wmnet
[16:27:42] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2445.codfw.wmnet
[16:27:58] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2446.codfw.wmnet
[16:28:00] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2446.codfw.wmnet
[16:28:04] <wikibugs>	 (03PS5) 10Ssingh: sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042
[16:28:15] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2447.codfw.wmnet
[16:28:18] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2447.codfw.wmnet
[16:28:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2218 from API/vslow/dump T375050', diff saved to https://phabricator.wikimedia.org/P69273 and previous config saved to /var/cache/conftool/dbconfig/20240918-162822-arnaudb.json
[16:28:31] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1073844|Autopromote users into checkuser-temporary-account-viewer (T369187 T327913)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:28:33] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2448.codfw.wmnet
[16:28:35] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2448.codfw.wmnet
[16:28:50] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2449.codfw.wmnet
[16:28:52] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2449.codfw.wmnet
[16:29:08] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2450.codfw.wmnet
[16:29:10] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2450.codfw.wmnet
[16:29:26] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2451.codfw.wmnet
[16:29:28] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2451.codfw.wmnet
[16:29:44] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2016.codfw.wmnet
[16:29:46] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2016.codfw.wmnet
[16:29:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:30:01] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2017.codfw.wmnet
[16:30:03] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2017.codfw.wmnet
[16:32:14] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1073699 (https://phabricator.wikimedia.org/T375050) (owner: 10Gerrit maintenance bot)
[16:33:25] <arnaudb>	 !log Starting s7 codfw failover from db2220 to db2218 - T375050
[16:33:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:30] <stashbot>	 T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050
[16:34:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2218 to s7 primary T375050', diff saved to https://phabricator.wikimedia.org/P69274 and previous config saved to /var/cache/conftool/dbconfig/20240918-163404-arnaudb.json
[16:35:32] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[16:36:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10158110 (10phaultfinder)
[16:36:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 T375050', diff saved to https://phabricator.wikimedia.org/P69275 and previous config saved to /var/cache/conftool/dbconfig/20240918-163637-arnaudb.json
[16:37:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 5%: T375050', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240918-163721-arnaudb.json
[16:38:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69276 and previous config saved to /var/cache/conftool/dbconfig/20240918-163822-arnaudb.json
[16:38:27] <stashbot>	 T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104
[16:38:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69277 and previous config saved to /var/cache/conftool/dbconfig/20240918-163827-arnaudb.json
[16:38:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69278 and previous config saved to /var/cache/conftool/dbconfig/20240918-163832-arnaudb.json
[16:38:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69279 and previous config saved to /var/cache/conftool/dbconfig/20240918-163837-arnaudb.json
[16:38:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69280 and previous config saved to /var/cache/conftool/dbconfig/20240918-163847-arnaudb.json
[16:38:52] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69281 and previous config saved to /var/cache/conftool/dbconfig/20240918-163852-arnaudb.json
[16:38:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69282 and previous config saved to /var/cache/conftool/dbconfig/20240918-163857-arnaudb.json
[16:39:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69283 and previous config saved to /var/cache/conftool/dbconfig/20240918-163902-arnaudb.json
[16:39:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69284 and previous config saved to /var/cache/conftool/dbconfig/20240918-163907-arnaudb.json
[16:40:21] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073844|Autopromote users into checkuser-temporary-account-viewer (T369187 T327913)]] (duration: 14m 06s)
[16:40:26] <stashbot>	 T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187
[16:40:26] <stashbot>	 T327913: Assign checkuser-temporary-account right to various groups - https://phabricator.wikimedia.org/T327913
[16:42:50] <sukhe>	 !log sudo cumin "A:cp" 'disable-puppet "merging CR 1073453"': T347114
[16:42:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:54] <stashbot>	 T347114: NetworkProbeLimit cookie for Probenet overwritten on every link hover event - https://phabricator.wikimedia.org/T347114
[16:43:19] <wikibugs>	 (03PS6) 10Ssingh: sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042
[16:43:47] <wikibugs>	 (03CR) 10Ssingh: "CI failure expected as US_DATACENTERS does not exist in the currently deployed version of wmflib. We will recheck but I wanted to get this" [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh)
[16:45:38] <wikibugs>	 (03PS2) 10AOkoth: wmnet: change ticket to vrts1003 [dns] - 10https://gerrit.wikimedia.org/r/1073490 (https://phabricator.wikimedia.org/T373420)
[16:45:42] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] NetworkProbeLimit Cookie: avoid nop re-set-cookie [puppet] - 10https://gerrit.wikimedia.org/r/1073453 (https://phabricator.wikimedia.org/T347114) (owner: 10BBlack)
[16:46:40] <mutante>	 we are debugging why sirenbot is doing that
[16:46:48] <mutante>	 for some reason it can't write to the local sqlite db 
[16:46:55] <sukhe>	 oh good catch, not sure
[16:46:59] <mutante>	 but there seems to be no reason for that
[16:47:12] <mutante>	 same permissions compared to other host
[16:47:14] <mutante>	 same sqlite and all
[16:47:41] <mutante>	 so it joins, reads the channel topic, wants to write it to sqlite and fails
[16:47:53] <mutante>	 the db file is there but empty.. 
[16:50:03] <volans>	 there are no tables mutante, probably it needs them
[16:50:07] <volans>	  error="no such table: topics"
[16:50:37] <volans>	 I guess it needed some schema pre-loaded in the db or some init to call or carry over the pre-existing db in another host
[16:50:50] * volans not familiar with it, so just mentioning common scenarios
[16:50:52] <mutante>	 volans: good find, thanks. So I guess we could just copy the file from the other host.. but it's still a mystery since denisse reports they didnt have to do that last time and it all just worked without doing that
[16:51:06] <mutante>	 the "maybe it needs an init somehow" was already a guess
[16:51:43] <denisse>	 What makes this weird is that we didn't had to copy the DB when we failed over to alert2002 last week.
[16:51:52] <denisse>	 So I'm not sure what's the root cause of the issue.
[16:52:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 10%: T375050', diff saved to https://phabricator.wikimedia.org/P69285 and previous config saved to /var/cache/conftool/dbconfig/20240918-165232-arnaudb.json
[16:52:38] <stashbot>	 T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050
[16:53:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69286 and previous config saved to /var/cache/conftool/dbconfig/20240918-165327-arnaudb.json
[16:53:32] <stashbot>	 T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104
[16:53:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69287 and previous config saved to /var/cache/conftool/dbconfig/20240918-165332-arnaudb.json
[16:53:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69288 and previous config saved to /var/cache/conftool/dbconfig/20240918-165337-arnaudb.json
[16:53:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69289 and previous config saved to /var/cache/conftool/dbconfig/20240918-165344-arnaudb.json
[16:53:51] <volans>	 denisse: is there some sync mechanism that keeps the db in sync between hosts? or was it tested on that host so that a local db was already there,
[16:53:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69290 and previous config saved to /var/cache/conftool/dbconfig/20240918-165352-arnaudb.json
[16:53:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69291 and previous config saved to /var/cache/conftool/dbconfig/20240918-165357-arnaudb.json
[16:54:02] <volans>	 or just a puppetization error that did it there but not here
[16:54:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69292 and previous config saved to /var/cache/conftool/dbconfig/20240918-165403-arnaudb.json
[16:54:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69293 and previous config saved to /var/cache/conftool/dbconfig/20240918-165407-arnaudb.json
[16:54:11] <mutante>	 we already deleted the emtpy db file and let puppet run again 
[16:54:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69294 and previous config saved to /var/cache/conftool/dbconfig/20240918-165412-arnaudb.json
[16:54:28] <mutante>	 but yea, I guess let's just copy it regardless
[16:54:41] <mutante>	 why not keep the topic data, right?
[16:55:19] <wikibugs>	 (03PS3) 10JMeybohm: Fix ferm_status to actually compare rules [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366)
[16:55:36] <wikibugs>	 (03CR) 10JMeybohm: Fix ferm_status to actually compare rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm)
[16:55:39] <mutante>	 the real question seems what initially creates the tables
[16:56:22] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[16:56:26] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[16:56:50] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[16:56:52] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[16:56:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh)
[16:57:28] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[16:58:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Fix ferm_status to actually compare rules [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm)
[16:59:33] <denisse>	 volans: There doesn't seem to be a sync mechanism between them. When we failed over to alert2002 last week (same setup) the issue didn't happen, the tables were created correctly and the DB was populated with data.
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700)
[17:02:24] <mutante>	 !log copied vopsbot.db from alert1001 to alert1002; restarted vopsbot
[17:02:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:21] <mutante>	 looks like it's working. swfrench-wmf appears in the topic db
[17:06:03] <wikibugs>	 (03CR) 10Bking: [C:03+1] wdqs max lag: break up extremely long line [alerts] - 10https://gerrit.wikimedia.org/r/1073534 (owner: 10Ryan Kemper)
[17:07:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 15%: T375050', diff saved to https://phabricator.wikimedia.org/P69296 and previous config saved to /var/cache/conftool/dbconfig/20240918-170738-arnaudb.json
[17:07:43] <stashbot>	 T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050
[17:08:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69297 and previous config saved to /var/cache/conftool/dbconfig/20240918-170833-arnaudb.json
[17:08:38] <stashbot>	 T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104
[17:08:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69298 and previous config saved to /var/cache/conftool/dbconfig/20240918-170838-arnaudb.json
[17:08:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69299 and previous config saved to /var/cache/conftool/dbconfig/20240918-170843-arnaudb.json
[17:08:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69300 and previous config saved to /var/cache/conftool/dbconfig/20240918-170849-arnaudb.json
[17:08:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69301 and previous config saved to /var/cache/conftool/dbconfig/20240918-170858-arnaudb.json
[17:09:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69302 and previous config saved to /var/cache/conftool/dbconfig/20240918-170903-arnaudb.json
[17:09:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69303 and previous config saved to /var/cache/conftool/dbconfig/20240918-170909-arnaudb.json
[17:09:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69304 and previous config saved to /var/cache/conftool/dbconfig/20240918-170913-arnaudb.json
[17:09:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69305 and previous config saved to /var/cache/conftool/dbconfig/20240918-170918-arnaudb.json
[17:14:11] <Dreamy_Jazz>	 jouncebot: nowandnext
[17:14:11] <jouncebot>	 For the next 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700)
[17:14:11] <jouncebot>	 In 0 hour(s) and 45 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1800)
[17:17:12] <wikibugs>	 (03PS1) 10Dreamy Jazz: Revert^2 "Create group for assigning checkuser-temporary-account right" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073853 (https://phabricator.wikimedia.org/T369187)
[17:17:24] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Revert^2 "Create group for assigning checkuser-temporary-account right" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073853 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz)
[17:17:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073853 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz)
[17:19:51] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:20:22] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:22:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 25%: T375050', diff saved to https://phabricator.wikimedia.org/P69306 and previous config saved to /var/cache/conftool/dbconfig/20240918-172243-arnaudb.json
[17:22:49] <stashbot>	 T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050
[17:24:12] <wikibugs>	 (03CR) 10Bking: airflow: allow the webserver and scheduler to be selectively deployed (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) (owner: 10Brouberol)
[17:25:40] <wikibugs>	 (03PS1) 10Ssingh: varnish: fix regex for NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/1073854
[17:25:54] <wikibugs>	 (03PS1) 10Xcollazo: Declare stream 'mediawiki.dump.revision_history.reconcile.v1.rc0' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755)
[17:26:30] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] varnish: fix regex for NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/1073854 (owner: 10Ssingh)
[17:29:16] <sukhe>	 !log re-enable puppet on A:cp to finish rolling out T368755
[17:29:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:20] <stashbot>	 T368755: Python job that reads from wmf_dumps.wikitext_inconsistent_row and calls EventGate - https://phabricator.wikimedia.org/T368755
[17:29:42] <sukhe>	 that's the wrong one 
[17:29:53] <sukhe>	 !log re-enable puppet on A:cp to finish rolling out T347114
[17:29:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:57] <stashbot>	 T347114: NetworkProbeLimit cookie for Probenet overwritten on every link hover event - https://phabricator.wikimedia.org/T347114
[17:37:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 50%: T375050', diff saved to https://phabricator.wikimedia.org/P69308 and previous config saved to /var/cache/conftool/dbconfig/20240918-173749-arnaudb.json
[17:37:54] <stashbot>	 T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050
[17:39:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073818 (https://phabricator.wikimedia.org/T373976) (owner: 10Elukey)
[17:40:45] <wikibugs>	 (03PS1) 10JMeybohm: wikikube: Remove remaining hiera files and role for non stacked masters [puppet] - 10https://gerrit.wikimedia.org/r/1073857 (https://phabricator.wikimedia.org/T353464)
[17:42:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073802 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[17:44:12] <wikibugs>	 (03PS2) 10JMeybohm: wikikube: Remove remaining hiera files and role for non stacked masters [puppet] - 10https://gerrit.wikimedia.org/r/1073857 (https://phabricator.wikimedia.org/T353464)
[17:44:12] <wikibugs>	 (03PS1) 10JMeybohm: wikikube: Disable requestctl ferm rules and definitions [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366)
[17:44:43] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm)
[17:46:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[17:47:12] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "Create group for assigning checkuser-temporary-account right" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073853 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz)
[17:47:31] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1073853|Revert^2 "Create group for assigning checkuser-temporary-account right" (T369187)]]
[17:47:35] <stashbot>	 T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187
[17:49:41] <wikibugs>	 (03PS4) 10JMeybohm: Fix ferm_status to actually compare rules [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366)
[17:49:42] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1073853|Revert^2 "Create group for assigning checkuser-temporary-account right" (T369187)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:49:42] <wikibugs>	 (03PS3) 10JMeybohm: wikikube: Remove remaining hiera files and role for non stacked masters [puppet] - 10https://gerrit.wikimedia.org/r/1073857 (https://phabricator.wikimedia.org/T353464)
[17:49:42] <wikibugs>	 (03PS2) 10JMeybohm: wikikube: Disable requestctl ferm rules and definitions [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366)
[17:51:02] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[17:52:06] <wikibugs>	 (03PS3) 10JMeybohm: wikikube: Disable requestctl ferm rules and definitions [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366)
[17:52:13] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm)
[17:52:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 75%: T375050', diff saved to https://phabricator.wikimedia.org/P69309 and previous config saved to /var/cache/conftool/dbconfig/20240918-175255-arnaudb.json
[17:52:59] <stashbot>	 T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050
[17:55:49] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073853|Revert^2 "Create group for assigning checkuser-temporary-account right" (T369187)]] (duration: 08m 18s)
[17:55:54] <stashbot>	 T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187
[17:56:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm)
[17:56:03] <Dreamy_Jazz>	 Finished my deploys for now
[17:56:17] <Dreamy_Jazz>	 jouncebot: nowandnext
[17:56:17] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700)
[17:56:17] <jouncebot>	 In 0 hour(s) and 3 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1800)
[18:00:05] <jouncebot>	 jnuche and dduvall: Your horoscope predicts another MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1800).
[18:01:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "Remove puppetmaster1003 from active Puppet 5 servers" [puppet] - 10https://gerrit.wikimedia.org/r/1073860 (https://phabricator.wikimedia.org/T373888)
[18:04:49] <wikibugs>	 (03CR) 10Dzahn: "This second option seems like it would require some more changes because currently the class httpd is instantiated inside the class profil" [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn)
[18:04:57] <wikibugs>	 (03PS3) 10Dzahn: gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804)
[18:05:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn)
[18:05:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on kubernetes2056:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2056 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:06:17] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10158427 (10RobH) So the SEL/idrac logs show no thermal events, and dell support is attempting to deny these support requests.  On checking cp3071, I don't see any thermal events in the logs:   ` r...
[18:08:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 100%: T375050', diff saved to https://phabricator.wikimedia.org/P69310 and previous config saved to /var/cache/conftool/dbconfig/20240918-180800-arnaudb.json
[18:08:06] <stashbot>	 T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050
[18:10:38] <wikibugs>	 (03PS4) 10Dzahn: gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804)
[18:11:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn)
[18:11:52] <wikibugs>	 (03PS5) 10Dzahn: gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804)
[18:12:46] <wikibugs>	 (03CR) 10Scott French: [C:03+1] services: remove old poolcounter nodes from MW's net policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073802 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[18:20:41] <wikibugs>	 (03PS6) 10Dzahn: gerrit::proxy: ensure /var/www/ exists before files under it [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804)
[18:21:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit::proxy: ensure /var/www/ exists before files under it [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn)
[18:23:10] <wikibugs>	 (03PS7) 10Dzahn: gerrit::proxy: ensure /var/www/ exists before files under it [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804)
[18:23:44] <wikibugs>	 (03CR) 10CDanis: [C:03+1] wikikube: Disable requestctl ferm rules and definitions [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm)
[18:24:48] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1073305/4028/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn)
[18:35:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on kubernetes2056:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2056 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:36:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - elastic1089 - https://phabricator.wikimedia.org/T374897#10158542 (10Dzahn)
[18:38:29] <wikibugs>	 06SRE, 10Observability-Metrics, 13Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10158541 (10Dzahn) Would it make sense to have clusters named after a team to group all machines owned by a specific subteam?   Or would that go against the purpose of clusters and th...
[18:45:47] <wikibugs>	 06SRE, 06collaboration-services, 10vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10158573 (10Dzahn) @Seddon Could you confirm if the google groups work for you and you are receiving mails there?  I think if that's the...
[18:46:04] <wikibugs>	 06SRE, 06collaboration-services, 10vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10158575 (10Dzahn) 05Open→03In progress
[18:46:16] <wikibugs>	 06SRE, 06collaboration-services, 10vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10158576 (10Dzahn) a:03Seddon
[19:08:57] <wikibugs>	 (03PS1) 10C. Scott Ananian: Re-order arguments to DataAccess::addTrackingCategory [core] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073871
[19:39:02] <wikibugs>	 (03PS1) 10Sohom Datta: Bring back quality colors before dark mode fixes [extensions/ProofreadPage] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073879 (https://phabricator.wikimedia.org/T375114)
[19:39:03] <wikibugs>	 (03PS1) 10Mforns: Modify service commons-impact-analytics to use data-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073880 (https://phabricator.wikimedia.org/T368035)
[19:53:47] <wikibugs>	 (03PS1) 10JHathaway: ci: fix bundle on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1073893
[19:57:41] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] ci: fix bundle on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1073893 (owner: 10JHathaway)
[19:57:58] <wikibugs>	 (03CR) 10Gmodena: Declare stream 'mediawiki.dump.revision_history.reconcile.v1.rc0' (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T2000). nyaa~
[20:00:04] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:54] <wikibugs>	 (03PS1) 10JHathaway: WIP - test [puppet] - 10https://gerrit.wikimedia.org/r/1073896
[20:01:50] <Jdlrobson>	 o/
[20:03:07] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073896 (owner: 10JHathaway)
[20:09:14] <wikibugs>	 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375130 (10phaultfinder) 03NEW
[20:15:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:16:03] <wikibugs>	 (03Abandoned) 10JHathaway: WIP - test [puppet] - 10https://gerrit.wikimedia.org/r/1073896 (owner: 10JHathaway)
[20:17:38] <wikibugs>	 (03PS5) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891)
[20:18:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:25:40] <toyofuku>	 Hi everyone we're gonna do some deploys!
[20:25:47] <toyofuku>	 Jdlrobson: lmk when you're here and ready
[20:26:52] <Jdlrobson>	 toyofuku: ready
[20:26:59] <toyofuku>	 Sounds good
[20:27:16] <toyofuku>	 Any reason they shouldn't all go out in one batch?  I'm guessing not based on my understanding
[20:29:00] <toyofuku>	 While we wait, the song rec of the day is Se Me Olvida by Maisak and Feid
[20:29:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins)
[20:30:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:30:57] <Jdlrobson>	 toyofuku: they can all go out together
[20:31:06] <toyofuku>	 roger that
[20:31:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073836 (https://phabricator.wikimedia.org/T370099) (owner: 10Jdlrobson)
[20:31:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073835 (https://phabricator.wikimedia.org/T374255) (owner: 10Jdlrobson)
[20:31:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073839 (https://phabricator.wikimedia.org/T374654) (owner: 10Jdlrobson)
[20:32:04] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy Vector 2022 on several Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073835 (https://phabricator.wikimedia.org/T374255) (owner: 10Jdlrobson)
[20:32:08] <wikibugs>	 (03Merged) 10jenkins-bot: Enable dark mode for all logged in users on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073836 (https://phabricator.wikimedia.org/T370099) (owner: 10Jdlrobson)
[20:32:09] <wikibugs>	 (03Merged) 10jenkins-bot: Limit quick surveys to wikis with messages defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073839 (https://phabricator.wikimedia.org/T374654) (owner: 10Jdlrobson)
[20:32:32] <logmsgbot>	 !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1073836|Enable dark mode for all logged in users on all projects (T370099)]], [[gerrit:1073835|Deploy Vector 2022 on several Wikimedia wikis (T374255)]], [[gerrit:1073839|Limit quick surveys to wikis with messages defined (T374654)]]
[20:32:39] <stashbot>	 T370099: Roll out dark mode to all projects (non-Wikipedia sites, logged-in users) - https://phabricator.wikimedia.org/T370099
[20:32:39] <stashbot>	 T374255: Deploy Vector 2022 on small wikis - https://phabricator.wikimedia.org/T374255
[20:32:40] <stashbot>	 T374654: Log messages at ERROR level on QuickSurvey channel: "Bad survey configuration: The XXX external survey must have a secure url." - https://phabricator.wikimedia.org/T374654
[20:35:00] <logmsgbot>	 !log toyofuku@deploy1003 toyofuku, jdlrobson: Backport for [[gerrit:1073836|Enable dark mode for all logged in users on all projects (T370099)]], [[gerrit:1073835|Deploy Vector 2022 on several Wikimedia wikis (T374255)]], [[gerrit:1073839|Limit quick surveys to wikis with messages defined (T374654)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:35:09] <toyofuku>	 Jdlrobson: we're on test servers!
[20:35:55] <toyofuku>	 While we wait for him to test, another banger: Yayo by Rema
[20:36:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10158825 (10phaultfinder)
[20:38:49] <Jdlrobson>	 (toyofuku: looking)
[20:38:57] <toyofuku>	 ty ty
[20:40:23] <Jdlrobson>	 ok all LGTM toyofuku please sync!
[20:40:29] <toyofuku>	 on it
[20:40:31] <logmsgbot>	 !log toyofuku@deploy1003 toyofuku, jdlrobson: Continuing with sync
[20:43:07] <wikibugs>	 (03PS6) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891)
[20:45:15] <icinga-wm_>	 PROBLEM - Ensure acme-chief-api is running on acmechief1002 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief
[20:45:24] <logmsgbot>	 !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073836|Enable dark mode for all logged in users on all projects (T370099)]], [[gerrit:1073835|Deploy Vector 2022 on several Wikimedia wikis (T374255)]], [[gerrit:1073839|Limit quick surveys to wikis with messages defined (T374654)]] (duration: 12m 52s)
[20:45:31] <stashbot>	 T370099: Roll out dark mode to all projects (non-Wikipedia sites, logged-in users) - https://phabricator.wikimedia.org/T370099
[20:45:31] <stashbot>	 T374255: Deploy Vector 2022 on small wikis - https://phabricator.wikimedia.org/T374255
[20:45:32] <stashbot>	 T374654: Log messages at ERROR level on QuickSurvey channel: "Bad survey configuration: The XXX external survey must have a secure url." - https://phabricator.wikimedia.org/T374654
[20:45:59] <toyofuku>	 Jdlrobson: all done!
[20:46:15] <icinga-wm_>	 RECOVERY - Ensure acme-chief-api is running on acmechief1002 is OK: PROCS OK: 1 process with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief
[20:46:24] <Jdlrobson>	 thanks toyofuku !
[20:46:26] <Jdlrobson>	 Appreciated!
[20:46:34] <wikibugs>	 (03PS1) 10Scott French: mw-(api-ext|web): scale up to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073904 (https://phabricator.wikimedia.org/T371273)
[20:46:37] <toyofuku>	 🫡
[20:48:48] <wikibugs>	 (03CR) 10Scott French: "This will be merged and applied ahead of depooling the RO services in codfw tomorrow. Thanks in advance for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073904 (https://phabricator.wikimedia.org/T371273) (owner: 10Scott French)
[20:52:28] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374741#10158902 (10Jhancock.wm) a:03Papaul This has been phsyically decommed, and offline in netbox.  @papaul, it is ready for you to remove the e...
[20:52:36] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374741#10158910 (10Jhancock.wm)
[20:55:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins)
[20:55:31] <wikibugs>	 (03PS7) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891)
[20:56:26] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[20:58:33] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd200[4-5] - https://phabricator.wikimedia.org/T372512#10158928 (10Jhancock.wm) a:03Jhancock.wm
[20:58:52] <Amir1>	 swfrench-wmf: about to start the circular replication cookbook
[20:59:12] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw
[20:59:34] <swfrench-wmf>	 Amir1: ack, thanks for the heads-up!
[20:59:55] <swfrench-wmf>	 is this the first time it's been used "for real"?
[20:59:59] <Amir1>	 yes
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T2100)
[21:00:38] * swfrench-wmf grabs popcorn
[21:00:44] <jynus>	 looking good s1
[21:00:59] <Amir1>	 s1 is good
[21:01:04] <Amir1>	 I wait a bit before next one
[21:01:09] <Amir1>	 in case things break
[21:01:13] <jynus>	 yep
[21:03:29] <Amir1>	 btw, these replicas are depooled. Some I know why but some I'm not sure: https://phabricator.wikimedia.org/P69307
[21:03:45] <Amir1>	 logs look good
[21:04:19] <jynus>	 let's add those to review, will ask A. too, as there is a lot of ongoing decoms etc
[21:04:48] <Amir1>	 thanks
[21:04:51] <Amir1>	 https://www.irccloud.com/pastebin/VtxVUtCO/
[21:05:11] <jynus>	 certainly we should do a general review of hosts and weights
[21:05:17] <Amir1>	 this is the most important part for me. Root is only us, but if it's RW, then mw might write stuff
[21:05:21] <jynus>	 manuel used to do those before switch
[21:05:33] <Amir1>	 yeah
[21:05:37] <Amir1>	 moving on to s2
[21:06:29] <jynus>	 I checked also mw logs, nothing worring there
[21:09:59] <Amir1>	 the errors in s2 are because of this: https://phabricator.wikimedia.org/T374852#10158957
[21:10:22] <wikibugs>	 (03CR) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script (0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins)
[21:11:16] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2138.codfw.wmnet - https://phabricator.wikimedia.org/T374852#10158957 (10Ladsgroup) You need to remove them from orchestrator too: https://wikitech.wikimedia.org/wiki/MariaDB/Decommissioning_a_DB_Host#Remove_host_from_orchestrat...
[21:11:30] <wikibugs>	 (03PS8) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891)
[21:13:41] <jynus>	 Amir1: just to be clear, you mean orch errors, no script/app errors, right?
[21:13:47] <Amir1>	 yeah
[21:13:52] <Amir1>	 this
[21:13:55] <jynus>	 ok, thanks
[21:14:03] <Amir1>	 https://usercontent.irccloud-cdn.com/file/6txoYbrg/grafik.png
[21:14:07] <jynus>	 that makes me not worry
[21:15:01] <wikibugs>	 (03PS7) 10Andrea Denisse: alert: Ensure Prometheus Alertmanager starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/1073903 (https://phabricator.wikimedia.org/T375138)
[21:15:01] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1073903/4035/" [puppet] - 10https://gerrit.wikimedia.org/r/1073903 (https://phabricator.wikimedia.org/T375138) (owner: 10Andrea Denisse)
[21:15:20] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2138.codfw.wmnet - https://phabricator.wikimedia.org/T374852#10158977 (10Jhancock.wm) 05Open→03Resolved
[21:15:42] <Amir1>	 s4 now
[21:18:47] <Amir1>	 s5
[21:19:03] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2137.codfw.wmnet - https://phabricator.wikimedia.org/T374851#10158988 (10Jhancock.wm) 05Open→03Resolved
[21:20:29] <Amir1>	 s6
[21:23:55] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:24:04] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2127.codfw.wmnet - https://phabricator.wikimedia.org/T374849#10158999 (10Jhancock.wm) 05Open→03Resolved
[21:27:09] <jynus>	 only x1 left?
[21:28:01] <Amir1>	 x1 is done
[21:28:06] <Amir1>	 RW ES sections left
[21:28:19] <Amir1>	 es6 and es7
[21:28:41] <jynus>	 there may be an issue on x1
[21:29:00] <jynus>	 the primary master is not replicating
[21:29:29] <Amir1>	 yeah and es6 too
[21:30:08] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2125.codfw.wmnet - https://phabricator.wikimedia.org/T374848#10159021 (10Jhancock.wm) 05Open→03Resolved
[21:30:09] <jynus>	 not a breaking thing, but let me help investigate (I won't touch anything)
[21:30:20] <Amir1>	 feel free to touch anything :D
[21:30:41] <wikibugs>	 (03CR) 10Bking: [C:03+1] airflow: allow the webserver and scheduler to be selectively deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) (owner: 10Brouberol)
[21:31:47] <jynus>	 Could not execute Update_rows_v1 event on table heartbeat.heartbeat; Can't find record in 'heartbeat'
[21:31:57] <jynus>	 heartbeat table wasn't properly cleaned up
[21:33:24] <jynus>	 I know why
[21:33:37] <jynus>	 x1 and es use row based replicatin
[21:34:05] <jynus>	 this means that the REPLACE gets translated into update row
[21:34:20] <jynus>	 but that row doesn't exist on eqiad
[21:34:24] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2121.codfw.wmnet - https://phabricator.wikimedia.org/T374845#10159058 (10Jhancock.wm) 05Open→03Resolved
[21:34:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2122.codfw.wmnet - https://phabricator.wikimedia.org/T374846#10159062 (10Jhancock.wm) 05Open→03Resolved a:05ABran-WMF→03Jhancock.wm
[21:35:14] <jynus>	 this is an easy fix, but given nothing is broken, let me make sure I fix it rightly and I don't break the eqiad replicas (I just need to insert a row on eqiad master)
[21:35:17] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2124.codfw.wmnet - https://phabricator.wikimedia.org/T374847#10159060 (10Jhancock.wm) 05Open→03Resolved
[21:39:50] <jynus>	 Amir1: ok for me to apply the change to x1 master eqiad and restart replication ?
[21:39:57] <Amir1>	 yeah sure
[21:40:26] <jynus>	 x1 done
[21:40:50] <Amir1>	 Thanks!
[21:41:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10159081 (10phaultfinder)
[21:42:19] <icinga-wm_>	 PROBLEM - MariaDB Replica SQL: x1 on db2196 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table heartbeat.heartbeat: Duplicate entry 180360966 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1220-bin.015294, end_log_pos 946695232 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:43:11] <icinga-wm_>	 PROBLEM - Disk space on seaborgium is CRITICAL: DISK CRITICAL - free space: / 718 MB (3% inode=92%): /tmp 718 MB (3% inode=92%): /var/tmp 718 MB (3% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=seaborgium&var-datasource=eqiad+prometheus/ops
[21:43:43] <domas>	 oh no
[21:43:57] <domas>	 heartbeat dupe is weird!
[21:44:23] <swfrench-wmf>	 jynus: is the alert about x1 just a delayed response to what you've now repaired manually?
[21:44:34] <jynus>	 yeah, but it is causing fallout
[21:45:37] <wikibugs>	 (03PS1) 10JHathaway: ci: upgrade to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073906 (https://phabricator.wikimedia.org/T330490)
[21:45:50] <swfrench-wmf>	 ack, let me know if you need more hands / eyes on anything
[21:46:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ci: upgrade to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073906 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway)
[21:47:15] <wikibugs>	 (03PS2) 10JHathaway: ci: upgrade to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073906 (https://phabricator.wikimedia.org/T330490)
[21:47:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ci: upgrade to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073906 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway)
[21:48:19] <icinga-wm_>	 RECOVERY - MariaDB Replica SQL: x1 on db2196 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:48:21] <Amir1>	 okay, there are two rows in eqiad master
[21:48:25] <Amir1>	 but one row in codfw
[21:48:27] <jynus>	 I have deployed a temporary fix, which is replicate-wild-ignore-table=heartbeat.%
[21:48:44] <jynus>	 and that sould work for mediawiki, but we are in a weird state
[21:49:39] <jynus>	 the solution would be easy if we were on statement
[21:49:52] <jynus>	 because of the replaces
[21:51:50] <jynus>	 I wonder how to best move forward with es
[21:53:23] <jynus>	 because what I've done for x1 is just delay the issue until switchover
[21:55:31] <mutante>	 !log seaborgium - apt-get clean (disk space before: 98% used, now: 76% used, was alerting)
[21:55:33] <jynus>	 I think the right way to fix it is to undo the circular replication and insert the row or insert it witout logging
[21:55:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:13] <jynus>	 Amir1: on your side is everything finished?
[21:57:21] <Amir1>	 I haven't done s7 yet
[21:57:25] <jynus>	 es7
[21:57:31] <Amir1>	 should I do and then we revert it 
[21:57:33] <Amir1>	 yeah sorry
[21:57:53] <jynus>	 do you know the funny thing- this happened because heartbeat was cleaned :-D
[21:58:16] <jynus>	 if it was "dirty" it would have worked
[21:58:56] <Amir1>	 :((
[22:00:32] <jynus>	 let me try to fix es6 in a cleaner way
[22:00:48] <jynus>	 by inserting without binlog a new codfw row
[22:02:23] <Amir1>	 thanks
[22:03:11] <icinga-wm_>	 RECOVERY - Disk space on seaborgium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=seaborgium&var-datasource=eqiad+prometheus/ops
[22:05:52] <jynus>	 yeah, that works for es6
[22:06:09] <jynus>	 will do it now for es7 ahead of the circular
[22:06:26] <Amir1>	 Thanks
[22:06:36] <Amir1>	 let me know once you're done and I will run it
[22:06:58] <jynus>	 yep, taking my time, to make sure I dont break stuff
[22:07:44] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] mw-(api-ext|web): scale up to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073904 (https://phabricator.wikimedia.org/T371273) (owner: 10Scott French)
[22:11:08] <jynus>	 Amir1: done, you should be good to go and it should just work
[22:11:18] <Amir1>	 going
[22:11:32] <jynus>	 waiting to check everthing is ok before fixing x1 for real
[22:11:55] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from eqiad to codfw
[22:12:05] <Amir1>	 done ^
[22:12:16] <jynus>	 and this time it looks good
[22:12:50] <jynus>	 ok, the fixing x1 codfw to remove the ignore table
[22:13:16] <jynus>	 will do the same thing, apply the change without logs on all codfw hosts
[22:17:47] <Amir1>	 thanks
[22:18:50] <jynus>	 I should have logged all of this
[22:19:12] <jynus>	 !log inserting without binlog missing heartbeat reecod on x1 codfw hosts
[22:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:36] <jynus>	 and we should be ok
[22:23:03] <jynus>	 as in healthy/as expected/no hidden bomb
[22:23:28] <jynus>	 and as we should after circulat replication everywhere it should
[22:24:31] <jynus>	 my first thought of why this happened is that either there was something for ROW that we missed or this was a "hidden" bomb after cleaning up heartbeat, that only showed up in this one
[22:25:10] <jynus>	 and we should either not cleanup ROW replicas or add the record beforehand
[22:25:26] <jynus>	 maybe it was something else, but this is my first impression
[22:25:33] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker1007.eqiad.wmnet, mw1477.eqiad.wmnet, parse1013.eqiad.wmnet, mw1380.eqiad.wmnet, mw1448.eqiad.wmnet, parse1007.eqiad.wmnet, mw1451.eqiad.wmnet, wikikube-worker1020.eqiad.wmnet, mw1367.eqiad.wmnet, mw1475.eqiad.wmnet, mw1459.eqiad.wmnet, parse1011.eqiad.wmnet, mw1476.eqiad.wmnet, kubernetes1062.eqiad.wmnet, k
[22:25:34] <icinga-wm_>	 1022.eqiad.wmnet, mw1384.eqiad.wmnet, mw1479.eqiad.wmnet, mw1378.eqiad.wmnet, mw1462.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, mw1388.eqiad.wmnet, mw1480.eqiad.wmnet, mw1482.eqiad.wmnet, parse1009.eqiad.wmnet, kubernetes1040.eqiad.wmnet, mw1405.eqiad.wmnet, mw1495.eqiad.wmnet, kubernetes1030.eqiad.wmnet, kubernetes1038.eqiad.wmnet, mw1424.eqiad.wmnet, mw1461.eqiad.wmnet, mw1488.eqiad.wmnet, parse1010.eqiad.wmnet, wikikube-work
[22:25:34] <icinga-wm_>	 iad.wmnet, mw1465.eqiad.wmnet, wikikube-worker1018.eqiad.wmnet, mw1389.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, parse1012.eqiad.wmnet, wikikube-worker1025.eqiad.wmnet, mw149 https://wikitech.wikimedia.org/wiki/PyBal
[22:25:37] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker1007.eqiad.wmnet, mw1451.eqiad.wmnet, mw1433.eqiad.wmnet, mw1380.eqiad.wmnet, mw1462.eqiad.wmnet, mw1457.eqiad.wmnet, mw1455.eqiad.wmnet, mw1475.eqiad.wmnet, mw1374.eqiad.wmnet, wikikube-worker1013.eqiad.wmnet, parse1011.eqiad.wmnet, mw1439.eqiad.wmnet, kubernetes1011.eqiad.wmnet, wikikube-worker1029.eqiad.w
[22:25:37] <icinga-wm_>	 386.eqiad.wmnet, mw1384.eqiad.wmnet, parse1013.eqiad.wmnet, mw1479.eqiad.wmnet, mw1470.eqiad.wmnet, mw1390.eqiad.wmnet, mw1430.eqiad.wmnet, parse1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, mw1495.eqiad.wmnet, parse1014.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1463.eqiad.wmnet, mw1435.eqiad.wmnet, mw1424.eqiad.wmnet, mw1454.eqiad.wmnet, parse1005.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1017.eqiad.wmnet, wikikube-w
[22:25:37] <icinga-wm_>	 .eqiad.wmnet, mw1477.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, parse1012.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1496.eqiad.wmnet, kubernetes1060.eqiad.wmnet, mw1449.eqiad https://wikitech.wikimedia.org/wiki/PyBal
[22:26:12] <jynus>	 Amir1: my guess is you would have encountered this no matter the method of setting up circular replication
[22:26:41] <Amir1>	 sigh
[22:27:22] <Amir1>	 joy
[22:27:33] <Amir1>	 the lvs issue can't be us. Right?
[22:28:12] <swfrench-wmf>	 looking at that now
[22:28:18] <cdanis>	 Amir1: no, that's likely an issue with the eventstreams service
[22:28:27] <swfrench-wmf>	 looks like something has gone sideways with eventstreams, yeah
[22:28:31] <jynus>	 Amir I am filing this empty https://phabricator.wikimedia.org/T375144
[22:28:35] <jynus>	 and going to bed :-D
[22:28:45] <Amir1>	 I go to the airport now
[22:28:54] <swfrench-wmf>	 thank you both for working on this, Amir1 and jynus!
[22:28:56] <Amir1>	 ping me if something breaks really bad
[22:30:32] <jynus>	 the surprising thing about this is why this didn't break before, not why it broke today :-D
[22:35:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[22:36:12] <swfrench-wmf>	 !incidents
[22:36:13] <sirenbot>	 5258 (UNACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[22:36:19] <swfrench-wmf>	 !ack 5258
[22:36:19] <sirenbot>	 5258 (ACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[22:40:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[22:40:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventstreams.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:40:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://eventstreams.svc.eqiad.wmnet:4892 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:43:02] <cdanis>	 !incidents
[22:43:02] <sirenbot>	 5259 (UNACKED)  ATSBackendErrorsHigh cache_text sre (eventstreams.discovery.wmnet eqiad)
[22:43:02] <sirenbot>	 5258 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[22:43:09] <cdanis>	 !ack 5259
[22:43:10] <sirenbot>	 5259 (ACKED)  ATSBackendErrorsHigh cache_text sre (eventstreams.discovery.wmnet eqiad)
[22:43:17] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: sync
[22:43:44] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync
[22:44:34] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:44:37] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:45:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventstreams.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:45:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://eventstreams.svc.eqiad.wmnet:4892 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:03:33] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers parse1011.eqiad.wmnet, mw1419.eqiad.wmnet, mw1386.eqiad.wmnet, mw1470.eqiad.wmnet, mw1462.eqiad.wmnet, mw1388.eqiad.wmnet, mw1480.eqiad.wmnet, parse1009.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1435.eqiad.wmnet, mw1488.eqiad.wmnet, mw1454.eqiad.wmnet, parse1010.eqiad.wmnet, mw1425.eqiad.wmnet, kubernetes1012.eqiad.wmnet,
[23:03:34] <icinga-wm_>	 qiad.wmnet, kubernetes1033.eqiad.wmnet, kubernetes1014.eqiad.wmnet, mw1367.eqiad.wmnet, mw1486.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1360.eqiad.wmnet, mw1458.eqiad.wmnet, mw1468.eqiad.wmnet, mw1464.eqiad.wmnet, parse1019.eqiad.wmnet, kubernetes1056.eqiad.wmnet, mw1472.eqiad.wmnet, kubernetes1035.eqiad.wmnet, mw1379.eqiad.wmnet, parse1007.eqiad.wmnet, wikikube-worker1020.eqiad.wmnet, wikikube-worker1022.eqiad.wmnet, mw1378.eqiad.wmne
[23:03:34] <icinga-wm_>	 .eqiad.wmnet, mw1482.eqiad.wmnet, mw1357.eqiad.wmnet, mw1496.eqiad.wmnet, kubernetes1060.eqiad.wmnet, kubernetes1020.eqiad.wmnet, mw1397.eqiad.wmnet, kubernetes1027.eqiad.wmnet, mw1414. https://wikitech.wikimedia.org/wiki/PyBal
[23:03:37] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers parse1011.eqiad.wmnet, mw1380.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1419.eqiad.wmnet, mw1434.eqiad.wmnet, mw1479.eqiad.wmnet, kubernetes1023.eqiad.wmnet, mw1462.eqiad.wmnet, kubernetes1030.eqiad.wmnet, parse1021.eqiad.wmnet, mw1424.eqiad.wmnet, mw1393.eqiad.wmnet, mw1488.eqiad.wmnet, mw1454.eqiad.wmnet, parse1005.eqia
[23:03:37] <icinga-wm_>	 wikikube-worker1003.eqiad.wmnet, mw1370.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1425.eqiad.wmnet, mw1395.eqiad.wmnet, mw1465.eqiad.wmnet, kubernetes1014.eqiad.wmnet, mw1466.eqiad.wmnet, kubernetes1018.eqiad.wmnet, mw1369.eqiad.wmnet, mw1469.eqiad.wmnet, mw1486.eqiad.wmnet, wikikube-worker1001.eqiad.wmnet, mw1458.eqiad.wmnet, parse1001.eqiad.wmnet, mw1453.eqiad.wmnet, mw1468.eqiad.wmnet, wikikube-worker1010.eqiad.wmnet, kubernetes1015.
[23:03:37] <icinga-wm_>	 et, kubernetes1008.eqiad.wmnet, kubernetes1031.eqiad.wmnet, mw1464.eqiad.wmnet, mw1391.eqiad.wmnet, wikikube-worker1028.eqiad.wmnet, kubernetes1056.eqiad.wmnet, parse1006.eqiad.wmnet, p https://wikitech.wikimedia.org/wiki/PyBal
[23:14:37] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:15:33] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:16:41] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10159193 (10Papaul) @Dwisehaupt hello since we decommissioned frban2001 is it possible for you to downtime and power down pay-lb2001 for us tomorrow...
[23:30:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10159198 (10phaultfinder)
[23:38:27] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1073911
[23:38:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1073911 (owner: 10TrainBranchBot)
[23:59:33] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10159218 (10Dwisehaupt) @Papaul All set. Powered down and set a downtime for 26 hours.