[00:10:21] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1073570 (owner: TrainBranchBot)
[00:11:29] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on gitlab2002.wikimedia.org with reason: version upgrade
[00:11:42] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on gitlab2002.wikimedia.org with reason: version upgrade
[00:54:51] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:54:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[00:55:09] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:24:23] PROBLEM - Uncommitted DNS changes in Netbox on netbox1003 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[01:25:11] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10155686 (phaultfinder)
[01:53:57] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10155692 (Papaul) I think replacing the pfw first will be a good idea since we are not changing any configuration on them but just the name and less...
[01:56:38] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10155693 (Papaul)
[01:57:16] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10155694 (Papaul)
[02:13:40] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10155696 (Papaul) I update the diagram again since we will not be using VC. {F57520229}
[02:16:05] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10155698 (Papaul) While working on setting up the new fasw2-c8-codfw I realized that fpc0 has interface ge-0/0/47 connected to fmsw-c8-codfw...
[02:43:55] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:13:55] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:20:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:54:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:12:09] (PS1) Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1073587 (https://phabricator.wikimedia.org/T375047)
[05:13:45] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:13:51] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s6 T375047
[05:24:03] T375047: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T375047
[05:24:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s6 T375047
[05:24:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2214 with weight 0 T375047', diff saved to https://phabricator.wikimedia.org/P69240 and previous config saved to /var/cache/conftool/dbconfig/20240918-052446-arnaudb.json
[05:29:21] (CR) Arnaudb: [C:+2] mariadb: Promote db2214 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1073587 (https://phabricator.wikimedia.org/T375047) (owner: Gerrit maintenance bot)
[05:30:34] !log Starting s6 codfw failover from db2129 to db2214 - T375047
[05:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:38] T375047: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T375047
[05:31:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2214 to s6 primary T375047', diff saved to https://phabricator.wikimedia.org/P69241 and previous config saved to /var/cache/conftool/dbconfig/20240918-053115-arnaudb.json
[05:33:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T375047', diff saved to https://phabricator.wikimedia.org/P69242 and previous config saved to /var/cache/conftool/dbconfig/20240918-053357-arnaudb.json
[05:36:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s4 T374804
[05:36:25] T374804: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T374804
[05:36:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2179 with weight 0 T374804', diff saved to https://phabricator.wikimedia.org/P69243 and previous config saved to /var/cache/conftool/dbconfig/20240918-053633-arnaudb.json
[05:36:37] PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:37:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T374804
[05:38:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2179 from API/vslow/dump T374804', diff saved to https://phabricator.wikimedia.org/P69244 and previous config saved to /var/cache/conftool/dbconfig/20240918-053807-arnaudb.json
[05:39:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:39:49] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:42:45] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52631 bytes in 9.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:42:45] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 6.731 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:43:13] !log Starting s4 codfw failover from db2140 to db2179 - T374804
[05:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:17] T374804: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T374804
[05:43:31] (CR) Arnaudb: [C:+2] mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1073035 (https://phabricator.wikimedia.org/T374804) (owner: Gerrit maintenance bot)
[05:45:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2179 to s4 primary T374804', diff saved to https://phabricator.wikimedia.org/P69245 and previous config saved to /var/cache/conftool/dbconfig/20240918-054515-arnaudb.json
[05:47:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T374804', diff saved to https://phabricator.wikimedia.org/P69246 and previous config saved to /var/cache/conftool/dbconfig/20240918-054729-arnaudb.json
[05:48:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T374807
[05:48:42] T374807: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T374807
[05:49:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2220 with weight 0 T374807', diff saved to https://phabricator.wikimedia.org/P69247 and previous config saved to /var/cache/conftool/dbconfig/20240918-054909-arnaudb.json
[05:49:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2220 from API/vslow/dump T374807', diff saved to https://phabricator.wikimedia.org/P69248 and previous config saved to /var/cache/conftool/dbconfig/20240918-054921-arnaudb.json
[05:49:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T374807
[06:01:38] (CR) Arnaudb: [C:+2] mariadb: Promote db2220 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1073039 (https://phabricator.wikimedia.org/T374807) (owner: Gerrit maintenance bot)
[06:02:37] RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:02:53] !log Starting s7 codfw failover from db2218 to db2220 - T374807
[06:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:57] T374807: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T374807
[06:03:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2220 to s7 primary T374807', diff saved to https://phabricator.wikimedia.org/P69249 and previous config saved to /var/cache/conftool/dbconfig/20240918-060332-arnaudb.json
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T374807', diff saved to https://phabricator.wikimedia.org/P69250 and previous config saved to /var/cache/conftool/dbconfig/20240918-060549-arnaudb.json
[06:07:10] (PS1) Gerrit maintenance bot: mariadb: Promote db2218 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1073699 (https://phabricator.wikimedia.org/T375050)
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:12:39] ops-codfw, SRE, collaboration-services, DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10155887 (ABran-WMF) all needed switchover prior to tonight have been done. I'll run T375050 as soon as this is done because circular r...
[06:39:14] !log installing curl security updates
[06:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:03] (PS6) Gmodena: ds8-k8s-service: add values for dumps2 job. [deployment-charts] - https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787)
[06:44:47] (CR) Gmodena: ds8-k8s-service: add values for dumps2 job. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: Gmodena)
[06:50:33] (PS1) Muehlenhoff: Switch the deployment role to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1073704 (https://phabricator.wikimedia.org/T349619)
[06:53:33] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS7195/IPv4: Connect - EdgeUno, AS7195/IPv6: Connect - EdgeUno https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:07:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 213, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:27] (CR) Hashar: [C:+1] "See my comment at T359795#10148316 , the manual `update-alternatives` would be overridden by the next Puppet run. But overall I think it" [puppet] - https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: Dzahn)
[07:12:42] (CR) Muehlenhoff: [C:+1] "Looks good" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[07:12:43] ops-esams, SRE, DC-Ops, Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10155908 (Vgutierrez) Answering here @RobH question: >Hey I made some assumptions on the cp hosts troubleshooting but should check with you: Those hosts are under the same weight conditions as al...
[07:13:55] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:17:33] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10155921 (MoritzMuehlenhoff)
[07:29:41] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 645, down: 83, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:30:21] (CR) Muehlenhoff: "profile::java takes care of setting the alternative as well, since L32 in "class java", the default JRE/JDK is the first Java version defi" [puppet] - https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: Dzahn)
[07:31:03] SRE, Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10155956 (MoritzMuehlenhoff)
[07:32:41] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS7195/IPv4: Connect - EdgeUno, AS7195/IPv6: Connect - EdgeUno https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:33:04] !log volans@cumin1002 START - Cookbook sre.dns.netbox
[07:35:39] (CR) Hashar: [C:-1] "There is `class { 'httpd': }` defined above which does an `ensure_packages('apache2')` and should thus install the `apache2` package befor" [puppet] - https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[07:35:48] (CR) DCausse: flink-app: customize calico label selector (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: Bking)
[07:37:29] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 680, down: 48, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:37:45] jouncebot: next
[07:37:45] In 0 hour(s) and 22 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T0800)
[07:38:16] (CR) DCausse: [C:+1] Add a private variant of the cirrus update stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1073565 (https://phabricator.wikimedia.org/T374335) (owner: Ebernhardson)
[07:39:38] ops-codfw, SRE, collaboration-services, DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10155971 (cmooney) >>! In T373104#10147494, @Jelto wrote: > `gitlab-runner2004` is a special purpose runner, so if we depool the runner...
[07:40:04] (CR) Brouberol: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: Brouberol)
[07:40:48] (CR) Elukey: "recheck" [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[07:40:57] (CR) Elukey: [V:+2 C:+2] debian: update the target distribution to bookworm-wikimedia [debs/helm3] - https://gerrit.wikimedia.org/r/1073160 (https://phabricator.wikimedia.org/T331969) (owner: Elukey)
[07:42:01] SRE, serviceops, Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10155980 (elukey) Open→Resolved
[07:43:35] !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fixed asset tag for db1179 - volans@cumin1002"
[07:44:03] Puppet, SRE, Infrastructure-Foundations, Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10155983 (elukey) I had a chat with Filippo, the keyholder-proxy is not the daemon that needs re-arming when restarted, so it can be done anytime withou...
[07:45:09] !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fixed asset tag for db1179 - volans@cumin1002"
[07:45:09] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:48:35] (PS1) Volans: netbox: notify dcops for uncommitted DNS changes [puppet] - https://gerrit.wikimedia.org/r/1073732
[07:49:23] RECOVERY - Uncommitted DNS changes in Netbox on netbox1003 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[07:53:50] (PS2) Brouberol: cloudnative-pg-cluster: set sane defaults values for PG clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1073392 (https://phabricator.wikimedia.org/T372278)
[07:55:36] (CR) DCausse: [C:+1] "we might need to change the wikidata maxlag propagation bits as well (https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikidata.org/+/0" [puppet] - https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) (owner: Ryan Kemper)
[07:56:37] (CR) DCausse: [C:+1] wdqs max lag: target specific port [alerts] - https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) (owner: Ryan Kemper)
[07:56:40] (PS1) Elukey: role::puppetserver: add admin groups config [puppet] - https://gerrit.wikimedia.org/r/1073733 (https://phabricator.wikimedia.org/T368023)
[07:58:25] (CR) Elukey: [C:-1] "sigh https://puppet-compiler.wmflabs.org/output/1073733/4009/puppetserver1001.eqiad.wmnet/change.puppetserver1001.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/1073733 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[08:00:42] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10155999 (cmooney) Open→Resolved a:cmooney
[08:00:49] good morning, train will rollout in a few minutes
[08:06:45] (PS1) TrainBranchBot: group1 to 1.43.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1073734 (https://phabricator.wikimedia.org/T373642)
[08:06:46] (CR) TrainBranchBot: [C:+2] group1 to 1.43.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1073734 (https://phabricator.wikimedia.org/T373642) (owner: TrainBranchBot)
[08:07:33] (Merged) jenkins-bot: group1 to 1.43.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1073734 (https://phabricator.wikimedia.org/T373642) (owner: TrainBranchBot)
[08:08:23] (CR) Ayounsi: [C:+1] netbox: notify dcops for uncommitted DNS changes [puppet] - https://gerrit.wikimedia.org/r/1073732 (owner: Volans)
[08:09:42] (PS2) Stevemunene: hdfs: Assign the worker role to new hadoop workers [puppet] - https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788)
[08:09:48] (CR) Stevemunene: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[08:11:04] (CR) Muehlenhoff: role::puppetserver: add admin groups config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1073733 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[08:11:50] (CR) Elukey: [C:-1] "So profile::puppetserver::git defines sudo::user, that in turn creates /etc/sudoers.d. The same file is created by profile::admins -> sudo" [puppet] - https://gerrit.wikimedia.org/r/1073733 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[08:12:34] (CR) Muehlenhoff: role::puppetserver: add admin groups config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1073733 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[08:13:09] (CR) Filippo Giunchedi: [C:+2] corto: force directory removal [puppet] - https://gerrit.wikimedia.org/r/1073412 (owner: Filippo Giunchedi)
[08:14:49] (CR) Elukey: [C:+2] services: remove old poolcounter netpolicies for Thumbor [deployment-charts] - https://gerrit.wikimedia.org/r/1073164 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[08:14:55] (CR) Brouberol: ds8-k8s-service: add values for dumps2 job. (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: Gmodena)
[08:15:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:55] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.23 refs T373642
[08:15:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2017.codfw.wmnet
[08:16:02] T373642: 1.43.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T373642
[08:16:15] ops-codfw, SRE, collaboration-services, DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10156095 (ops-monitoring-bot) Draining ganeti2017.codfw.wmnet of running VMs
[08:17:57] (CR) Muehlenhoff: [C:+1] "Looks good" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073503 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[08:18:01] (CR) Stevemunene: [C:+1] "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/1073392 (https://phabricator.wikimedia.org/T372278) (owner: Brouberol)
[08:18:34] (CR) Muehlenhoff: [C:+1] "Looks good" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073502 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[08:18:57] (CR) Stevemunene: [C:+1] "looks good" [deployment-charts] - https://gerrit.wikimedia.org/r/1073464 (owner: Brouberol)
[08:20:43] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:20:44] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:20:46] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:20:46] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:21:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2017.codfw.wmnet
[08:21:44] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:21:44] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:22:02] train needs to be rolled back
[08:22:11] :(
[08:22:38] (PS1) TrainBranchBot: group0 to 1.43.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1073736 (https://phabricator.wikimedia.org/T373642)
[08:22:40] (CR) TrainBranchBot: [C:+2] group0 to 1.43.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1073736 (https://phabricator.wikimedia.org/T373642) (owner: TrainBranchBot)
[08:22:47] PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:23:24] (Merged) jenkins-bot: group0 to 1.43.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1073736 (https://phabricator.wikimedia.org/T373642) (owner: TrainBranchBot)
[08:23:31] (CR) Btullis: [C:+1] "Looks good." [deployment-charts] - https://gerrit.wikimedia.org/r/1073445 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[08:23:44] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:23:47] RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:23:56] (CR) Brouberol: [C:+2] airflow: ensure each airflow release store logs to a unique s3 bucket [deployment-charts] - https://gerrit.wikimedia.org/r/1073445 (https://phabricator.wikimedia.org/T372281) (owner: Brouberol)
[08:24:15] (CR) Brouberol: [C:+2] cloudnative-pg: grant the deploy user the ability to create manual backups [deployment-charts] - https://gerrit.wikimedia.org/r/1073464 (owner: Brouberol)
[08:24:43] (CR) Hashar: [C:-1] "And in Puppet state files, the `apache2` install is ordered after `/var/www/robots.txt`:" [puppet] - https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[08:24:43] PROBLEM - nova-compute proc maximum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:24:44] PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:24:45] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:24:47] RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:25:18] (CR) Hashar: [C:+1] Switch the deployment role to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1073704 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:25:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:25:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:25:43] PROBLEM - nova-compute proc maximum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:26:43] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:26:44] (CR) Hashar: [C:+1] deployment servers: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1072744 (owner: Muehlenhoff)
[08:28:04] (CR) Hashar: [C:+1] "Once applied, I can do a dummy deployment on a simple repository such as `integration/docroot` to validate everything still works :)" [puppet] - https://gerrit.wikimedia.org/r/1072744 (owner: Muehlenhoff)
[08:28:19] (CR) Muehlenhoff: [C:+2] Switch the deployment role to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1073704 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:28:37] PROBLEM - nova-compute proc minimum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:29:43] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:30:23] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.23 refs T373642
[08:30:27] T373642: 1.43.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T373642
[08:30:44] PROBLEM - nova-compute proc maximum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:31:43] RECOVERY - nova-compute proc maximum on cloudvirt1039 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:31:44] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:32:13] !log install openjdk-17-jdk on puppetserver1002 to get some useful tools like jmap - T373527
[08:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:17] T373527: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527
[08:32:54] (CR) Btullis: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: Brouberol)
[08:33:24] PROBLEM - nova-compute proc maximum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:33:24] (CR) Btullis: [C:+1] "Nice, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1073392 (https://phabricator.wikimedia.org/T372278) (owner: Brouberol)
[08:33:44] PROBLEM - nova-compute proc maximum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:33:48] (CR) Btullis: [C:+1] cloudnative-pg: grant the deploy user the ability to create manual backups [deployment-charts] - https://gerrit.wikimedia.org/r/1073464 (owner: Brouberol)
[08:33:51] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10156154 (MoritzMuehlenhoff)
[08:34:03] (CR) Brouberol: [C:+2] cloudnative-pg-cluster: set sane defaults values for PG clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1073392 (https://phabricator.wikimedia.org/T372278) (owner: Brouberol)
[08:35:38] (CR) Brouberol: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: Brouberol)
[08:35:43] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:35:44] RECOVERY - nova-compute proc maximum on cloudvirt1031 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:36:34] SRE, SRE-Access-Requests: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060 (Cyndymediawiksim) NEW
[08:38:47] (PS4) Brouberol: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket [deployment-charts] - https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938)
[08:39:00] (CR) Brouberol: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: Brouberol)
[08:41:40] !log centrallog2002 upgrade to bookworm in progress https://phabricator.wikimedia.org/T353912
[08:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:44] RECOVERY - nova-compute proc maximum on cloudvirt1043 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:43:44] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:43:44] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:43:45] RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:45:44] RECOVERY - nova-compute proc maximum on cloudvirt1048 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:45:44] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:46:43] PROBLEM - nova-compute
proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:46:47] PROBLEM - nova-compute proc minimum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:46:47] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:46:48] PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:47:44] PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:47:45] PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:47:47] PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:48:43] PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:48:44] PROBLEM - nova-compute proc minimum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:48:44] PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:48:45] PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:49:47] PROBLEM - nova-compute proc minimum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:29] 06SRE, 10SRE-Access-Requests: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10156237 (10DMburugu) I approve this [08:50:37] PROBLEM - nova-compute proc minimum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:38] PROBLEM - nova-compute proc minimum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:45] PROBLEM - nova-compute proc maximum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:47] PROBLEM - nova-compute proc maximum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:48] 
PROBLEM - nova-compute proc maximum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:48] PROBLEM - nova-compute proc maximum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:44] PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:44] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:44] PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:45] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:46] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:47] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:48] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: 
PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:49] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:50] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:51] PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:52] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:53] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:54] PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:55] RECOVERY - nova-compute proc minimum on cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:56] PROBLEM - nova-compute proc maximum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:57] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:58] PROBLEM - nova-compute proc maximum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:59] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:52:00] PROBLEM - nova-compute proc maximum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:52:44] PROBLEM - nova-compute proc maximum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:52:45] PROBLEM - nova-compute proc maximum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:52:45] PROBLEM - nova-compute proc maximum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:52:50] PROBLEM - nova-compute proc minimum on cloudvirt1065 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:52:51] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable Community Updates module in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073739 (https://phabricator.wikimedia.org/T374577) [08:53:44] RECOVERY - nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:45] PROBLEM - nova-compute proc maximum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:45] RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:50] RECOVERY - nova-compute proc maximum on cloudvirt1050 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:50] RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:51] RECOVERY - nova-compute proc maximum on cloudvirt1053 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:52] RECOVERY - nova-compute proc minimum on cloudvirt1065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:54:26] PROBLEM - nova-compute proc maximum on cloudvirtlocal1003 is CRITICAL: PROCS 
CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:54:29] PROBLEM - nova-compute proc maximum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:54:44] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:54:50] RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:54:51] RECOVERY - nova-compute proc maximum on cloudvirt1049 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:54:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:55:12] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10156264 (10Jelto) In `wikimedia-gitlab`, there have been some reports of failing jobs (cc... 
[08:55:44] PROBLEM - nova-compute proc maximum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:45] PROBLEM - nova-compute proc maximum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:45] PROBLEM - nova-compute proc maximum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:46] PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:47] PROBLEM - nova-compute proc maximum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:48] PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:49] PROBLEM - nova-compute proc maximum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:50] PROBLEM - nova-compute proc maximum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[08:55:51] PROBLEM - nova-compute proc maximum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:52] PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:53] PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:54] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:55] RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:56:27] RECOVERY - nova-compute proc maximum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:56:37] RECOVERY - nova-compute proc minimum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:56:45] RECOVERY - nova-compute proc maximum on cloudvirt1059 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:56:45] RECOVERY - nova-compute proc minimum on cloudvirt1059 is OK: PROCS OK: 1 process 
with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:56:56] (03CR) 10Jcrespo: [C:03+1] "There are some outstanding issues but no blocker. The wait time, however, should be much smaller than 50 seconds. The original 10 seconds " [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [08:57:15] (03Abandoned) 10Hashar: Read closed-labs as closed tag on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332940 (https://phabricator.wikimedia.org/T115584) (owner: 10Alex Monk) [08:59:45] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:59:46] RECOVERY - nova-compute proc maximum on cloudvirt1056 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:44] RECOVERY - nova-compute proc maximum on cloudvirt1035 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:44] RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:44] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:45] RECOVERY - nova-compute proc maximum on cloudvirt1033 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:46] RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:47] RECOVERY - nova-compute proc maximum on cloudvirt1037 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:49] RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:50] RECOVERY - nova-compute proc maximum on cloudvirt1051 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:44] RECOVERY - nova-compute proc maximum on cloudvirt1039 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:44] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:45] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:45] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:46] RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 
process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:47] RECOVERY - nova-compute proc maximum on cloudvirt1042 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:48] RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:49] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:52] !log drain ganeti2026 T373104 [09:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:56] T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104 [09:02:43] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:02:45] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:02:46] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:02:46] RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:03:44] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:03:45] RECOVERY - nova-compute proc maximum on cloudvirt1044 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:03:45] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:03:46] RECOVERY - nova-compute proc maximum on cloudvirt1041 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:04:13] FIRING: JobUnavailable: Reduced availability for job mtail in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:05:45] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:05:46] RECOVERY - nova-compute proc maximum on cloudvirt1048 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:05:49] RECOVERY - nova-compute proc minimum on cloudvirt1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:05:50] RECOVERY - 
nova-compute proc maximum on cloudvirt1052 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:06:44] RECOVERY - nova-compute proc minimum on cloudvirt1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:06:45] RECOVERY - nova-compute proc maximum on cloudvirt1058 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:06:46] RECOVERY - nova-compute proc maximum on cloudvirt1054 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:06:49] RECOVERY - nova-compute proc minimum on cloudvirt1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:07:45] RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:07:46] RECOVERY - nova-compute proc maximum on cloudvirt1061 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:13] RESOLVED: JobUnavailable: Reduced availability for job mtail in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:09:26] RECOVERY - nova-compute proc maximum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with PPID = 1, 
regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:29] RECOVERY - nova-compute proc maximum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:37] RECOVERY - nova-compute proc minimum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:38] RECOVERY - nova-compute proc minimum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:10:48] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4010/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073740 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [09:10:50] (03Abandoned) 10Hashar: Increase the url shortener url size limit from 2k to 5k [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617843 (https://phabricator.wikimedia.org/T220703) (owner: 10Ladsgroup) [09:11:16] !log tappof@cumin2002 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [09:11:37] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [09:13:37] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:13:57] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:13:58] (03Abandoned) 10Hashar: Enable ORES on dewiki [mediawiki-config] - 
https://gerrit.wikimedia.org/r/489936 (https://phabricator.wikimedia.org/T215354) (owner: Catrope)
[09:14:01] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:14:01] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:15:06] (Merged) jenkins-bot: sre.switchdc.databases: new cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[09:18:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 42s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:19:13] FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:20:13] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10156310 (phaultfinder)
[09:20:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:23:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 26s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:24:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job mtail in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:25:57] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 376, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:26:01] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:26:01] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:26:27] !log tappof@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
[09:26:29] SRE, Infrastructure-Foundations: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10156328 (elukey) I tried to generate a heap dump with jmap but it is very large and I'd need to copy it to my local laptop to inspect it via VisualVM. There is...
[09:26:36] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 295, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:28:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 20s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:30:07] (CR) Btullis: [C:+1] "Nice, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: Brouberol)
[09:30:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 16.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:30:32] (CR) Gmodena: changeprop: Enable PCS pregeneration without restbase (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: Jgiannelos)
[09:32:02] (PS1) David Caro: prometheus::cloud: increase ceph scrape timeout [puppet] - https://gerrit.wikimedia.org/r/1073744
[09:32:21] jouncebot: next
[09:32:21] In 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1000)
[09:33:22] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1073744 (owner: David Caro)
[09:34:47] (CR) David Caro: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4011/co" [puppet] - https://gerrit.wikimedia.org/r/1073744 (owner: David Caro)
[09:35:28] (CR) David Caro: [V:+1 C:+2] "PCC looks good" [puppet] - https://gerrit.wikimedia.org/r/1073744 (owner: David Caro)
[09:35:49] dhinus: Sorry for the slow reply. That is the associated puppet patch.
[09:37:18] (Abandoned) Hashar: Demo: how group permissions could look like [mediawiki-config] - https://gerrit.wikimedia.org/r/738992 (owner: Ppchelko)
[09:37:30] SRE-tools, Infrastructure-Foundations, Spicerack: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#10156356 (Volans) p:Triage→Medium Thanks for the task. I think the main decision to make is how fresh the data needs to be. If we opt f...
[09:38:55] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:13] (CR) DCausse: [C:+1] wdqs max lag: target specific port (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1073533 (https://phabricator.wikimedia.org/T374916) (owner: Ryan Kemper)
[09:41:51] (PS7) Gmodena: ds8-k8s-service: add values for dumps2 job. [deployment-charts] - https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787)
[09:42:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 37.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:42:43] (CR) Gmodena: ds8-k8s-service: add values for dumps2 job. (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: Gmodena)
[09:43:09] (CR) Brouberol: [C:+2] cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket [deployment-charts] - https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: Brouberol)
[09:43:43] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:44:10] (Merged) jenkins-bot: cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket [deployment-charts] - https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: Brouberol)
[09:44:15] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:44:19] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:46:01] SRE, iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10156378 (jijiki) @Dreamy_Jazz I see these are SQL connection timeouts. While I dig into it, could you please let us know if that is impacting the iPoid (eg error rates, latency, or the sch...
[09:46:15] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:46:19] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:46:31] Dreamy_Jazz: the patch looks good to me and I can deploy it after it gets merged, but I'd like to have a review from Data Engineering & Data Persistence as well. leave it with me, I'll ping some people
[09:46:43] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:46:51] Thanks.
[09:47:07] dhinus: Is there any way to test it before it gets merged? I couldn't see a way to do that easily.
[09:47:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 35s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:47:47] I can use the components from the view to make an SQL query, but I'm not sure that is properly testing the change.
[09:48:12] (CR) Elukey: [C:+1] netbox: notify dcops for uncommitted DNS changes [puppet] - https://gerrit.wikimedia.org/r/1073732 (owner: Volans)
[09:48:22] (CR) Muehlenhoff: [C:+1] "Looks good!" [software/bitu] - https://gerrit.wikimedia.org/r/1073238 (owner: Slyngshede)
[09:48:55] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1073732 (owner: Volans)
[09:50:45] (PS5) Slyngshede: Notify managers via email when new permission requests are made. [software/bitu] - https://gerrit.wikimedia.org/r/1073238
[09:50:55] (CR) Slyngshede: Notify managers via email when new permission requests are made. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1073238 (owner: Slyngshede)
[09:50:59] (CR) Elukey: [C:+1] icinga: add Tiziano Fogli to ctrl variables [puppet] - https://gerrit.wikimedia.org/r/1060438 (owner: Tiziano Fogli)
[09:51:43] (CR) Muehlenhoff: [C:+1] "Good catch!"
[puppet] - https://gerrit.wikimedia.org/r/1073532 (https://phabricator.wikimedia.org/T374885) (owner: JHathaway)
[09:52:25] ops-codfw, SRE, DBA, DC-Ops, decommission-hardware: decommission db2124.codfw.wmnet - https://phabricator.wikimedia.org/T374847#10156389 (ABran-WMF) a:ABran-WMF→None
[09:52:25] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1073238 (owner: Slyngshede)
[09:52:29] (CR) Slyngshede: [C:+2] Notify managers via email when new permission requests are made. [software/bitu] - https://gerrit.wikimedia.org/r/1073238 (owner: Slyngshede)
[09:52:40] (PS2) Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486)
[09:52:47] Dreamy_Jazz: you can run the view query manually on a wikireplica, e.g. in quarry. not a complete test but it will catch any obvious errors in the view definition
[09:53:15] Dreamy_Jazz: yes what you wrote basically, I started typing before reading your message :)
[09:53:32] jnuche: o/ I am going to scap backport a mw-config change, just wanted to double check with you if it is ok
[09:53:49] Cool. I did that on production and saw that I missed something
[09:53:56] Updated the patch to fix that.
[09:54:55] (Merged) jenkins-bot: Notify managers via email when new permission requests are made. [software/bitu] - https://gerrit.wikimedia.org/r/1073238 (owner: Slyngshede)
[09:55:24] Tested the query again and it seems to be working now. Thanks for the advice.
[09:55:27] Dreamy_Jazz: great. pcc/test experimental is not useful here, so I think that test is all we can do, plus checking with some other db experts
[09:55:46] Sure. Would it be helpful to get a review from someone else on my team?
[09:56:15] one more pair of eyes won't hurt :)
[09:56:50] (CR) Volans: [C:+2] netbox: notify dcops for uncommitted DNS changes [puppet] - https://gerrit.wikimedia.org/r/1073732 (owner: Volans)
[09:57:55] SRE: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066 (fgiunchedi) NEW
[09:57:58] (PS1) Effie Mouzeli: ipoid: Set activeDeadlineSeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1073748 (https://phabricator.wikimedia.org/T374414)
[09:58:10] (PS1) Tiziano Fogli: grafana: cluster name misc to grafana [puppet] - https://gerrit.wikimedia.org/r/1073749 (https://phabricator.wikimedia.org/T375066)
[09:58:26] (Abandoned) Effie Mouzeli: ipoid: Set activeDeadlineSeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: Kosta Harlan)
[09:59:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 40s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:59:24] (CR) FNegri: [C:+1] "LGTM, but I'd like a +1 from Data Persistence and Data Engineering too. According to https://wikitech.wikimedia.org/wiki/Portal:Data_Servi" [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: Dreamy Jazz)
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1000)
[10:02:05] (CR) Kosta Harlan: [C:+1] [WikiReplicas] Hide autoblock targets in the globalblocks table [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: Dreamy Jazz)
[10:02:23] (PS1) Volans: re.switchdc.databases.prepare: reduce wait time [cookbooks] - https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351)
[10:02:30] (CR) Lucas Werkmeister (WMDE): Check that throttling exceptions use valid public IP addresses (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: Lucas Werkmeister (WMDE))
[10:02:44] (PS6) Lucas Werkmeister (WMDE): Check that throttling exceptions use valid public IP addresses [mediawiki-config] - https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980)
[10:03:12] elukey: yeah, ok from my side
[10:03:32] super thanks!
[10:04:53] (CR) Lucas Werkmeister (WMDE): Check that throttling exceptions use valid public IP addresses (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: Lucas Werkmeister (WMDE))
[10:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 10.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:05:44] (PS1) Elukey: role::puppetserver: set the maximum number of instances [puppet] - https://gerrit.wikimedia.org/r/1073751 (https://phabricator.wikimedia.org/T373527)
[10:05:55] (CR) TrainBranchBot: [C:+2] "Approved by elukey@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[10:06:39] (Merged) jenkins-bot: Swap poolcounter2004 with poolcounter2006 [mediawiki-config] - https://gerrit.wikimedia.org/r/1073427 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[10:06:41] (PS2) Effie Mouzeli: ipoid: update to job 3.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1073443 (https://phabricator.wikimedia.org/T356885)
[10:06:58] (PS2) Effie Mouzeli: ipoid: Set activeDeadlineSeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1073748 (https://phabricator.wikimedia.org/T374414)
[10:07:14] (CR) Arnaudb: "totally optional comments, lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[10:07:16] !log elukey@deploy1003 Started scap sync-world: Backport for [[gerrit:1073427|Swap poolcounter2004 with poolcounter2006 (T332015)]]
[10:07:20] T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015
[10:08:28] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: Lucas Werkmeister (WMDE))
[10:08:42] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/Wikibase] (wmf/1.43.0-wmf.23) - https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: Lucas Werkmeister (WMDE))
[10:08:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/Wikibase] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1073479 (https://phabricator.wikimedia.org/T373088) (owner: Lucas Werkmeister (WMDE))
[10:09:33] (CR) Hnowlan: changeprop: Enable PCS pregeneration without restbase (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: Jgiannelos)
[10:09:34] !log elukey@deploy1003 elukey: Backport for [[gerrit:1073427|Swap poolcounter2004 with poolcounter2006 (T332015)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:09:40] !log elukey@deploy1003 elukey: Continuing with sync
[10:09:58] SRE, iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10156425 (Dreamy_Jazz) >>! In T375006#10156378, @jijiki wrote: > @Dreamy_Jazz I see these are SQL connection timeouts. While I dig into it, could you please let us know if that is impacting...
[10:10:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:13:21] (PS2) Volans: re.switchdc.databases.prepare: reduce wait time [cookbooks] - https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351)
[10:14:01] SRE, Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10156434 (fgiunchedi)
[10:14:13] (CR) Filippo Giunchedi: [C:+1] grafana: cluster name misc to grafana [puppet] - https://gerrit.wikimedia.org/r/1073749 (https://phabricator.wikimedia.org/T375066) (owner: Tiziano Fogli)
[10:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:14:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 0s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:14:24] !log elukey@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073427|Swap poolcounter2004 with poolcounter2006 (T332015)]] (duration: 07m 08s)
[10:14:29] T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015
[10:14:47] (CR)
Tiziano Fogli: [C:+2] grafana: cluster name misc to grafana [puppet] - https://gerrit.wikimedia.org/r/1073749 (https://phabricator.wikimedia.org/T375066) (owner: Tiziano Fogli)
[10:18:16] SRE, Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10156444 (fgiunchedi)
[10:18:17] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 41.25s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:18:33] PROBLEM - poolcounter on poolcounter2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd https://www.mediawiki.org/wiki/PoolCounter
[10:18:49] this is me -^
[10:19:02] the host is not serving anything atm, but restarting poolcounterd failed
[10:19:05] PROBLEM - Poolcounter connection on poolcounter2003 is CRITICAL: connect to address 10.192.0.132 and port 7531: Connection refused https://www.mediawiki.org/wiki/PoolCounter
[10:19:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 8.333% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:20:08] RECOVERY - Poolcounter connection on poolcounter2003 is OK: TCP OK - 0.001 second response time on 10.192.0.132 port 7531 https://www.mediawiki.org/wiki/PoolCounter
[10:20:34] RECOVERY - poolcounter on poolcounter2003 is OK: PROCS OK: 1 process with command name poolcounterd https://www.mediawiki.org/wiki/PoolCounter
[10:20:34] !log restart poolcounterd on poolcounter2003 (not serving any traffic atm, tried to clear old/stale conns)
[10:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:10] SRE, observability, Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10156447 (fgiunchedi)
[10:21:14] SRE, observability, Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10156448 (fgiunchedi)
[10:22:43] (CR) Arnaudb: [C:+1] re.switchdc.databases.prepare: reduce wait time [cookbooks] - https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[10:25:46] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye
[10:27:06] (PS1) Slyngshede: Context Processor: Check for signed in users before running processor. [software/bitu] - https://gerrit.wikimedia.org/r/1073752
[10:27:44] (CR) Muehlenhoff: [C:+1] "Sounds good, let's give it a shot. We'll refresh puppetserver2003 in the forthcoming quarter and we'll buy it with 128G instead of 64, so " [puppet] - https://gerrit.wikimedia.org/r/1073751 (https://phabricator.wikimedia.org/T373527) (owner: Elukey)
[10:27:56] (PS2) Slyngshede: Context Processor: Check for signed in users before running processor. [software/bitu] - https://gerrit.wikimedia.org/r/1073752
[10:28:17] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 45s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:28:22] FIRING: [4x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:45] FIRING: [4x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:30:37] RESOLVED: [4x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:31:34] (CR) Slyngshede: [C:+2] Context Processor: Check for signed in users before running processor. [software/bitu] - https://gerrit.wikimedia.org/r/1073752 (owner: Slyngshede)
[10:34:11] (Merged) jenkins-bot: Context Processor: Check for signed in users before running processor.
[software/bitu] - https://gerrit.wikimedia.org/r/1073752 (owner: Slyngshede)
[10:41:26] (PS3) Slyngshede: Audit log for permission requests validation. [software/bitu] - https://gerrit.wikimedia.org/r/1071849
[10:44:48] jouncebot: nowandnext
[10:44:48] For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1000)
[10:44:48] In 0 hour(s) and 15 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1100)
[10:46:58] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:47:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:47:48] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:47:50] (CR) Jcrespo: [C:+1] re.switchdc.databases.prepare: reduce wait time [cookbooks] - https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[10:50:13] (PS1) Dreamy Jazz: Hooks: Re-order checks to verify that request user is same as Special:Contributions user [extensions/ContentTranslation] (wmf/1.43.0-wmf.23) - https://gerrit.wikimedia.org/r/1073755 (https://phabricator.wikimedia.org/T375061)
[10:50:15] (CR) Volans: [C:+2] re.switchdc.databases.prepare: reduce wait time (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[10:52:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/ContentTranslation] (wmf/1.43.0-wmf.23) - https://gerrit.wikimedia.org/r/1073755 (https://phabricator.wikimedia.org/T375061) (owner: Dreamy Jazz)
[10:52:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:54:52] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye
[10:55:09] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:56:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 53.03s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:58:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1100).
[11:01:02] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#10156519 (MoritzMuehlenhoff)
[11:01:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 7.812s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:04:08] (Merged) jenkins-bot: re.switchdc.databases.prepare: reduce wait time [cookbooks] - https://gerrit.wikimedia.org/r/1073750 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[11:07:39] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:08:17] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:09:23] (PS1) Volans: sre.switchdc.databases: fix Phabricator message [cookbooks] - https://gerrit.wikimedia.org/r/1073757 (https://phabricator.wikimedia.org/T371351)
[11:09:42] ops-eqiad, SRE, DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T374901#10156547
(10MoritzMuehlenhoff) Great, many thanks! I'll rebuild the RAID and then I'll add the server back to active duty. Hopefully it works now for longer than a week :-) [11:12:00] (03PS2) 10Btullis: Add a cephosd cluster and assign it to the appropriate hosts [puppet] - 10https://gerrit.wikimedia.org/r/1073434 (https://phabricator.wikimedia.org/T374932) [11:12:46] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4012/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073434 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [11:14:57] Anyone using this window? [11:15:14] Would like to see if I can backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1073755 which should resolve a train blocker [11:15:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:16:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073755 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz) [11:16:15] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:16:29] (03PS1) 10JMeybohm: Fix ferm_status to actually compare rules [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) [11:16:38] (03CR) 10Jcrespo: [C:03+1] "https://phabricator.wikimedia.org/T374972#10156561" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073757 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [11:16:40] (03PS1) 10Muehlenhoff: Disable memcached ticket registry [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1073761 (https://phabricator.wikimedia.org/T367487) [11:18:11] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/cas-overlay-template] - 
10https://gerrit.wikimedia.org/r/1073761 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [11:18:54] (03PS2) 10JMeybohm: Fix ferm_status to actually compare rules [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) [11:23:14] !log Deploying refinery [11:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:40] !log tchin@deploy1003 Started deploy [analytics/refinery@bc0be94]: Regular analytics weekly train [analytics/refinery@bc0be94a] [11:23:54] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Disable memcached ticket registry [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1073761 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [11:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:24:16] (03CR) 10Volans: [C:03+2] sre.switchdc.databases: fix Phabricator message [cookbooks] - 10https://gerrit.wikimedia.org/r/1073757 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [11:27:03] (03PS1) 10Volans: sre.switchdc.databses: test on test-s4 section [cookbooks] - 10https://gerrit.wikimedia.org/r/1073762 [11:27:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:57] (03CR) 10Volans: [C:04-2] "DO NOT MERGE, just for testing purposed with test-cookbook for testing on test-s4 section" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073762 (owner: 10Volans) [11:30:45] (03PS2) 10Volans: 
sre.switchdc.databses: test on test-s4 section [cookbooks] - 10https://gerrit.wikimedia.org/r/1073762 [11:32:47] !log tchin@deploy1003 Finished deploy [analytics/refinery@bc0be94]: Regular analytics weekly train [analytics/refinery@bc0be94a] (duration: 09m 06s) [11:33:18] !log tchin@deploy1003 Started deploy [analytics/refinery@bc0be94] (thin): Regular analytics weekly train THIN [analytics/refinery@bc0be94a] [11:34:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:37:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:08] !log tchin@deploy1003 Finished deploy [analytics/refinery@bc0be94] (thin): Regular analytics weekly train THIN [analytics/refinery@bc0be94a] (duration: 05m 50s) [11:39:32] (03Merged) 10jenkins-bot: sre.switchdc.databases: fix Phabricator message [cookbooks] - 10https://gerrit.wikimedia.org/r/1073757 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [11:39:34] !log tchin@deploy1003 Started deploy [analytics/refinery@bc0be94] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bc0be94a] [11:43:31] !log tchin@deploy1003 Finished deploy [analytics/refinery@bc0be94] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bc0be94a] (duration: 03m 57s) [11:43:51] !log update pfw3-codfw dhcp-relay target 0 T375011 [11:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:12] (03PS3) 10Anzx: Lift IP cap on 
2024-10-07/08 for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073586 (https://phabricator.wikimedia.org/T374964) [11:45:06] (03CR) 10CI reject: [V:04-1] sre.switchdc.databses: test on test-s4 section [cookbooks] - 10https://gerrit.wikimedia.org/r/1073762 (owner: 10Volans) [11:45:59] (03Merged) 10jenkins-bot: Hooks: Re-order checks to verify that request user is same as Special:Contributions user [extensions/ContentTranslation] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073755 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz) [11:46:18] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1073755|Hooks: Re-order checks to verify that request user is same as Special:Contributions user (T375061)]] [11:46:22] T375061: InvalidArgumentException: Invalid username: - https://phabricator.wikimedia.org/T375061 [11:47:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073586 (https://phabricator.wikimedia.org/T374964) (owner: 10Anzx) [11:48:28] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1073755|Hooks: Re-order checks to verify that request user is same as Special:Contributions user (T375061)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:50:48] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [11:52:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:35] (03PS1) 10Hnowlan: shellbox-video: bypass mesh temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073770 (https://phabricator.wikimedia.org/T373517) [11:53:39] (03PS1) 
10Dreamy Jazz: Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073771 (https://phabricator.wikimedia.org/T375061) [11:53:58] (03CR) 10Dreamy Jazz: [C:03+2] Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073771 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz) [11:54:07] (03PS1) 10Dreamy Jazz: Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073772 (https://phabricator.wikimedia.org/T375061) [11:54:14] (03CR) 10Dreamy Jazz: [C:03+2] Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073772 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz) [11:54:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073770 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [11:55:09] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:55:21] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073755|Hooks: Re-order checks to verify that request user is same as Special:Contributions user (T375061)]] (duration: 09m 03s) [11:55:25] (03PS1) 10Brouberol: airflow: define an internal service name for the scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073773 (https://phabricator.wikimedia.org/T375072) [11:55:25] T375061: InvalidArgumentException: Invalid username: - https://phabricator.wikimedia.org/T375061 [11:55:41] 
(03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073772 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz) [11:55:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073771 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz) [12:02:22] (03PS1) 10Muehlenhoff: profile::idp::build: Readd rsync service [puppet] - 10https://gerrit.wikimedia.org/r/1073775 (https://phabricator.wikimedia.org/T367487) [12:04:54] (03Merged) 10jenkins-bot: Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073771 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz) [12:04:56] (03Merged) 10jenkins-bot: Allow IP ranges in CentralAuth::getInstanceByName() [extensions/CentralAuth] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073772 (https://phabricator.wikimedia.org/T375061) (owner: 10Dreamy Jazz) [12:05:18] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1073772|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]], [[gerrit:1073771|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]] [12:05:22] T375061: InvalidArgumentException: Invalid username: - https://phabricator.wikimedia.org/T375061 [12:07:35] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1073772|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]], [[gerrit:1073771|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:07:42] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [12:08:59] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
an-worker1177.eqiad.wmnet with OS bullseye [12:09:19] (03CR) 10Brouberol: [C:03+1] Add a cephosd cluster and assign it to the appropriate hosts [puppet] - 10https://gerrit.wikimedia.org/r/1073434 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [12:09:32] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: ensure each cloudnativePG cluster is assigned a unique s3 bucket (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073446 (https://phabricator.wikimedia.org/T374938) (owner: 10Brouberol) [12:10:04] !log Deployed refinery using scap, then deployed onto hdfs [12:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:37] (03PS1) 10Filippo Giunchedi: hiera: set cluster for insetup roles [puppet] - 10https://gerrit.wikimedia.org/r/1073776 (https://phabricator.wikimedia.org/T375066) [12:12:19] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073772|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]], [[gerrit:1073771|Allow IP ranges in CentralAuth::getInstanceByName() (T375061)]] (duration: 07m 00s) [12:12:24] T375061: InvalidArgumentException: Invalid username: - https://phabricator.wikimedia.org/T375061 [12:12:31] Done my deploys for the train blocker [12:12:58] (03Abandoned) 10Cathal Mooney: Validate port block speed combo in server provision script for QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930264 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [12:13:55] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:14:12] (03CR) 10Brouberol: ds8-k8s-service: add values for dumps2 job. 
(032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [12:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:14:50] Dreamy_Jazz: thanks for deploying the fix! [12:15:01] Np! [12:15:05] I'll roll forward the train in ~5 mins [12:15:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 1m 52s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:17:14] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529#10156770 (10cmooney) 05Open→03Resolved Validator is working well to prevent any mis-match, and automation is configuring things correc...
[12:18:13] !log tchin@deploy1003 Started deploy [airflow-dags/analytics@e6cc31a]: Regular analytics weekly train [12:18:56] !log tchin@deploy1003 Finished deploy [airflow-dags/analytics@e6cc31a]: Regular analytics weekly train (duration: 01m 18s) [12:19:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:19:57] (03CR) 10Muehlenhoff: "Good catch! One comment inline, looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:20:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 1m 52s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:21:05] (03CR) 10Gmodena: ds8-k8s-service: add values for dumps2 job. 
(032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070597 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [12:21:09] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073777 (https://phabricator.wikimedia.org/T373642) [12:21:11] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073777 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot) [12:21:22] !log tchin@deploy1003 Started deploy [airflow-dags/analytics_test@e6cc31a]: Regular analytics weekly train [12:21:39] !log tchin@deploy1003 Finished deploy [airflow-dags/analytics_test@e6cc31a]: Regular analytics weekly train (duration: 00m 20s) [12:21:58] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073777 (https://phabricator.wikimedia.org/T373642) (owner: 10TrainBranchBot) [12:23:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073541 (https://phabricator.wikimedia.org/T374372) (owner: 10C. Scott Ananian) [12:23:37] (03PS3) 10C. 
Scott Ananian: Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073541 (https://phabricator.wikimedia.org/T374372) [12:28:55] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.23 refs T373642 [12:29:00] T373642: 1.43.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T373642 [12:30:04] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1073775 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [12:33:19] !log uploaded cas 7.0.4.1+wmf12u3 T367487 [12:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:25] T367487: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487 [12:34:42] (03PS3) 10DCausse: wdqs categories: ship lastUpdated metric [puppet] - 10https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [12:35:16] (03CR) 10DCausse: "uploaded I828464daf76c9384545f2071963751effd5247cf and marked it as dependency" [puppet] - 10https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [12:36:41] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10156844 (10MoritzMuehlenhoff) ganeti2017 and ganeti2026 are drained [12:41:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:43:11] (03CR) 10Muehlenhoff: [C:03+2] profile::idp::build: Readd rsync service [puppet] - 10https://gerrit.wikimedia.org/r/1073775 
(https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [12:44:00] !log uploaded purged 0.23 to bullseye-wikimedia (apt.wm.o) - T334078 [12:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:05] T334078: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 [12:44:22] (03PS2) 10Arturo Borrero Gonzalez: cloud: codfw1dev: have a new bastion host in bastion-codfw1dev-04 [puppet] - 10https://gerrit.wikimedia.org/r/1073205 (https://phabricator.wikimedia.org/T374828) [12:46:11] !log rolling upgrade to purged 0.23 in A:cp-ulsfo - T334078 [12:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 49.68s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:50:37] (03PS4) 10Slyngshede: Audit log for permission requests validation. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1071849 [12:51:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 10% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:52:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:32] (03CR) 10Btullis: [C:03+1] airflow: define an internal service name for the scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073773 (https://phabricator.wikimedia.org/T375072) (owner: 10Brouberol) [12:53:49] (03PS1) 10Muehlenhoff: idp::build: Remove duplicate repository config [puppet] - 10https://gerrit.wikimedia.org/r/1073788 [12:54:00] (03PS1) 10KartikMistry: Updated cxserver to 2024-09-18-104433-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073789 (https://phabricator.wikimedia.org/T375017) [12:54:09] (03CR) 10Btullis: [V:03+1 C:03+2] Add a cephosd cluster and assign it to the appropriate hosts [puppet] - 10https://gerrit.wikimedia.org/r/1073434 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [12:54:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 16.25s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:54:55] (03PS1) 10Dreamy Jazz: Hide temp account IP address viewing 
right from non-temp account wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073790 (https://phabricator.wikimedia.org/T369187) [12:54:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:55:16] (03PS1) 10Muehlenhoff: Failover idp-test [dns] - 10https://gerrit.wikimedia.org/r/1073791 [12:55:21] (03PS2) 10Dreamy Jazz: Hide temp account IP address viewing right from non-temp account wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073790 (https://phabricator.wikimedia.org/T369187) [12:55:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073790 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz) [12:56:04] (03CR) 10Alexandros Kosiaris: [C:03+1] "Sigh, good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [12:56:09] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:56:44] (03PS5) 10Slyngshede: Audit log for permission requests validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1071849 [12:58:53] (03CR) 10Slyngshede: [C:03+1] "Looks good, forgot about that." [puppet] - 10https://gerrit.wikimedia.org/r/1073788 (owner: 10Muehlenhoff) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I ❤ Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1300).
[13:00:05] sergi0, Lucas_WMDE, Dreamy_Jazz, anzx, hnowlan, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:12] \o [13:00:19] hi [13:00:55] (03CR) 10Elukey: [C:03+2] role::puppetserver: set the maximum number of instances [puppet] - 10https://gerrit.wikimedia.org/r/1073751 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey) [13:01:37] (03CR) 10Ssingh: [C:03+1] Failover idp-test [dns] - 10https://gerrit.wikimedia.org/r/1073791 (owner: 10Muehlenhoff) [13:02:07] hi, Lucas patch for "Check that throttling exceptions use valid public IP addresses" can be merged out of the window ( https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073487 ) [13:02:32] jouncebot: next [13:02:32] In 0 hour(s) and 57 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1400) [13:02:37] (03CR) 10Cyndywikime: [C:03+1] "LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073739 (https://phabricator.wikimedia.org/T374577) (owner: 10Sergio Gimeno) [13:02:45] and there are too many patches for this one hour window, so that is definitely going to be extended [13:03:04] (03CR) 10Dreamy Jazz: [C:03+2] GrowthExperiments: enable Community Updates module in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073739 (https://phabricator.wikimedia.org/T374577) (owner: 10Sergio Gimeno) [13:03:10] I can deploy [13:03:25] my change can't be tested on testservers and so can just go straight to prod [13:03:44] (03CR) 10Dreamy Jazz: [C:03+2] Hide temp account IP address viewing right from non-temp account wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073790 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz) [13:03:45] (03Merged) 10jenkins-bot: GrowthExperiments: enable Community Updates module in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073739 (https://phabricator.wikimedia.org/T374577) (owner: 10Sergio Gimeno) [13:04:05] anzx: Are you here for the window? [13:04:20] Lucas_WMDE: Do you want me to deploy your changes? 
[13:04:30] (03Merged) 10jenkins-bot: Hide temp account IP address viewing right from non-temp account wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073790 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz) [13:04:42] anzx change can be deployed as is, there is not much testing we can do for throttling :) [13:05:03] (03CR) 10Hashar: [C:03+1] Lift IP cap on 2024-10-07/08 for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073586 (https://phabricator.wikimedia.org/T374964) (owner: 10Anzx) [13:05:25] We could also merge that test to make sure the new patch works :) [13:05:47] (03CR) 10Dreamy Jazz: [C:03+2] Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [13:05:48] well the test simply cover there are no private IP used [13:05:55] Sure. [13:05:55] Dreamy_Jazz: o/ [13:06:06] (03CR) 10Dreamy Jazz: [C:03+2] Lift IP cap on 2024-10-07/08 for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073586 (https://phabricator.wikimedia.org/T374964) (owner: 10Anzx) [13:06:07] what I wonder is whether `scap backport` can deploy both changes at the same time [13:06:26] (03CR) 10Alexandros Kosiaris: [C:03+1] shellbox-video: bypass mesh temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073770 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:06:28] I would have thought so, but I can test that. 
[13:06:31] (03Merged) 10jenkins-bot: Check that throttling exceptions use valid public IP addresses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073487 (https://phabricator.wikimedia.org/T374980) (owner: 10Lucas Werkmeister (WMDE)) [13:06:45] (03Merged) 10jenkins-bot: Lift IP cap on 2024-10-07/08 for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073586 (https://phabricator.wikimedia.org/T374964) (owner: 10Anzx) [13:06:57] (03CR) 10Dreamy Jazz: [C:03+2] shellbox-video: bypass mesh temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073770 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:07:06] i'm here [13:07:23] (03CR) 10Dreamy Jazz: [C:03+2] Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073541 (https://phabricator.wikimedia.org/T374372) (owner: 10C. Scott Ananian) [13:07:38] (03Merged) 10jenkins-bot: shellbox-video: bypass mesh temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073770 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:07:44] Lucas_WMDE: Are you around for your wmf backports? [13:08:05] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073541 (https://phabricator.wikimedia.org/T374372) (owner: 10C. Scott Ananian) [13:08:22] (03CR) 10Ssingh: "This is ready for review from Traffic." [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [13:08:37] (03PS1) 10Klausman: aptrepo: Add ROCm61 component for ML-Labs machines [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) [13:08:43] (03CR) 10Ssingh: "I mean from our perspective this is ready for review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [13:09:21] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:09:24] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1073739|GrowthExperiments: enable Community Updates module in testwiki (T374577)]], [[gerrit:1073487|Check that throttling exceptions use valid public IP addresses (T374980)]], [[gerrit:1073790|Hide temp account IP address viewing right from non-temp account wikis (T369187)]], [[gerrit:1073586|Lift IP cap on 2024-10-07/08 for edit-a-thon (T374964)]] [13:09:25] , [[gerrit:1073770|shellbox-video: bypass mesh temporarily (T373517)]], [[gerrit:1073541|Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage (T374372)]] [13:09:29] As lucas hasn't said they are around, I'm going to proceed with all but the wmf backports [13:09:33] T374577: Community Updates module: Release to Test Wikipedia - https://phabricator.wikimedia.org/T374577 [13:09:33] T374980: Enforce exclusion of private IP addresses from $wmgThrottlingExceptions in CI - https://phabricator.wikimedia.org/T374980 [13:09:33] T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187 [13:09:34] T374964: Lift IP cap on this dates 2024-10-07/08 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T374964 [13:09:34] T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517 [13:09:34] T374372: Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage (week of Sep 16) - https://phabricator.wikimedia.org/T374372 [13:09:49] They would take longer to merge, so we can always come back to them later [13:11:53] !log dreamyjazz@deploy1003 sgimeno, anzx, lucaswerkmeister-wmde, cscott, hnowlan, dreamyjazz: Backport for [[gerrit:1073739|GrowthExperiments: enable 
Community Updates module in testwiki (T374577)]], [[gerrit:1073487|Check that throttling exceptions use valid public IP addresses (T374980)]], [[gerrit:1073790|Hide temp account IP address viewing right from non-temp account wikis (T369187)]], [[gerrit:1073586|Lift IP cap on [13:11:53] 2024-10-07/08 for edit-a-thon (T374964)]], [[gerrit:1073770|shellbox-video: bypass mesh temporarily (T373517)]], [[gerrit:1073541|Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage (T374372)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:56] cscott: sergi0: Please test your changes (if any testing is required). [13:12:08] Let me know if you don't need to test it. [13:12:12] Nothing testable, I'll check in testwiki [13:12:19] i can check that the defaults changed, hang on [13:13:07] dammit, I forgot I scheduled patches for this window [13:13:20] unfortunately I also have a meeting with some WMF folks in a few minutes [13:13:26] so I think I’ll just pass and try to deploy my patches another time, sorry [13:13:50] No problem. I've merged the patch to test the IP addresses, but left the others. [13:14:01] thanks! [13:14:50] Dreamy_Jazz: ok, checked & verified. looks good! [13:14:59] Thanks. [13:15:11] RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:15:23] sergi0: Okay to proceed on your patch given that nothing is testable? [13:15:32] yes [13:15:48] My change is a no-op and I've tested that it doesn't break anything, so proceeding. 
[13:15:50] !log dreamyjazz@deploy1003 sgimeno, anzx, lucaswerkmeister-wmde, cscott, hnowlan, dreamyjazz: Continuing with sync [13:18:16] !log restart puppetserver on puppetserver1002 - thrashing - T373527 [13:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:21] T373527: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527 [13:19:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:19:50] wikitech.wikimedia.org seems to redirect to foundation.wikimedia.org for me. is that a known thing? [13:20:08] Yes, because you have the debug extension set to enabled [13:20:09] (03PS3) 10Tiziano Fogli: icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438 [13:20:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:20:23] Dreamy_Jazz: oh, that's a "feature"? 
[13:20:33] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073739|GrowthExperiments: enable Community Updates module in testwiki (T374577)]], [[gerrit:1073487|Check that throttling exceptions use valid public IP addresses (T374980)]], [[gerrit:1073790|Hide temp account IP address viewing right from non-temp account wikis (T369187)]], [[gerrit:1073586|Lift IP cap on 2024-10-07/08 for edit-a-thon (T374964)] [13:20:33] ], [[gerrit:1073770|shellbox-video: bypass mesh temporarily (T373517)]], [[gerrit:1073541|Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage (T374372)]] (duration: 11m 08s) [13:20:40] T374577: Community Updates module: Release to Test Wikipedia - https://phabricator.wikimedia.org/T374577 [13:20:40] thanks Dreamy_Jazz! [13:20:40] T374980: Enforce exclusion of private IP addresses from $wmgThrottlingExceptions in CI - https://phabricator.wikimedia.org/T374980 [13:20:40] T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187 [13:20:41] T374964: Lift IP cap on this dates 2024-10-07/08 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T374964 [13:20:41] T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517 [13:20:41] T374372: Deploy Parsoid Read Views to fa/nl/pl/pt/uk wikivoyage (week of Sep 16) - https://phabricator.wikimedia.org/T374372 [13:21:40] (03CR) 10Bking: flink-app: customize calico label selector (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [13:22:00] If you disable the debug extension and clear your cache it should fix it [13:22:31] (03PS4) 10Tiziano Fogli: icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438 [13:23:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions 
(k8s) 2m 5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:24:02] (03PS5) 10Tiziano Fogli: icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438 [13:25:08] !log Afternoon UTC backport window done [13:25:09] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10157009 (10elukey) puppetserver1002 is now running with 35 JRuby workers instead of 48, let's see how it goes at steady state. If everything... [13:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:30] (03CR) 10Filippo Giunchedi: [C:03+1] icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438 (owner: 10Tiziano Fogli) [13:25:40] \o/ [13:25:48] Dreamy_Jazz: thanks! [13:25:48] thanks for deploying Dreamy_Jazz! [13:26:04] Thank you @Dreamy_Jazz [13:26:04] :D [13:26:44] (03CR) 10Tiziano Fogli: [C:03+2] icinga: add Tiziano Fogli to ctrl variables [puppet] - 10https://gerrit.wikimedia.org/r/1060438 (owner: 10Tiziano Fogli) [13:26:47] The issue with the incorrect redirect should be fixed in a few weeks once wikitech.wikimedia.org is part of the production cluster. 
[13:27:55] (03PS2) 10Elukey: Swap poolcounter1004 with poolcounter1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073502 (https://phabricator.wikimedia.org/T332015) [13:28:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 37.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:28:25] oh wikifunctions again [13:28:28] hey folks, since the UTC backport is done I am going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073502 if nobody opposes [13:28:33] elukey: please hold [13:28:39] sure [13:28:41] there are some more patches :) [13:29:17] Lucas_WMDE: you aren't pushing the termbox updates? [13:29:23] I’m in a meeting [13:29:25] elukey: my bad, go ahead, all the config patches got pushed [13:29:27] maybe I’ll do them later [13:29:32] super thanks! [13:29:33] Lucas_WMDE: we can do it together after your meeting :] [13:29:44] ok ^^ [13:29:51] I should be free in 30 minutes from now [13:29:53] just poke me when you are done [13:29:58] ok, thanks! 
[13:29:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by elukey@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073502 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [13:30:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:31:00] (03Merged) 10jenkins-bot: Swap poolcounter1004 with poolcounter1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073502 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [13:31:21] !log elukey@deploy1003 Started scap sync-world: Backport for [[gerrit:1073502|Swap poolcounter1004 with poolcounter1006 (T332015)]] [13:31:25] T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015 [13:31:53] I've got a general question about prometheus stats, is this a good place to ask it? [13:32:23] q is: how do i test a new metric locally? i'd like something which just dumped the metric to a log somewhere so I could verify it was being generated correctly. [13:32:54] (03CR) 10Brouberol: [C:03+2] airflow: define an internal service name for the scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073773 (https://phabricator.wikimedia.org/T375072) (owner: 10Brouberol) [13:33:01] i think i'd set up a statsd server locally at one point, but my new metrics don't have "backward-compatible" statsd names. [13:33:33] $wgStatsdServer is documented, but no mention of prometheus in MainConfigSchema.php ? 
[13:33:40] I’ve used `nc -ukl 8125` before (listen on the statsd port, dump to stdout) [13:33:42] !log elukey@deploy1003 elukey: Backport for [[gerrit:1073502|Swap poolcounter1004 with poolcounter1006 (T332015)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:34:00] !log elukey@deploy1003 elukey: Continuing with sync [13:34:27] yeah, but i'm not calling ::copyToStatsdAt() for these, so I don't think they are going to show up on statsd [13:35:15] I see :/ [13:35:45] 06SRE, 10iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10157069 (10jijiki) @Dreamy_Jazz I have updated the [[ https://grafana-rw.wikimedia.org/d/6C9Bm6uVz/ipoid?orgId=1 | Grafana dashboard ]], to include any metrics emitted by envoy. Do you have a... [13:36:55] FWIW, I think #wikimedia-observability is the channel where I got some pretty good help on statslib-related questions before [13:37:03] irc or slack? [13:37:10] IRC [13:37:14] ok, thanks! [13:37:19] right, slack reuses the #, I forgot ^^ [13:37:31] 06SRE, 10observability, 13Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10157088 (10fgiunchedi) [13:38:36] !log elukey@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073502|Swap poolcounter1004 with poolcounter1006 (T332015)]] (duration: 07m 15s) [13:38:41] T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015 [13:40:35] (03CR) 10Elukey: "Looks good! Since this is a big jump, have you tried to install the packages on a Debian Bookworm container (or similar)? 
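[Editor's aside: the `nc -ukl 8125` trick discussed above (listen on the standard statsd UDP port, dump metric lines to stdout) can also be done with a few lines of Python, which behaves the same across platforms where `nc` flags vary. This is a sketch, not anything from the log; the port and packet count are illustrative defaults.]

```python
# Throwaway statsd sniffer: bind a UDP port and print every metric line
# received, roughly what `nc -ukl 8125` gives you. Useful for eyeballing
# that new metrics are emitted with the expected names before they are
# wired up to Prometheus.
import socket


def dump_statsd(host="127.0.0.1", port=8125, max_packets=1):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    received = []
    try:
        for _ in range(max_packets):
            data, _addr = sock.recvfrom(65535)
            # statsd packets carry newline-separated "name:value|type" lines
            for line in data.decode("utf-8", "replace").splitlines():
                print(line)
                received.append(line)
    finally:
        sock.close()
    return received
```

Point `$wgStatsdServer` (or whatever emits the metrics) at `127.0.0.1:8125` and run this in another terminal; each metric shows up as a `name:value|type` line as it arrives.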
I am wondering i" [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman) [13:42:22] (03PS2) 10Elukey: Swap poolcounter1005 with poolcounter1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073503 (https://phabricator.wikimedia.org/T332015) [13:43:09] 06SRE, 10iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10157120 (10Dreamy_Jazz) >>! In T375006#10157069, @jijiki wrote: > @Dreamy_Jazz I have updated the [[ https://grafana-rw.wikimedia.org/d/6C9Bm6uVz/ipoid?orgId=1 | Grafana dashboard ]], to inc... [13:44:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by elukey@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073503 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [13:44:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:46:06] (03Merged) 10jenkins-bot: Swap poolcounter1005 with poolcounter1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073503 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [13:46:26] (03CR) 10Klausman: "My private workstation has a 7900XTX (gfx1100 from generation pov) GPU, and is running Trixie (-> kernel version). 
I created a chroot usin" [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman) [13:46:27] !log elukey@deploy1003 Started scap sync-world: Backport for [[gerrit:1073503|Swap poolcounter1005 with poolcounter1007 (T332015)]] [13:46:32] 06SRE, 10iPoid-Service: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10157135 (10Dreamy_Jazz) Looking at the data that is now in Grafana (thanks for doing that btw :D ), it seems that the server is responding with 500 errors when these connection timeouts occu... [13:46:33] T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015 [13:48:35] !log elukey@deploy1003 elukey: Backport for [[gerrit:1073503|Swap poolcounter1005 with poolcounter1007 (T332015)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:49:06] (03CR) 10Elukey: "Super, seems perfect! I noticed that you added the new component under bullseye-wikimedia, should it be bookworm-wikimedia?" [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman) [13:49:16] !log elukey@deploy1003 elukey: Continuing with sync [13:50:27] (03PS3) 10Brouberol: airflow: allow the webserver and scheduler to be deployed or not [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) [13:52:07] (03PS4) 10Brouberol: airflow: allow the webserver and scheduler to be selectively deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) [13:52:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:53:21] (03CR) 10Herron: "nice one thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1073749 (https://phabricator.wikimedia.org/T375066) (owner: 10Tiziano Fogli) [13:53:30] (03PS10) 10Bking: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [13:53:50] (03PS5) 10Brouberol: airflow: allow the webserver and scheduler to be selectively deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) [13:53:51] !log elukey@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073503|Swap poolcounter1005 with poolcounter1007 (T332015)]] (duration: 07m 23s) [13:53:55] T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015 [13:55:16] (03PS1) 10Ssingh: haproxy: switch order of TLS1.3 ciphers [puppet] - 10https://gerrit.wikimedia.org/r/1073798 (https://phabricator.wikimedia.org/T365327) [13:56:10] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:56:13] (03PS2) 10Klausman: aptrepo: Add ROCm61 component for ML-Labs machines [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) [13:56:31] (03CR) 10Klausman: "Ah, the Bullseye bit was my bad. Fixed!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman) [13:56:52] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4013/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073798 (https://phabricator.wikimedia.org/T365327) (owner: 10Ssingh) [13:57:18] (03PS11) 10Bking: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [13:57:35] (03CR) 10Bking: flink-app: customize calico label selector (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1400) [14:00:46] (03CR) 10Elukey: [C:03+1] "Left a nit, once fixed feel free to merge!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman) [14:03:08] (03PS1) 10Elukey: services: remove old poolcounter nodes from MW's net policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073802 (https://phabricator.wikimedia.org/T332015) [14:06:48] (03PS3) 10Klausman: aptrepo: Add ROCm61 component for ML-Labs machines [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) [14:07:00] (03CR) 10Klausman: aptrepo: Add ROCm61 component for ML-Labs machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman) [14:07:29] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:07:46] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:07:58] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: cloudgw: add support and enable IPv6 - https://phabricator.wikimedia.org/T374716#10157199 (10aborrero) p:05Triage→03Medium [14:08:00] 06SRE, 10Observability-Metrics, 13Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10157200 (10lmata) [14:08:06] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10157201 (10aborrero) p:05Triage→03Medium [14:08:12] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: verify security groups settings for IPv6 - https://phabricator.wikimedia.org/T374714#10157202 (10aborrero) p:05Triage→03Medium [14:08:26] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10157203 (10aborrero) p:05Triage→03Medium [14:13:38] 10ops-eqiad, 06cloud-services-team, 06DC-Ops: 2024-08-31 cloudvirt1048 NodeDown 
because memory hardware error - https://phabricator.wikimedia.org/T373740#10157238 (10aborrero) p:05Triage→03Medium [14:13:43] 10ops-eqiad, 06cloud-services-team, 06DC-Ops: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10157235 (10aborrero) hey @VRiley-WMF could you please advice what should we do with the memory error in this server? [14:14:29] (03PS12) 10Bking: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [14:14:34] (03CR) 10Bking: flink-app: customize calico label selector (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [14:16:38] (03PS1) 10Ssingh: wikidough: change order of TLS1.3 cipher suites [puppet] - 10https://gerrit.wikimedia.org/r/1073803 (https://phabricator.wikimedia.org/T365327) [14:17:34] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4014/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073803 (https://phabricator.wikimedia.org/T365327) (owner: 10Ssingh) [14:18:48] (03CR) 10Ssingh: [V:03+1 C:03+2] wikidough: change order of TLS1.3 cipher suites [puppet] - 10https://gerrit.wikimedia.org/r/1073803 (https://phabricator.wikimedia.org/T365327) (owner: 10Ssingh) [14:19:14] (03PS4) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [14:19:24] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:19:46] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:21:05] (03PS5) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 
(https://phabricator.wikimedia.org/T372284) [14:22:40] (03CR) 10CI reject: [V:04-1] cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol) [14:23:47] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:24:00] !log run puppet agent on A:wikidough [14:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:33] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:25:59] (03PS6) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) [14:26:45] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [14:32:52] (03CR) 10CI reject: [V:04-1] sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [14:33:08] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:33:20] ^ expected, rolling restarts of Wikimedia DNS [14:33:41] (03PS3) 10EoghanGaffney: contint: switch java_home from jdk-11 to jdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [14:33:43] (03PS1) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) [14:34:08] (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [14:34:10] 06SRE-OnFire, 
06cloud-services-team, 10Cloud-VPS, 10Sustainability (Incident Followup): openstack: create a cookbook to inject commands to VMs via console at scale - https://phabricator.wikimedia.org/T347683#10157384 (10aborrero) p:05Triage→03Low [14:35:44] (03PS2) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) [14:36:06] (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [14:36:46] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:38:00] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 10Observability-Alerting, 10Sustainability (Incident Followup): monitoring: find out how we could have been paged for outage "Multiple CloudVPS instances lost their IPs" - https://phabricator.wikimedia.org/T347694#10157385 (10dcaro) 05Open→03Resolv... 
[14:38:08] RECOVERY - MD RAID on puppetmaster1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:38:34] (03PS3) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) [14:38:56] (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [14:39:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:17] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough [14:42:46] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:44:23] (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [14:44:54] (03PS5) 10Andrea Denisse: alert: Resolve alerts DNS queries to alert1002 [dns] - 10https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418) [14:45:02] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:45:07] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:46:30] (03PS4) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) [14:46:32] jouncebot: 
nowandnext [14:46:32] For the next 0 hour(s) and 13 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1400) [14:46:32] In 0 hour(s) and 13 minute(s): Alert hosts failover to alert1002 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1500) [14:46:44] hashar: I totally forgot to ping you, sorry [14:46:51] (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [14:46:59] probably not a good time right now, don’t think we want to cut into the alert hosts failover window [14:46:59] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:47:04] (and CI is definitely going to take more than 13 minutes) [14:47:09] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:48:24] (03PS5) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) [14:48:46] (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [14:49:00] Lucas_WMDE: no worries, we can do both at the same time? 
[14:49:38] ah alert host grr [14:49:54] (03CR) 10Vgutierrez: [C:03+1] haproxy: switch order of TLS1.3 ciphers [puppet] - 10https://gerrit.wikimedia.org/r/1073798 (https://phabricator.wikimedia.org/T365327) (owner: 10Ssingh) [14:49:55] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:50:22] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:50:34] (03PS6) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) [14:50:44] (03CR) 10Filippo Giunchedi: [C:03+1] alert: Resolve alerts DNS queries to alert1002 [dns] - 10https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [14:50:47] Lucas_WMDE: then that termbox patch is a frontend fix isn't it? My guess is we can merge both and deploy after alert has been switched [14:50:53] hashar: actually, now that group1 is on wmf.23, I guess the wmf.22 backport can already be discarded anyway [14:50:56] (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [14:51:03] it should only affect the frontend yeah [14:51:09] I don’t think we load any PHP code from that submodule [14:51:17] 06SRE, 06cloud-services-team, 06Traffic, 13Patch-For-Review: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463#10157462 (10joanna_borun) p:05Triage→03Low [14:52:01] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 55655 [14:52:07] hmm and somehow the gate pipeline worked yesterday but the change did not merge bah [14:52:19] I removed the +2s because the scap backport had already died [14:52:25] (due to the failed test builds I think) [14:52:35] lets +2 the wmf.23 one [14:52:35] 
(03PS7) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) [14:52:39] (though at the time I thought one of the failed builds was a gate-and-submit build. I didn’t know scap backport also died on failed test builds) [14:52:40] (03CR) 10Hashar: [C:03+2] Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [14:52:43] ok! [14:52:50] PROBLEM - poolcounter on poolcounter1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd https://www.mediawiki.org/wiki/PoolCounter [14:52:57] (03CR) 10CI reject: [V:04-1] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [14:53:00] 06SRE, 06cloud-services-team, 06Traffic, 13Patch-For-Review: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463#10157468 (10dcaro) Still some stuff to be changed: https://codesearch.wmcloud.org/search/?q=labweb [14:53:27] (03Abandoned) 10Lucas Werkmeister (WMDE): Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1073479 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [14:53:32] I abandoned the wmf.22 one [14:53:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 55655 [14:53:37] (can still be restored if needed ^^) [14:53:50] RECOVERY - poolcounter on poolcounter1004 is OK: PROCS OK: 1 process with command name poolcounterd https://www.mediawiki.org/wiki/PoolCounter [14:54:01] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:54:19] !log dcausse@deploy1003 helmfile [staging] DONE 
helmfile.d/services/rdf-streaming-updater: apply [14:54:28] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:54:33] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:55:52] (03PS4) 10Hashar: contint: switch Jenkins to Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [14:55:53] !log restart poolcounter on poolcounter100[4,5] (depooled nodes) to clear old/stale TCP conns for port 7531 [14:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:11] (03CR) 10Hashar: [C:03+1] "Eoghan and I will deploy it on Thursday 19 Sep at 8:30 UTC." [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [14:58:31] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#10157485 (10Scott_French) Thanks for taking a look, Riccardo. I should mention, this isn't blocking anything on our end, as I can always do somet... [15:00:05] denisse and godog: It is that lovely time of the day again! You are hereby commanded to deploy Alert hosts failover to alert1002. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1500). [15:00:13] godog: Ready! 
[15:00:22] denisse: sweet, same [15:00:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:50] !log Disable meta-monitoring for the alert hosts - T372418 [15:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:54] T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418 [15:01:40] (03PS8) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) [15:01:44] !log Make alert1002 the active host - T372418 [15:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:02] (03CR) 10Andrea Denisse: [C:03+2] alert: Failover from alert2002 to alert1002 [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [15:02:05] (03CR) 10Bking: [C:03+2] flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [15:02:12] (03PS7) 10Andrea Denisse: alert: Failover from alert2002 to alert1002 [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) [15:02:23] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4022/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [15:02:27] (03CR) 10Bking: [C:03+2] "self-merging based on verbal approval during pairing session" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 
(https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [15:02:27] !log sudo cumin "A:cp" 'disable-puppet "merging CR 1073798"': T365327 [15:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:36] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] alert: Failover from alert2002 to alert1002 [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [15:03:26] (03Merged) 10jenkins-bot: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [15:03:40] (03CR) 10Ssingh: [V:03+1 C:03+2] haproxy: switch order of TLS1.3 ciphers [puppet] - 10https://gerrit.wikimedia.org/r/1073798 (https://phabricator.wikimedia.org/T365327) (owner: 10Ssingh) [15:03:50] (03PS1) 10Elukey: services: update Tegola's Docker image to pick up package upgrades [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073818 (https://phabricator.wikimedia.org/T373976) [15:03:58] <_joe_> !log uploading conftool 3.2.4 to apt T375059 [15:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:03] T375059: Requestctl sync writes unchanged objects - https://phabricator.wikimedia.org/T375059 [15:06:18] denisse: can you let me know once alert has been switched over? I will deploy a MediaWiki update once you are done :) [15:06:48] hashar: For sure, I'll let you know, thank you. 
[15:07:15] Lucas_WMDE: of course something unrelated exploded :/ [15:07:33] (03CR) 10JHathaway: [C:03+2] tftpboot: purge old files [puppet] - 10https://gerrit.wikimedia.org/r/1073532 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [15:07:53] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10157535 (10VRiley-WMF) a:03VRiley-WMF [15:08:22] (03CR) 10Andrea Denisse: [C:03+2] alert: Resolve alerts DNS queries to alert1002 [dns] - 10https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [15:08:44] !log Resolve alerts DNS queries to alert1002 - T372418 [15:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:51] T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418 [15:11:47] (03CR) 10CI reject: [V:04-1] Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [15:11:58] :( [15:12:13] maybe cause tests are running in parallel [15:12:25] this one looks familiar [15:12:46] aha, https://phabricator.wikimedia.org/T374912 [15:12:54] that was indeed related to the parallel tests (IIUC) [15:13:12] FIRING: [3x] JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:13:18] FIRING: [2x] ProbeDown: Service puppetmaster2001:8140 has failed probes (http_puppetmaster2001_codfw_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:13:22] I wonder if it’s flaky or if that CheckUser fix needs to be backported for the backport to pass [15:13:35] oh great [15:13:45] well I guess I can backport the fix :) [15:13:53] FIRING: [2x] ProbeDown: Service puppetmaster2001:8141 has failed probes (http_puppetmaster2001_codfw_wmnet_backend_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:14:06] <_joe_> uh [15:14:07] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2013.codfw.wmnet [15:14:22] maybe it would be good to pause all activity for a while [15:14:27] RESOLVED: JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:14:27] and ignore the alerts [15:14:39] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2013.codfw.wmnet [15:14:43] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloud: codfw1dev: have a new bastion host in bastion-codfw1dev-04 [puppet] - 10https://gerrit.wikimedia.org/r/1073205 (https://phabricator.wikimedia.org/T374828) (owner: 10Arturo Borrero Gonzalez) [15:14:46] <_joe_> well it's hard to ignore alerts [15:14:49] the switching of the alerting server seems a big deal to me [15:14:49] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2014.codfw.wmnet [15:14:50] (03PS1) 10Hashar: Add scope to temporary users created by populate tables test [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073823 (https://phabricator.wikimedia.org/T374912) [15:14:56] FIRING: [2x] 
ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:25] (03PS2) 10Hashar: Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [15:15:26] FIRING: [2x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:26] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2014.codfw.wmnet [15:15:36] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2024.codfw.wmnet [15:15:42] FIRING: JobUnavailable: Reduced availability for job icinga-am in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:09] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2024.codfw.wmnet [15:16:19] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2048.codfw.wmnet [15:16:41] (03CR) 10Hashar: [C:03+2] "Retrying due to `SpecialCentralAuthTest::testViewForExistingGlobalTemporaryAccount` failing to find `centralauth-admin-info-expired` / T37" [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 
(https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [15:16:52] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2048.codfw.wmnet [15:17:03] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2049.codfw.wmnet [15:17:19] (03CR) 10Hashar: [C:03+2] "Cherry picked to let us backport the Wikibase change https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1073478" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073823 (https://phabricator.wikimedia.org/T374912) (owner: 10Hashar) [15:17:35] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2049.codfw.wmnet [15:17:46] <_joe_> is it expected that so many probes would fail? [15:17:46] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2050.codfw.wmnet [15:18:03] Lucas_WMDE: thanks for having found the CheckUser fix. I have backported it / +2ed it for wmf.23 and made your Wikibase change depend on it and +2ed it as well. We will see! 
[15:18:12] RESOLVED: [3x] JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:18:13] FIRING: [16x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:18] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2050.codfw.wmnet [15:18:29] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2051.codfw.wmnet [15:18:47] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on vrts2002.codfw.wmnet with reason: Migration [15:18:58] hashar: thanks! [15:19:01] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on vrts2002.codfw.wmnet with reason: Migration [15:19:05] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2051.codfw.wmnet [15:19:14] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:19:15] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2444.codfw.wmnet [15:19:37] Thanks for backporting that fix. 
[15:19:39] !log rolling out TLS1.3 cipher suite priority order change CR 1073798 to all cp hosts [15:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:43] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:19:48] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2444.codfw.wmnet [15:19:52] jouncebot: nowandnext [15:19:52] For the next 0 hour(s) and 40 minute(s): Alert hosts failover to alert1002 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1500) [15:19:52] In 1 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700) [15:19:58] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2445.codfw.wmnet [15:20:20] !log aokoth@cumin1002 START - Cookbook sre.hosts.remove-downtime for vrts2002.codfw.wmnet [15:20:20] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for vrts2002.codfw.wmnet [15:20:20] Just checking the calendar for any free spots in 30 mins or so [15:20:31] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2445.codfw.wmnet [15:20:42] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Migration [15:20:46] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Migration [15:20:56] (03PS1) 10Gerrit maintenance bot: Add nr to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1073827 (https://phabricator.wikimedia.org/T375087) [15:21:07] Dreamy_Jazz: this is probably not the best time to deploy [15:21:08] !log Enable metamonitoring for the alert1002, and alert2002 hosts - T372418 [15:21:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:12] T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418 [15:21:33] Sure. Would it be better once the current window is over? [15:21:46] godog: I think we're done, everything looks good to me. What do you think? [15:21:59] Dreamy_Jazz: Please give me a couple of minutes and I'll let you know once we're done. [15:22:09] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:22:16] That's fine. I'm definitely going to wait until the Wikibase change is backported. [15:22:26] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:22:46] denisse: sgtm [15:23:34] Yes, I think we're done. [15:23:47] hashar Dreamy_Jazz You can deploy now, thanks for your patience. [15:24:23] thx! [15:24:35] (03PS9) 10Btullis: Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) [15:25:12] godog: I think we can proceed with the decommission of the old hosts now, I already have patches for that. https://phabricator.wikimedia.org/T372607 [15:25:16] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4024/co" [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [15:25:25] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4023/console" [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman) [15:25:30] Or we could wait for next week if that's more appropriate. 
[15:25:36] !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [15:25:50] denisse: yeah let's wait a few days, I'll be reviewing your patches [15:26:04] !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [15:26:05] (03CR) 10Klausman: [V:03+1 C:03+2] aptrepo: Add ROCm61 component for ML-Labs machines [puppet] - 10https://gerrit.wikimedia.org/r/1073794 (https://phabricator.wikimedia.org/T375076) (owner: 10Klausman) [15:26:23] godog: Thank you, I'll also double check them to ensure everything is correct. [15:26:28] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2446.codfw.wmnet [15:26:35] 06SRE, 10SRE-Access-Requests: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10157756 (10Vgutierrez) p:05Triage→03Medium [15:26:45] (03CR) 10Volans: "nit/question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [15:27:01] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2446.codfw.wmnet [15:27:02] (03CR) 10Kosta Harlan: [C:03+1] Add scope to temporary users created by populate tables test [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073823 (https://phabricator.wikimedia.org/T374912) (owner: 10Hashar) [15:27:11] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2447.codfw.wmnet [15:27:43] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2447.codfw.wmnet [15:27:54] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2448.codfw.wmnet [15:28:30] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2448.codfw.wmnet [15:28:40] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2449.codfw.wmnet [15:28:41] 06SRE, 
10SRE-Access-Requests: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10157766 (10Vgutierrez) 05Open→03Stalled a:03Vgutierrez per data.yaml we need approval from @odimitrijevic / @Milimetric / @WDoranWMF / @Ahoelzl / @Ottomata (one of them is enough) [15:28:48] (03Abandoned) 10Ladsgroup: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1040887 (https://phabricator.wikimedia.org/T367020) (owner: 10Gerrit maintenance bot) [15:29:01] (03CR) 10Ladsgroup: [C:03+2] Add nr to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1073827 (https://phabricator.wikimedia.org/T375087) (owner: 10Gerrit maintenance bot) [15:29:16] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2449.codfw.wmnet [15:29:27] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2450.codfw.wmnet [15:30:00] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2450.codfw.wmnet [15:30:10] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2451.codfw.wmnet [15:30:12] (03CR) 10Dzahn: [C:03+1] "you beat me in the race, patch already open :p" [dns] - 10https://gerrit.wikimedia.org/r/1073827 (https://phabricator.wikimedia.org/T375087) (owner: 10Gerrit maintenance bot) [15:30:17] !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [15:30:29] !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [15:30:43] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2451.codfw.wmnet [15:46:01] (03PS3) 10Vgutierrez: admin: Grant cyndywikime shell and analytics_privatedata_users access [puppet] - 10https://gerrit.wikimedia.org/r/1073834 (https://phabricator.wikimedia.org/T375060) [15:46:36] (03PS1) 10Jdlrobson: Limit quick surveys to wikis with messages 
defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073839 (https://phabricator.wikimedia.org/T374654) [15:46:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073839 (https://phabricator.wikimedia.org/T374654) (owner: 10Jdlrobson) [15:46:55] (03CR) 10Brouberol: [C:03+1] Add rclone to db1208 for testing s3 -> local backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [15:47:38] (03CR) 10Dzahn: [C:03+1] "thanks for adding that comment. and user details look all good to me. just needs the approval." [puppet] - 10https://gerrit.wikimedia.org/r/1073834 (https://phabricator.wikimedia.org/T375060) (owner: 10Vgutierrez) [15:48:10] (03PS1) 10Hnowlan: Apply videoscaler request limits and wall clock time limits to shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) [15:48:25] (03CR) 10Btullis: [V:03+1 C:03+2] Add rclone to db1208 for testing s3 -> local backups [puppet] - 10https://gerrit.wikimedia.org/r/1073810 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [15:48:26] (03CR) 10Ssingh: sre.cdn.pdns-recursor: add rolling restart script (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [15:48:42] Lucas_WMDE: patches are almost merged [15:48:51] (03CR) 10CI reject: [V:04-1] Apply videoscaler request limits and wall clock time limits to shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [15:48:53] * hashar grabs chocolate [15:49:55] (03Merged) 10jenkins-bot: Add scope to temporary users created by populate tables test [extensions/CheckUser] 
(wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073823 (https://phabricator.wikimedia.org/T374912) (owner: 10Hashar) [15:49:58] (03Merged) 10jenkins-bot: Update termbox (mul support) [extensions/Wikibase] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073478 (https://phabricator.wikimedia.org/T373088) (owner: 10Lucas Werkmeister (WMDE)) [15:50:33] yay [15:51:07] so hmm [15:51:30] oh my god [15:51:37] I do a git remote update on the deployment server and... [15:51:40] fatal: exec 'rev-list': cd to 'view/lib/wikibase-termbox' failed: No such file or directory [15:52:31] oh wrong branch [15:53:25] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10157879 (10Vgutierrez) [15:54:09] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10157890 (10Vgutierrez) SSH key has been confirmed out of band [15:54:14] !log hashar@deploy1003 Started scap sync-world: Update termbox (mul support) - T373088 [15:54:19] T373088: [MUL] placeholder labels not appearing on mobile - https://phabricator.wikimedia.org/T373088 [15:54:34] (03PS2) 10Hnowlan: Apply videoscaler request limits and wall clock time limits to shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) [15:55:02] (03PS3) 10Hnowlan: Apply videoscaler request limits and wall clock time limits to shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073840 (https://phabricator.wikimedia.org/T373517) [15:55:05] (03CR) 10Alexandros Kosiaris: "Wording nitpick, but the rest LGTM. 
Feel free to merge" [puppet] - 10https://gerrit.wikimedia.org/r/1073838 (https://phabricator.wikimedia.org/T348876) (owner: 10Elukey) [15:55:38] !log deploy python3-setuptools upgrades fleetwide [15:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:14] (03CR) 10Btullis: [V:03+1 C:03+2] Permit db1208 to access the Ceph/S3 endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1073837 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [15:58:09] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: Move servers in codfw rack D5 [15:58:12] (03PS2) 10Elukey: profile::docker::reporter: fix k8s_rules.ini [puppet] - 10https://gerrit.wikimedia.org/r/1073838 (https://phabricator.wikimedia.org/T348876) [15:58:32] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: Move servers in codfw rack D5 [15:58:43] (03CR) 10Elukey: [C:03+2] profile::docker::reporter: fix k8s_rules.ini (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1073838 (https://phabricator.wikimedia.org/T348876) (owner: 10Elukey) [15:58:43] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10157915 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7e878ed4-7126-4f45-87aa-d1087aacf81a) set by cmooney@cumin100... 
[16:00:27] !log moving servers in codfw rack D5 from asw-d5-codfw to lsw1-d5-codfw T373104 [16:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:42] T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104 [16:01:03] !log hashar@deploy1003 Finished scap sync-world: Update termbox (mul support) - T373088 (duration: 06m 48s) [16:01:16] T373088: [MUL] placeholder labels not appearing on mobile - https://phabricator.wikimedia.org/T373088 [16:01:24] Lucas_WMDE: I think I have deployed it [16:01:44] no test servers? [16:02:07] since there were two patches I went to do a submodule update and `scap sync-world` [16:02:15] which well yeah, deploys straight to everything [16:02:16] :/ [16:02:23] `scap backport` would’ve supported that AFAIK [16:02:30] you can specify more than one URL (or patch number) [16:02:38] but anyway, with ?debug=2 the new JS code seems to work \o/ [16:02:51] * hashar blames cache [16:02:55] (ah, and with ?action=purge too) [16:02:56] great! thank you for the verification [16:03:04] thanks for deploying! [16:03:22] with https://m.wikidata.org/wiki/Q42?q=SELECT%20*; it works as well [16:03:55] so that is cached in the frontend cache [16:04:38] (03CR) 10Dzahn: "I am a bit conflicted here. We actually did not see matching throttle events in the dashboards after all. 
It seems like it could have also" [puppet] - 10https://gerrit.wikimedia.org/r/1073740 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [16:05:26] (03PS1) 10Bking: rdf-streaming-updater: remove references to old-style network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073842 (https://phabricator.wikimedia.org/T373195) [16:06:13] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:25:00 on 24 hosts with reason: Move servers in codfw rack D6 [16:06:27] (03PS15) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [16:06:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on 24 hosts with reason: Move servers in codfw rack D6 [16:06:45] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10157936 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9cef1cb8-6d99-4d39-b2db-e242da2fe3f6) set by cmooney@cumin100... 
[16:07:10] !log moving servers in codfw rack D6 from asw-d6-codfw to lsw1-d6-codfw T373104 [16:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:14] T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104 [16:07:24] (03CR) 10Bking: [C:03+2] rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [16:08:21] (03CR) 10Dzahn: "If we merge this now we might end up in situation where it doesn't happen again but we never know why and if it was an unrelated glitch or" [puppet] - 10https://gerrit.wikimedia.org/r/1073740 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [16:10:11] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10157954 (10phaultfinder) [16:12:45] (03CR) 10Ssingh: sre.dns.admin: add guardrails for depool of sites/resources (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [16:14:15] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:03] (03PS1) 10Dreamy Jazz: Autopromote users into checkuser-temporary-account-viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073844 (https://phabricator.wikimedia.org/T369187) [16:15:21] jouncebot: nowandnext [16:15:21] No deployments scheduled for the next 0 hour(s) and 44 minute(s) [16:15:21] In 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700) [16:21:28] (03PS2) 10Dreamy Jazz: Autopromote users into checkuser-temporary-account-viewer 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073844 (https://phabricator.wikimedia.org/T369187) [16:21:48] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10157981 (10cmooney) All hosts moved and responding to ping again. Thanks all for the help! [16:23:14] (03CR) 10Volans: "clarifications inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [16:23:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69262 and previous config saved to /var/cache/conftool/dbconfig/20240918-162316-arnaudb.json [16:23:21] T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104 [16:23:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69263 and previous config saved to /var/cache/conftool/dbconfig/20240918-162321-arnaudb.json [16:23:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69264 and previous config saved to /var/cache/conftool/dbconfig/20240918-162326-arnaudb.json [16:23:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69265 and previous config saved to /var/cache/conftool/dbconfig/20240918-162331-arnaudb.json [16:23:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69266 and previous config saved to /var/cache/conftool/dbconfig/20240918-162341-arnaudb.json [16:23:47] jouncebot: nowandnext [16:23:47] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [16:23:47] In 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700) [16:23:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69267 and previous config saved to /var/cache/conftool/dbconfig/20240918-162346-arnaudb.json [16:23:51] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2013.codfw.wmnet [16:23:51] Going to deploy now [16:23:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69268 and previous config saved to /var/cache/conftool/dbconfig/20240918-162351-arnaudb.json [16:23:53] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2013.codfw.wmnet [16:23:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69269 and previous config saved to /var/cache/conftool/dbconfig/20240918-162357-arnaudb.json [16:24:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69270 and previous config saved to /var/cache/conftool/dbconfig/20240918-162401-arnaudb.json [16:24:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69271 and previous config saved to /var/cache/conftool/dbconfig/20240918-162406-arnaudb.json [16:24:22] (03CR) 10Ssingh: sre.dns.admin: add guardrails for depool of sites/resources (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [16:25:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073844 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz) [16:25:36] !log swfrench@cumin2002 START - 
Cookbook sre.k8s.pool-depool-node pool for host kubernetes2014.codfw.wmnet [16:25:38] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2014.codfw.wmnet [16:25:52] (03Merged) 10jenkins-bot: Autopromote users into checkuser-temporary-account-viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073844 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz) [16:25:53] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10158024 (10ABran-WMF) nodes repooling, haproxy reloaded, thanks for the update @cmooney @Ladsgroup I'll get to T375050 [16:25:54] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2024.codfw.wmnet [16:25:56] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2024.codfw.wmnet [16:26:12] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2048.codfw.wmnet [16:26:14] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1073844|Autopromote users into checkuser-temporary-account-viewer (T369187 T327913)]] [16:26:14] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2048.codfw.wmnet [16:26:23] T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187 [16:26:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T375050 [16:26:24] T327913: Assign checkuser-temporary-account right to various groups - https://phabricator.wikimedia.org/T327913 [16:26:30] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2049.codfw.wmnet [16:26:31] T375050: Switchover s7 
master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050 [16:26:32] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2049.codfw.wmnet [16:26:47] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2050.codfw.wmnet [16:26:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T375050 [16:26:49] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2050.codfw.wmnet [16:27:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2218 with weight 0 T375050', diff saved to https://phabricator.wikimedia.org/P69272 and previous config saved to /var/cache/conftool/dbconfig/20240918-162703-arnaudb.json [16:27:05] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2051.codfw.wmnet [16:27:07] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2051.codfw.wmnet [16:27:23] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2444.codfw.wmnet [16:27:25] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2444.codfw.wmnet [16:27:40] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2445.codfw.wmnet [16:27:42] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2445.codfw.wmnet [16:27:58] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2446.codfw.wmnet [16:28:00] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2446.codfw.wmnet [16:28:04] (03PS5) 10Ssingh: sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 [16:28:15] !log 
swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2447.codfw.wmnet [16:28:18] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2447.codfw.wmnet [16:28:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2218 from API/vslow/dump T375050', diff saved to https://phabricator.wikimedia.org/P69273 and previous config saved to /var/cache/conftool/dbconfig/20240918-162822-arnaudb.json [16:28:31] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1073844|Autopromote users into checkuser-temporary-account-viewer (T369187 T327913)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:28:33] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2448.codfw.wmnet [16:28:35] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2448.codfw.wmnet [16:28:50] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2449.codfw.wmnet [16:28:52] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2449.codfw.wmnet [16:29:08] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2450.codfw.wmnet [16:29:10] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2450.codfw.wmnet [16:29:26] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2451.codfw.wmnet [16:29:28] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2451.codfw.wmnet [16:29:44] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2016.codfw.wmnet [16:29:46] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2016.codfw.wmnet [16:29:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:30:01] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2017.codfw.wmnet [16:30:03] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2017.codfw.wmnet [16:32:14] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1073699 (https://phabricator.wikimedia.org/T375050) (owner: 10Gerrit maintenance bot) [16:33:25] !log Starting s7 codfw failover from db2220 to db2218 - T375050 [16:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:30] T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050 [16:34:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2218 to s7 primary T375050', diff saved to https://phabricator.wikimedia.org/P69274 and previous config saved to /var/cache/conftool/dbconfig/20240918-163404-arnaudb.json [16:35:32] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [16:36:16] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10158110 (10phaultfinder) [16:36:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 T375050', diff saved to https://phabricator.wikimedia.org/P69275 and previous config saved to /var/cache/conftool/dbconfig/20240918-163637-arnaudb.json [16:37:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 5%: T375050', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240918-163721-arnaudb.json [16:38:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69276 and previous config saved to 
/var/cache/conftool/dbconfig/20240918-163822-arnaudb.json [16:38:27] T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104 [16:38:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69277 and previous config saved to /var/cache/conftool/dbconfig/20240918-163827-arnaudb.json [16:38:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69278 and previous config saved to /var/cache/conftool/dbconfig/20240918-163832-arnaudb.json [16:38:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69279 and previous config saved to /var/cache/conftool/dbconfig/20240918-163837-arnaudb.json [16:38:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69280 and previous config saved to /var/cache/conftool/dbconfig/20240918-163847-arnaudb.json [16:38:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69281 and previous config saved to /var/cache/conftool/dbconfig/20240918-163852-arnaudb.json [16:38:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69282 and previous config saved to /var/cache/conftool/dbconfig/20240918-163857-arnaudb.json [16:39:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69283 and previous config saved to /var/cache/conftool/dbconfig/20240918-163902-arnaudb.json [16:39:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69284 and previous config 
saved to /var/cache/conftool/dbconfig/20240918-163907-arnaudb.json [16:40:21] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073844|Autopromote users into checkuser-temporary-account-viewer (T369187 T327913)]] (duration: 14m 06s) [16:40:26] T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187 [16:40:26] T327913: Assign checkuser-temporary-account right to various groups - https://phabricator.wikimedia.org/T327913 [16:42:50] !log sudo cumin "A:cp" 'disable-puppet "merging CR 1073453"': T347114 [16:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:54] T347114: NetworkProbeLimit cookie for Probenet overwritten on every link hover event - https://phabricator.wikimedia.org/T347114 [16:43:19] (03PS6) 10Ssingh: sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 [16:43:47] (03CR) 10Ssingh: "CI failure expected as US_DATACENTERS does not exist in the currently deployed version of wmflib. 
We will recheck but I wanted to get this" [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [16:45:38] (03PS2) 10AOkoth: wmnet: change ticket to vrts1003 [dns] - 10https://gerrit.wikimedia.org/r/1073490 (https://phabricator.wikimedia.org/T373420) [16:45:42] (03CR) 10Ssingh: [C:03+2] NetworkProbeLimit Cookie: avoid nop re-set-cookie [puppet] - 10https://gerrit.wikimedia.org/r/1073453 (https://phabricator.wikimedia.org/T347114) (owner: 10BBlack) [16:46:40] we are debugging why sirenbot is doing that [16:46:48] for some reason it can't write to the local sqlite db [16:46:55] oh good catch, not sure [16:46:59] but there seems to be no reason for that [16:47:12] same permissions compared to other host [16:47:14] same sqlite and all [16:47:41] so it joins, reads the channel topic, wants to write it to sqlite and fails [16:47:53] the db file is there but empty.. [16:50:03] there are no tables mutante, probably it needs them [16:50:07] error="no such table: topics" [16:50:37] I guess it needed some schema pre-loaded in the db or some init to call or carry over the pre-existing db in another host [16:50:50] * volans not familiar with it, so just mentioning common scenarios [16:50:52] volans: good find, thanks. So I guess we could just copy the file from the other host.. but it's still a mystery since denisse reports they didn't have to do that last time and it all just worked without doing that [16:51:06] the "maybe it needs an init somehow" was already a guess [16:51:43] What makes this weird is that we didn't have to copy the DB when we failed over to alert2002 last week. [16:51:52] So I'm not sure what the root cause of the issue is.
[16:52:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 10%: T375050', diff saved to https://phabricator.wikimedia.org/P69285 and previous config saved to /var/cache/conftool/dbconfig/20240918-165232-arnaudb.json [16:52:38] T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050 [16:53:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69286 and previous config saved to /var/cache/conftool/dbconfig/20240918-165327-arnaudb.json [16:53:32] T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104 [16:53:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69287 and previous config saved to /var/cache/conftool/dbconfig/20240918-165332-arnaudb.json [16:53:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69288 and previous config saved to /var/cache/conftool/dbconfig/20240918-165337-arnaudb.json [16:53:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69289 and previous config saved to /var/cache/conftool/dbconfig/20240918-165344-arnaudb.json [16:53:51] denisse: is there some sync mechanism that keeps the db in sync between hosts? 
or was it tested on that host so that a local db was already there, [16:53:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69290 and previous config saved to /var/cache/conftool/dbconfig/20240918-165352-arnaudb.json [16:53:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69291 and previous config saved to /var/cache/conftool/dbconfig/20240918-165357-arnaudb.json [16:54:02] or just a puppetization error that did it there but not here [16:54:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69292 and previous config saved to /var/cache/conftool/dbconfig/20240918-165403-arnaudb.json [16:54:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69293 and previous config saved to /var/cache/conftool/dbconfig/20240918-165407-arnaudb.json [16:54:11] we already deleted the empty db file and let puppet run again [16:54:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69294 and previous config saved to /var/cache/conftool/dbconfig/20240918-165412-arnaudb.json [16:54:28] but yea, I guess let's just copy it regardless [16:54:41] why not keep the topic data, right?
[16:55:19] (03PS3) 10JMeybohm: Fix ferm_status to actually compare rules [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) [16:55:36] (03CR) 10JMeybohm: Fix ferm_status to actually compare rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [16:55:39] the real question seems what initially creates the tables [16:56:22] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [16:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:56:50] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [16:56:52] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [16:56:56] (03CR) 10CI reject: [V:04-1] sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [16:57:28] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [16:58:20] (03CR) 10CI reject: [V:04-1] Fix ferm_status to actually compare rules [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [16:59:33] volans: There doesn't seem to be a sync mechanism between them. When we failed over to alert2002 last week (same setup) the issue didn't happen, the tables were created correctly and the DB was populated with data. 
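[editor's note] The `no such table: topics` failure discussed above is the classic symptom of an application opening a fresh (or freshly created, zero-byte) SQLite file without ever running its schema DDL: an empty file is a valid database with no tables, so every write fails until something initializes it. A minimal sketch of the "create schema on open" guard, in Python; the actual bot is not this code, and the column layout is an assumption (only the table name `topics` is known from the error message).

```python
import sqlite3

# Hypothetical schema: only the table name "topics" appears in the bot's
# error output; the columns here are purely illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS topics (
    channel TEXT PRIMARY KEY,
    topic   TEXT NOT NULL
)
"""


def open_topic_db(path):
    """Open the bot's SQLite DB, creating the schema if the file is new.

    An empty .db file is a valid SQLite database with zero tables, so
    without this guard every write fails with "no such table: topics".
    """
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)  # no-op when the table already exists
    conn.commit()
    return conn


def save_topic(conn, channel, topic):
    # Upsert, so re-reading the channel topic on every join is idempotent.
    conn.execute(
        "INSERT INTO topics (channel, topic) VALUES (?, ?) "
        "ON CONFLICT(channel) DO UPDATE SET topic = excluded.topic",
        (channel, topic),
    )
    conn.commit()
```

With a guard like this, deleting the empty db file and restarting would recreate the schema instead of requiring a copy from the other alert host; it does not explain why alert2002 worked without one last week.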
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700) [17:02:24] !log copied vopsbot.db from alert1001 to alert1002; restarted vopsbot [17:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:21] looks like it's working. swfrench-wmf appears in the topic db [17:06:03] (03CR) 10Bking: [C:03+1] wdqs max lag: break up extremely long line [alerts] - 10https://gerrit.wikimedia.org/r/1073534 (owner: 10Ryan Kemper) [17:07:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 15%: T375050', diff saved to https://phabricator.wikimedia.org/P69296 and previous config saved to /var/cache/conftool/dbconfig/20240918-170738-arnaudb.json [17:07:43] T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050 [17:08:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69297 and previous config saved to /var/cache/conftool/dbconfig/20240918-170833-arnaudb.json [17:08:38] T373104: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104 [17:08:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69298 and previous config saved to /var/cache/conftool/dbconfig/20240918-170838-arnaudb.json [17:08:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2140 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69299 and previous config saved to /var/cache/conftool/dbconfig/20240918-170843-arnaudb.json [17:08:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69300 and previous config saved to /var/cache/conftool/dbconfig/20240918-170849-arnaudb.json [17:08:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 
'db2193 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69301 and previous config saved to /var/cache/conftool/dbconfig/20240918-170858-arnaudb.json [17:09:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69302 and previous config saved to /var/cache/conftool/dbconfig/20240918-170903-arnaudb.json [17:09:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69303 and previous config saved to /var/cache/conftool/dbconfig/20240918-170909-arnaudb.json [17:09:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69304 and previous config saved to /var/cache/conftool/dbconfig/20240918-170913-arnaudb.json [17:09:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69305 and previous config saved to /var/cache/conftool/dbconfig/20240918-170918-arnaudb.json [17:14:11] jouncebot: nowandnext [17:14:11] For the next 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700) [17:14:11] In 0 hour(s) and 45 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1800) [17:17:12] (03PS1) 10Dreamy Jazz: Revert^2 "Create group for assigning checkuser-temporary-account right" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073853 (https://phabricator.wikimedia.org/T369187) [17:17:24] (03CR) 10Dreamy Jazz: [C:03+2] Revert^2 "Create group for assigning checkuser-temporary-account right" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073853 (https://phabricator.wikimedia.org/T369187) (owner: 
10Dreamy Jazz) [17:17:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073853 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz) [17:19:51] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:20:22] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:22:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 25%: T375050', diff saved to https://phabricator.wikimedia.org/P69306 and previous config saved to /var/cache/conftool/dbconfig/20240918-172243-arnaudb.json [17:22:49] T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050 [17:24:12] (03CR) 10Bking: airflow: allow the webserver and scheduler to be selectively deployed (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) (owner: 10Brouberol) [17:25:40] (03PS1) 10Ssingh: varnish: fix regex for NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/1073854 [17:25:54] (03PS1) 10Xcollazo: Declare stream 'mediawiki.dump.revision_history.reconcile.v1.rc0' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) [17:26:30] (03CR) 10Ssingh: [C:03+2] varnish: fix regex for NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/1073854 (owner: 10Ssingh) [17:29:16] !log re-enable puppet on A:cp to finish rolling out T368755 [17:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:20] T368755: Python job that reads from wmf_dumps.wikitext_inconsistent_row and calls EventGate - https://phabricator.wikimedia.org/T368755 [17:29:42] that's the wrong one 
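[editor's note] The repeated dbctl commits above walk db2220 back into service in stages (5% → 10% → 15% → 25% …, with a soak period of roughly fifteen minutes between commits) rather than repooling at full weight at once. A sketch of how such a ramp could be generated; the stage list mirrors the log, but the `dbctl instance … pool -p` invocation and the wrapper itself are assumptions, not the actual cookbook.

```python
# Illustrative ramp-up stages taken from the log entries for db2220;
# real tooling would also sleep/soak between stages and verify health.
STAGES = (5, 10, 15, 25, 50, 75, 100)


def repool_commands(instance, task, stages=STAGES):
    """Yield one (command, log message) pair per ramp-up stage.

    The command string assumes dbctl's "instance <name> pool -p <pct>"
    form; the message matches the "(re)pooling @ N%" entries in the SAL.
    """
    for pct in stages:
        cmd = f"dbctl instance {instance} pool -p {pct}"
        msg = f"{instance} (re)pooling @ {pct}%: {task}"
        yield cmd, msg


cmds = list(repool_commands("db2220", "T375050"))
```

The gradual ramp limits blast radius: if the freshly demoted master misbehaves, it does so while carrying only a small share of read traffic.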
[17:29:53] !log re-enable puppet on A:cp to finish rolling out T347114 [17:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:57] T347114: NetworkProbeLimit cookie for Probenet overwritten on every link hover event - https://phabricator.wikimedia.org/T347114 [17:37:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 50%: T375050', diff saved to https://phabricator.wikimedia.org/P69308 and previous config saved to /var/cache/conftool/dbconfig/20240918-173749-arnaudb.json [17:37:54] T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050 [17:39:13] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073818 (https://phabricator.wikimedia.org/T373976) (owner: 10Elukey) [17:40:45] (03PS1) 10JMeybohm: wikikube: Remove remaining hiera files and role for non stacked masters [puppet] - 10https://gerrit.wikimedia.org/r/1073857 (https://phabricator.wikimedia.org/T353464) [17:42:59] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073802 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [17:44:12] (03PS2) 10JMeybohm: wikikube: Remove remaining hiera files and role for non stacked masters [puppet] - 10https://gerrit.wikimedia.org/r/1073857 (https://phabricator.wikimedia.org/T353464) [17:44:12] (03PS1) 10JMeybohm: wikikube: Disable requestctl ferm rules and definitions [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366) [17:44:43] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [17:46:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [17:47:12] (03Merged) 10jenkins-bot: Revert^2 "Create group for assigning 
checkuser-temporary-account right" [extensions/CheckUser] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073853 (https://phabricator.wikimedia.org/T369187) (owner: 10Dreamy Jazz) [17:47:31] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1073853|Revert^2 "Create group for assigning checkuser-temporary-account right" (T369187)]] [17:47:35] T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187 [17:49:41] (03PS4) 10JMeybohm: Fix ferm_status to actually compare rules [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) [17:49:42] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1073853|Revert^2 "Create group for assigning checkuser-temporary-account right" (T369187)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:49:42] (03PS3) 10JMeybohm: wikikube: Remove remaining hiera files and role for non stacked masters [puppet] - 10https://gerrit.wikimedia.org/r/1073857 (https://phabricator.wikimedia.org/T353464) [17:49:42] (03PS2) 10JMeybohm: wikikube: Disable requestctl ferm rules and definitions [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366) [17:51:02] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [17:52:06] (03PS3) 10JMeybohm: wikikube: Disable requestctl ferm rules and definitions [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366) [17:52:13] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [17:52:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 75%: T375050', diff saved to https://phabricator.wikimedia.org/P69309 and previous config saved to /var/cache/conftool/dbconfig/20240918-175255-arnaudb.json [17:52:59] T375050: 
Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050 [17:55:49] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073853|Revert^2 "Create group for assigning checkuser-temporary-account right" (T369187)]] (duration: 08m 18s) [17:55:54] T369187: Allow users to be autopromoted into checkuser-temporary-account-viewer group based on local criteria - https://phabricator.wikimedia.org/T369187 [17:56:00] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1073760 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [17:56:03] Finished my deploys for now [17:56:17] jouncebot: nowandnext [17:56:17] For the next 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1700) [17:56:17] In 0 hour(s) and 3 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1800) [18:00:05] jnuche and dduvall: Your horoscope predicts another MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T1800). 
[18:01:02] (03PS1) 10Muehlenhoff: Revert "Remove puppetmaster1003 from active Puppet 5 servers" [puppet] - 10https://gerrit.wikimedia.org/r/1073860 (https://phabricator.wikimedia.org/T373888) [18:04:49] (03CR) 10Dzahn: "This second option seems like it would require some more changes because currently the class httpd is instantiated inside the class profil" [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:04:57] (03PS3) 10Dzahn: gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) [18:05:19] (03CR) 10CI reject: [V:04-1] gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:05:40] FIRING: KubernetesRsyslogDown: rsyslog on kubernetes2056:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2056 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:06:17] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10158427 (10RobH) So the SEL/idrac logs show no thermal events, and dell support is attempting to deny these support requests. On checking cp3071, I don't see any thermal events in the logs: ` r... 
[18:08:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 100%: T375050', diff saved to https://phabricator.wikimedia.org/P69310 and previous config saved to /var/cache/conftool/dbconfig/20240918-180800-arnaudb.json [18:08:06] T375050: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T375050 [18:10:38] (03PS4) 10Dzahn: gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) [18:11:00] (03CR) 10CI reject: [V:04-1] gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:11:52] (03PS5) 10Dzahn: gerrit::proxy: files managed under /var/www/ require httpd [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) [18:12:46] (03CR) 10Scott French: [C:03+1] services: remove old poolcounter nodes from MW's net policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073802 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [18:20:41] (03PS6) 10Dzahn: gerrit::proxy: ensure /var/www/ exists before files under it [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) [18:21:02] (03CR) 10CI reject: [V:04-1] gerrit::proxy: ensure /var/www/ exists before files under it [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:23:10] (03PS7) 10Dzahn: gerrit::proxy: ensure /var/www/ exists before files under it [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) [18:23:44] (03CR) 10CDanis: [C:03+1] wikikube: Disable requestctl ferm rules and definitions [puppet] - 10https://gerrit.wikimedia.org/r/1073859 (https://phabricator.wikimedia.org/T374366) (owner: 10JMeybohm) [18:24:48] (03CR) 10Dzahn: [V:03+1] 
"https://puppet-compiler.wmflabs.org/output/1073305/4028/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1073305 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:35:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubernetes2056:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2056 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:36:03] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - elastic1089 - https://phabricator.wikimedia.org/T374897#10158542 (10Dzahn) [18:38:29] 06SRE, 10Observability-Metrics, 13Patch-For-Review: Audit hosts in 'misc' cluster - https://phabricator.wikimedia.org/T375066#10158541 (10Dzahn) Would it make sense to have clusters named after a team to group all machines owned by a specific subteam? Or would that go against the purpose of clusters and th... [18:45:47] 06SRE, 06collaboration-services, 10vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10158573 (10Dzahn) @Seddon Could you confirm if the google groups work for you and you are receiving mails there? I think if that's the... [18:46:04] 06SRE, 06collaboration-services, 10vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10158575 (10Dzahn) 05Open→03In progress [18:46:16] 06SRE, 06collaboration-services, 10vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10158576 (10Dzahn) a:03Seddon [19:08:57] (03PS1) 10C. 
Scott Ananian: Re-order arguments to DataAccess::addTrackingCategory [core] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073871 [19:39:02] (03PS1) 10Sohom Datta: Bring back quality colors before dark mode fixes [extensions/ProofreadPage] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1073879 (https://phabricator.wikimedia.org/T375114) [19:39:03] (03PS1) 10Mforns: Modify service commons-impact-analytics to use data-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073880 (https://phabricator.wikimedia.org/T368035) [19:53:47] (03PS1) 10JHathaway: ci: fix bundle on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1073893 [19:57:41] (03CR) 10JHathaway: [C:03+2] ci: fix bundle on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1073893 (owner: 10JHathaway) [19:57:58] (03CR) 10Gmodena: Declare stream 'mediawiki.dump.revision_history.reconcile.v1.rc0' (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T2000). nyaa~ [20:00:04] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:54] (03PS1) 10JHathaway: WIP - test [puppet] - 10https://gerrit.wikimedia.org/r/1073896 [20:01:50] o/ [20:03:07] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1073896 (owner: 10JHathaway) [20:09:14] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375130 (10phaultfinder) 03NEW [20:15:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:03] (03Abandoned) 10JHathaway: WIP - test [puppet] - 10https://gerrit.wikimedia.org/r/1073896 (owner: 10JHathaway) [20:17:38] (03PS5) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [20:18:55] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:25:40] Hi everyone we're gonna do some deploys! [20:25:47] Jdlrobson: lmk when you're here and ready [20:26:52] toyofuku: ready [20:26:59] Sounds good [20:27:16] Any reason they shouldn't all go out in one batch? 
I'm guessing not based on my understanding [20:29:00] While we wait, the song rec of the day is Se Me Olvida by Maisak and Feid [20:29:06] (03CR) 10CI reject: [V:04-1] sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [20:30:10] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:30:57] toyofuku: they can all go out together [20:31:06] roger that [20:31:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073836 (https://phabricator.wikimedia.org/T370099) (owner: 10Jdlrobson) [20:31:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073835 (https://phabricator.wikimedia.org/T374255) (owner: 10Jdlrobson) [20:31:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073839 (https://phabricator.wikimedia.org/T374654) (owner: 10Jdlrobson) [20:32:04] (03Merged) 10jenkins-bot: Deploy Vector 2022 on several Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073835 (https://phabricator.wikimedia.org/T374255) (owner: 10Jdlrobson) [20:32:08] (03Merged) 10jenkins-bot: Enable dark mode for all logged in users on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073836 (https://phabricator.wikimedia.org/T370099) (owner: 10Jdlrobson) [20:32:09] (03Merged) 10jenkins-bot: Limit quick surveys to wikis with messages defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073839 
(https://phabricator.wikimedia.org/T374654) (owner: 10Jdlrobson) [20:32:32] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1073836|Enable dark mode for all logged in users on all projects (T370099)]], [[gerrit:1073835|Deploy Vector 2022 on several Wikimedia wikis (T374255)]], [[gerrit:1073839|Limit quick surveys to wikis with messages defined (T374654)]] [20:32:39] T370099: Roll out dark mode to all projects (non-Wikipedia sites, logged-in users) - https://phabricator.wikimedia.org/T370099 [20:32:39] T374255: Deploy Vector 2022 on small wikis - https://phabricator.wikimedia.org/T374255 [20:32:40] T374654: Log messages at ERROR level on QuickSurvey channel: "Bad survey configuration: The XXX external survey must have a secure url." - https://phabricator.wikimedia.org/T374654 [20:35:00] !log toyofuku@deploy1003 toyofuku, jdlrobson: Backport for [[gerrit:1073836|Enable dark mode for all logged in users on all projects (T370099)]], [[gerrit:1073835|Deploy Vector 2022 on several Wikimedia wikis (T374255)]], [[gerrit:1073839|Limit quick surveys to wikis with messages defined (T374654)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:35:09] Jdlrobson: we're on test servers! [20:35:55] While we wait for him to test, another banger: Yayo by Rema [20:36:30] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10158825 (10phaultfinder) [20:38:49] (toyofuku: looking) [20:38:57] ty ty [20:40:23] ok all LGTM toyofuku please sync! 
[20:40:29] on it [20:40:31] !log toyofuku@deploy1003 toyofuku, jdlrobson: Continuing with sync [20:43:07] (03PS6) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [20:45:15] PROBLEM - Ensure acme-chief-api is running on acmechief1002 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief [20:45:24] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1073836|Enable dark mode for all logged in users on all projects (T370099)]], [[gerrit:1073835|Deploy Vector 2022 on several Wikimedia wikis (T374255)]], [[gerrit:1073839|Limit quick surveys to wikis with messages defined (T374654)]] (duration: 12m 52s) [20:45:31] T370099: Roll out dark mode to all projects (non-Wikipedia sites, logged-in users) - https://phabricator.wikimedia.org/T370099 [20:45:31] T374255: Deploy Vector 2022 on small wikis - https://phabricator.wikimedia.org/T374255 [20:45:32] T374654: Log messages at ERROR level on QuickSurvey channel: "Bad survey configuration: The XXX external survey must have a secure url." - https://phabricator.wikimedia.org/T374654 [20:45:59] Jdlrobson: all done! [20:46:15] RECOVERY - Ensure acme-chief-api is running on acmechief1002 is OK: PROCS OK: 1 process with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief [20:46:24] thanks toyofuku ! [20:46:26] Appreciated! [20:46:34] (03PS1) 10Scott French: mw-(api-ext|web): scale up to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073904 (https://phabricator.wikimedia.org/T371273) [20:46:37] 🫡 [20:48:48] (03CR) 10Scott French: "This will be merged and applied ahead of depooling the RO services in codfw tomorrow. Thanks in advance for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1073904 (https://phabricator.wikimedia.org/T371273) (owner: 10Scott French) [20:52:28] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374741#10158902 (10Jhancock.wm) a:03Papaul This has been physically decommed, and offline in netbox. @papaul, it is ready for you to remove the e... [20:52:36] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374741#10158910 (10Jhancock.wm) [20:55:27] (03CR) 10CI reject: [V:04-1] sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [20:55:31] (03PS7) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [20:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:58:33] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd200[4-5] - https://phabricator.wikimedia.org/T372512#10158928 (10Jhancock.wm) a:03Jhancock.wm [20:58:52] swfrench-wmf: about to start the circular replication cookbook [20:59:12] !log ladsgroup@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw [20:59:34] Amir1: ack, thanks for the heads-up! [20:59:55] is this the first time it's been used "for real"?
[20:59:59] yes [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240918T2100) [21:00:38] * swfrench-wmf grabs popcorn [21:00:44] looking good s1 [21:00:59] s1 is good [21:01:04] I'll wait a bit before the next one [21:01:09] in case things break [21:01:13] yep [21:03:29] btw, these replicas are depooled. Some I know why but some I'm not sure: https://phabricator.wikimedia.org/P69307 [21:03:45] logs look good [21:04:19] let's add those to review, will ask A. too, as there are a lot of ongoing decoms etc [21:04:48] thanks [21:04:51] https://www.irccloud.com/pastebin/VtxVUtCO/ [21:05:11] certainly we should do a general review of hosts and weights [21:05:17] this is the most important part for me. Root is only us, but if it's RW, then mw might write stuff [21:05:21] manuel used to do those before switch [21:05:33] yeah [21:05:37] moving on to s2 [21:06:29] I also checked mw logs, nothing worrying there [21:09:59] the errors in s2 are because of this: https://phabricator.wikimedia.org/T374852#10158957 [21:10:22] (03CR) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script (0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [21:11:16] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2138.codfw.wmnet - https://phabricator.wikimedia.org/T374852#10158957 (10Ladsgroup) You need to remove them from orchestrator too: https://wikitech.wikimedia.org/wiki/MariaDB/Decommissioning_a_DB_Host#Remove_host_from_orchestrat... [21:11:30] (03PS8) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [21:13:41] Amir1: just to be clear, you mean orch errors, not script/app errors, right?
[21:13:47] yeah [21:13:52] this [21:13:55] ok, thanks [21:14:03] https://usercontent.irccloud-cdn.com/file/6txoYbrg/grafik.png [21:14:07] that makes me not worry [21:15:01] (03PS7) 10Andrea Denisse: alert: Ensure Prometheus Alertmanager starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/1073903 (https://phabricator.wikimedia.org/T375138) [21:15:01] (03CR) 10Andrea Denisse: [V:03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1073903/4035/" [puppet] - 10https://gerrit.wikimedia.org/r/1073903 (https://phabricator.wikimedia.org/T375138) (owner: 10Andrea Denisse) [21:15:20] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2138.codfw.wmnet - https://phabricator.wikimedia.org/T374852#10158977 (10Jhancock.wm) 05Open→03Resolved [21:15:42] s4 now [21:18:47] s5 [21:19:03] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2137.codfw.wmnet - https://phabricator.wikimedia.org/T374851#10158988 (10Jhancock.wm) 05Open→03Resolved [21:20:29] s6 [21:23:55] RESOLVED: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:04] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2127.codfw.wmnet - https://phabricator.wikimedia.org/T374849#10158999 (10Jhancock.wm) 05Open→03Resolved [21:27:09] only x1 left? 
[21:28:01] x1 is done [21:28:06] RW ES sections left [21:28:19] es6 and es7 [21:28:41] there may be an issue on x1 [21:29:00] the primary master is not replicating [21:29:29] yeah and es6 too [21:30:08] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2125.codfw.wmnet - https://phabricator.wikimedia.org/T374848#10159021 (10Jhancock.wm) 05Open→03Resolved [21:30:09] not a breaking thing, but let me help investigate (I won't touch anything) [21:30:41] (03CR) 10Bking: [C:03+1] airflow: allow the webserver and scheduler to be selectively deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073447 (https://phabricator.wikimedia.org/T374936) (owner: 10Brouberol) [21:30:20] feel free to touch anything :D [21:31:47] Could not execute Update_rows_v1 event on table heartbeat.heartbeat; Can't find record in 'heartbeat' [21:31:57] heartbeat table wasn't properly cleaned up [21:33:24] I know why [21:33:37] x1 and es use row-based replication [21:34:05] this means that the REPLACE gets translated into an update-row event [21:34:20] but that row doesn't exist on eqiad [21:34:24] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2121.codfw.wmnet - https://phabricator.wikimedia.org/T374845#10159058 (10Jhancock.wm) 05Open→03Resolved [21:34:45] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2122.codfw.wmnet - https://phabricator.wikimedia.org/T374846#10159062 (10Jhancock.wm) 05Open→03Resolved a:05ABran-WMF→03Jhancock.wm [21:35:14] this is an easy fix, but given nothing is broken, let me make sure I fix it properly and I don't break the eqiad replicas (I just need to insert a row on eqiad master) [21:35:17] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2124.codfw.wmnet - https://phabricator.wikimedia.org/T374847#10159060 (10Jhancock.wm) 05Open→03Resolved [21:39:50] Amir1: ok for me to apply the change to x1 master eqiad and restart replication?
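The failure mode described above (a REPLACE under row-based replication hitting a replica whose heartbeat table was emptied) can be sketched as follows. This is a hedged illustration: the column list is simplified from the usual pt-heartbeat schema, and the server_id value is made up.

```sql
-- Sketch only: simplified schema, illustrative server_id.
-- The heartbeat tool periodically runs a REPLACE on the master:
REPLACE INTO heartbeat.heartbeat (ts, server_id)
    VALUES (NOW(6), 171970572);

-- With binlog_format=ROW, when the row for this server_id already exists
-- on the master, the REPLACE is logged as an Update_rows event rather
-- than as the original statement. A replica whose heartbeat table was
-- cleaned up has no row to update, so its SQL thread stops with:
--   Could not execute Update_rows_v1 event on table heartbeat.heartbeat;
--   Can't find record in 'heartbeat'
```

Under statement-based replication the same REPLACE would simply have inserted the missing row on the replica, which is why the problem is specific to the ROW-format sections (x1 and es).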
[21:39:57] yeah sure [21:40:26] x1 done [21:40:50] Thanks! [21:41:00] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10159081 (10phaultfinder) [21:42:19] PROBLEM - MariaDB Replica SQL: x1 on db2196 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table heartbeat.heartbeat: Duplicate entry 180360966 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1220-bin.015294, end_log_pos 946695232 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:43:11] PROBLEM - Disk space on seaborgium is CRITICAL: DISK CRITICAL - free space: / 718 MB (3% inode=92%): /tmp 718 MB (3% inode=92%): /var/tmp 718 MB (3% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=seaborgium&var-datasource=eqiad+prometheus/ops [21:43:43] oh no [21:43:57] heartbeat dupe is weird! [21:44:23] jynus: is the alert about x1 just a delayed response to what you've now repaired manually? 
[21:44:34] yeah, but it is causing fallout [21:45:37] (03PS1) 10JHathaway: ci: upgrade to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073906 (https://phabricator.wikimedia.org/T330490) [21:45:50] ack, let me know if you need more hands / eyes on anything [21:46:05] (03CR) 10CI reject: [V:04-1] ci: upgrade to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073906 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [21:47:15] (03PS2) 10JHathaway: ci: upgrade to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073906 (https://phabricator.wikimedia.org/T330490) [21:47:45] (03CR) 10CI reject: [V:04-1] ci: upgrade to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1073906 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [21:48:19] RECOVERY - MariaDB Replica SQL: x1 on db2196 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:48:21] okay, there are two rows in eqiad master [21:48:25] but one row in codfw [21:48:27] I have deployed a temporary fix, which is replicate-wild-ignore-table=heartbeat.% [21:48:44] and that should work for mediawiki, but we are in a weird state [21:49:39] the solution would be easy if we were on statement-based replication [21:49:52] because of the replaces [21:51:50] I wonder how to best move forward with es [21:53:23] because what I've done for x1 is just delay the issue until switchover [21:55:31] !log seaborgium - apt-get clean (disk space before: 98% used, now: 76% used, was alerting) [21:55:33] I think the right way to fix it is to undo the circular replication and insert the row, or insert it without logging [21:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:13] Amir1: on your side is everything finished?
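The temporary fix mentioned above (replicate-wild-ignore-table=heartbeat.%) is a replication filter. As a hedged sketch, in MariaDB versions where replication filters can be changed at runtime it could be applied dynamically on the affected replica roughly like this:

```sql
-- Sketch of the temporary workaround: tell the replica's SQL thread to
-- skip every table in the heartbeat schema, so the failing Update_rows
-- events are no longer applied. This hides the symptom rather than
-- restoring the missing row. Assumes a MariaDB version where this
-- filter is dynamic; otherwise it goes in my.cnf plus a restart.
STOP SLAVE;
SET GLOBAL replicate_wild_ignore_table = 'heartbeat.%';
START SLAVE;
```

As noted in the discussion, this only delays the problem until switchover, because the heartbeat rows on the two sides remain inconsistent.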
[21:57:21] I haven't done s7 yet [21:57:25] es7 [21:57:31] should I do it and then we revert it [21:57:33] yeah sorry [21:57:53] do you know the funny thing - this happened because heartbeat was cleaned :-D [21:58:16] if it was "dirty" it would have worked [21:58:56] :(( [22:00:32] let me try to fix es6 in a cleaner way [22:00:48] by inserting without binlog a new codfw row [22:02:23] thanks [22:03:11] RECOVERY - Disk space on seaborgium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=seaborgium&var-datasource=eqiad+prometheus/ops [22:05:52] yeah, that works for es6 [22:06:09] will do it now for es7 ahead of the circular [22:06:26] Thanks [22:06:36] let me know once you're done and I will run it [22:06:58] yep, taking my time, to make sure I don't break stuff [22:07:44] (03CR) 10RLazarus: [C:03+1] mw-(api-ext|web): scale up to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073904 (https://phabricator.wikimedia.org/T371273) (owner: 10Scott French) [22:11:08] Amir1: done, you should be good to go and it should just work [22:11:18] going [22:11:32] waiting to check everything is ok before fixing x1 for real [22:11:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from eqiad to codfw [22:12:05] done ^ [22:12:16] and this time it looks good [22:12:50] ok, then fixing x1 codfw to remove the ignore table [22:13:16] will do the same thing, apply the change without logs on all codfw hosts [22:17:47] thanks [22:18:50] I should have logged all of this [22:19:12] !log inserting without binlog missing heartbeat record on x1 codfw hosts [22:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:36] and we should be ok [22:23:03] as in healthy/as expected/no hidden bomb [22:23:28] and, as we should, we have circular replication everywhere it should be [22:24:31] my first thought of
why this happened is that either there was something for ROW that we missed or this was a "hidden" bomb after cleaning up heartbeat, that only showed up in this one [22:25:10] and we should either not cleanup ROW replicas or add the record beforehand [22:25:26] maybe it was something else, but this is my first impression [22:25:33] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker1007.eqiad.wmnet, mw1477.eqiad.wmnet, parse1013.eqiad.wmnet, mw1380.eqiad.wmnet, mw1448.eqiad.wmnet, parse1007.eqiad.wmnet, mw1451.eqiad.wmnet, wikikube-worker1020.eqiad.wmnet, mw1367.eqiad.wmnet, mw1475.eqiad.wmnet, mw1459.eqiad.wmnet, parse1011.eqiad.wmnet, mw1476.eqiad.wmnet, kubernetes1062.eqiad.wmnet, k [22:25:34] 1022.eqiad.wmnet, mw1384.eqiad.wmnet, mw1479.eqiad.wmnet, mw1378.eqiad.wmnet, mw1462.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, mw1388.eqiad.wmnet, mw1480.eqiad.wmnet, mw1482.eqiad.wmnet, parse1009.eqiad.wmnet, kubernetes1040.eqiad.wmnet, mw1405.eqiad.wmnet, mw1495.eqiad.wmnet, kubernetes1030.eqiad.wmnet, kubernetes1038.eqiad.wmnet, mw1424.eqiad.wmnet, mw1461.eqiad.wmnet, mw1488.eqiad.wmnet, parse1010.eqiad.wmnet, wikikube-work [22:25:34] iad.wmnet, mw1465.eqiad.wmnet, wikikube-worker1018.eqiad.wmnet, mw1389.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, parse1012.eqiad.wmnet, wikikube-worker1025.eqiad.wmnet, mw149 https://wikitech.wikimedia.org/wiki/PyBal [22:25:37] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker1007.eqiad.wmnet, mw1451.eqiad.wmnet, mw1433.eqiad.wmnet, mw1380.eqiad.wmnet, mw1462.eqiad.wmnet, mw1457.eqiad.wmnet, mw1455.eqiad.wmnet, mw1475.eqiad.wmnet, mw1374.eqiad.wmnet, wikikube-worker1013.eqiad.wmnet, parse1011.eqiad.wmnet, mw1439.eqiad.wmnet, kubernetes1011.eqiad.wmnet, wikikube-worker1029.eqiad.w [22:25:37] 386.eqiad.wmnet, mw1384.eqiad.wmnet, parse1013.eqiad.wmnet, 
mw1479.eqiad.wmnet, mw1470.eqiad.wmnet, mw1390.eqiad.wmnet, mw1430.eqiad.wmnet, parse1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, mw1495.eqiad.wmnet, parse1014.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1463.eqiad.wmnet, mw1435.eqiad.wmnet, mw1424.eqiad.wmnet, mw1454.eqiad.wmnet, parse1005.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1017.eqiad.wmnet, wikikube-w [22:25:37] .eqiad.wmnet, mw1477.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, parse1012.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1496.eqiad.wmnet, kubernetes1060.eqiad.wmnet, mw1449.eqiad https://wikitech.wikimedia.org/wiki/PyBal [22:26:12] Amir1: my guess is you would have encountered this no matter the method of setting up circular replication [22:26:41] sigh [22:27:22] joy [22:27:33] the lvs issue can't be us. Right? [22:28:12] looking at that now [22:28:18] Amir1: no, that's likely an issue with the eventstreams service [22:28:27] looks like something has gone sideways with eventstreams, yeah [22:28:31] Amir I am filing this empty https://phabricator.wikimedia.org/T375144 [22:28:35] and going to bed :-D [22:28:45] I go to the airport now [22:28:54] thank you both for working on this, Amir1 and jynus! 
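The cleaner fix applied above ("inserting without binlog a new codfw row", logged at 22:19:12) can be sketched as follows. The column list and the server_id value are illustrative, not the production schema.

```sql
-- Sketch of the repair: disable binary logging for this session only,
-- so the INSERT is applied locally on each host that is missing the row
-- and does not itself travel around the circular replication topology.
SET SESSION sql_log_bin = 0;
INSERT INTO heartbeat.heartbeat (ts, server_id)
    VALUES (NOW(6), 171970572);  -- illustrative server_id
SET SESSION sql_log_bin = 1;
```

Once the row exists on every host, subsequent Update_rows events from the heartbeat REPLACE apply cleanly and the replicate-wild-ignore-table workaround can be removed.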
[22:28:56] ping me if something breaks really bad [22:30:32] the surprising thing about this is why this didn't break before, not why it broke today :-D [22:35:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [22:36:12] !incidents [22:36:13] 5258 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [22:36:19] !ack 5258 [22:36:19] 5258 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [22:40:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [22:40:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventstreams.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:40:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://eventstreams.svc.eqiad.wmnet:4892 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:43:02] !incidents [22:43:02] 5259 (UNACKED) ATSBackendErrorsHigh cache_text sre (eventstreams.discovery.wmnet eqiad) [22:43:02] 5258 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [22:43:09] !ack 5259 [22:43:10] 5259 (ACKED) ATSBackendErrorsHigh 
cache_text sre (eventstreams.discovery.wmnet eqiad) [22:43:17] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: sync [22:43:44] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync [22:44:34] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:44:37] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:45:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventstreams.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:45:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://eventstreams.svc.eqiad.wmnet:4892 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:03:33] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers parse1011.eqiad.wmnet, mw1419.eqiad.wmnet, mw1386.eqiad.wmnet, mw1470.eqiad.wmnet, mw1462.eqiad.wmnet, mw1388.eqiad.wmnet, mw1480.eqiad.wmnet, parse1009.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1435.eqiad.wmnet, mw1488.eqiad.wmnet, mw1454.eqiad.wmnet, parse1010.eqiad.wmnet, mw1425.eqiad.wmnet, kubernetes1012.eqiad.wmnet, [23:03:34] qiad.wmnet, kubernetes1033.eqiad.wmnet, kubernetes1014.eqiad.wmnet, mw1367.eqiad.wmnet, mw1486.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1360.eqiad.wmnet, mw1458.eqiad.wmnet, mw1468.eqiad.wmnet, mw1464.eqiad.wmnet, parse1019.eqiad.wmnet, 
kubernetes1056.eqiad.wmnet, mw1472.eqiad.wmnet, kubernetes1035.eqiad.wmnet, mw1379.eqiad.wmnet, parse1007.eqiad.wmnet, wikikube-worker1020.eqiad.wmnet, wikikube-worker1022.eqiad.wmnet, mw1378.eqiad.wmne [23:03:34] .eqiad.wmnet, mw1482.eqiad.wmnet, mw1357.eqiad.wmnet, mw1496.eqiad.wmnet, kubernetes1060.eqiad.wmnet, kubernetes1020.eqiad.wmnet, mw1397.eqiad.wmnet, kubernetes1027.eqiad.wmnet, mw1414. https://wikitech.wikimedia.org/wiki/PyBal [23:03:37] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers parse1011.eqiad.wmnet, mw1380.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1419.eqiad.wmnet, mw1434.eqiad.wmnet, mw1479.eqiad.wmnet, kubernetes1023.eqiad.wmnet, mw1462.eqiad.wmnet, kubernetes1030.eqiad.wmnet, parse1021.eqiad.wmnet, mw1424.eqiad.wmnet, mw1393.eqiad.wmnet, mw1488.eqiad.wmnet, mw1454.eqiad.wmnet, parse1005.eqia [23:03:37] wikikube-worker1003.eqiad.wmnet, mw1370.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1425.eqiad.wmnet, mw1395.eqiad.wmnet, mw1465.eqiad.wmnet, kubernetes1014.eqiad.wmnet, mw1466.eqiad.wmnet, kubernetes1018.eqiad.wmnet, mw1369.eqiad.wmnet, mw1469.eqiad.wmnet, mw1486.eqiad.wmnet, wikikube-worker1001.eqiad.wmnet, mw1458.eqiad.wmnet, parse1001.eqiad.wmnet, mw1453.eqiad.wmnet, mw1468.eqiad.wmnet, wikikube-worker1010.eqiad.wmnet, kubernetes1015. 
[23:03:37] et, kubernetes1008.eqiad.wmnet, kubernetes1031.eqiad.wmnet, mw1464.eqiad.wmnet, mw1391.eqiad.wmnet, wikikube-worker1028.eqiad.wmnet, kubernetes1056.eqiad.wmnet, parse1006.eqiad.wmnet, p https://wikitech.wikimedia.org/wiki/PyBal [23:14:37] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:15:33] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:16:41] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10159193 (10Papaul) @Dwisehaupt hello since we decommissioned frban2001 is it possible for you to downtime and power down pay-lb2001 for us tomorrow... [23:30:12] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375037#10159198 (10phaultfinder) [23:38:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1073911 [23:38:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1073911 (owner: 10TrainBranchBot) [23:59:33] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10159218 (10Dwisehaupt) @Papaul All set. Powered down and set a downtime for 26 hours.