[00:01:59] <wikibugs>	 (CR) Andrea Denisse: alert: Failover from alert1001 to alert2002 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[00:03:56] <icinga-wm>	 PROBLEM - SSH on puppetserver1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:07:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on parse2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:11:02] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072316 (owner: TrainBranchBot)
[00:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:12:27] <wikibugs>	 (PS1) Dzahn: gerrit: add gerrit::proxy profile to insetup::gerrit role [puppet] - https://gerrit.wikimedia.org/r/1072323 (https://phabricator.wikimedia.org/T372804)
[00:12:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T371742)', diff saved to https://phabricator.wikimedia.org/P68992 and previous config saved to /var/cache/conftool/dbconfig/20240912-001246-ladsgroup.json
[00:12:51] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[00:14:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:14:12] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:16:39] <wikibugs>	 (PS1) Ladsgroup: tables-catalog: Add more extension tables [puppet] - https://gerrit.wikimedia.org/r/1072324 (https://phabricator.wikimedia.org/T363581)
[00:16:58] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:17:04] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:17:23] <wikibugs>	 (CR) Scott French: "Thanks, Luca!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[00:17:28] <wikibugs>	 (PS4) Andrea Denisse: alert: Failover from alert1001 to alert2002 [puppet] - https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418)
[00:20:29] <wikibugs>	 (CR) Ladsgroup: [C:+2] tables-catalog: Add more extension tables [puppet] - https://gerrit.wikimedia.org/r/1072324 (https://phabricator.wikimedia.org/T363581) (owner: Ladsgroup)
[00:20:41] <wikibugs>	 (CR) Pppery: "Some of these sites would ideally point to more specific domains like mediawiki.wiki -> mediawiki.org rather than wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor)
[00:22:07] <wikibugs>	 (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1072323/3960/gerrit2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1072323 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[00:22:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:12] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:27:04] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:27:04] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:27:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:27:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P68993 and previous config saved to /var/cache/conftool/dbconfig/20240912-002753-ladsgroup.json
[00:28:36] <wikibugs>	 (CR) Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (9 comments) [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor)
[00:29:19] <wikibugs>	 (CR) Andrea Denisse: alert: Failover from alert1001 to alert2002 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[00:29:59] <wikibugs>	 (CR) Dzahn: "agree with Pppery, see examples as inline comments. What's a good way to crowdsource the redirect mappings here? Can we upload manual patc" [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor)
[00:30:07] <Amir1>	 puppetservers are not happy
[00:30:14] <Amir1>	 and my puppet merge is stuck
[00:31:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:32:00] <icinga-wm>	 RECOVERY - SSH on puppetserver1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:32:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:32:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:33:52] <wikibugs>	 (PS6) Andrea Denisse: alert: Failover from alert2002 to alert1002 [puppet] - https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418)
[00:35:12] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:35:46] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:36:03] <mutante>	 Amir1: normal puppet-merge is still on puppetmaster, not puppetserver
[00:36:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:36:29] <Amir1>	 I did that
[00:36:41] <mutante>	 I had no problems there and it seems merged now
[00:36:42] <Amir1>	 but it got stuck half way through sync
[00:36:58] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:37:04] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:37:04] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:37:04] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:37:07] <wikibugs>	 (CR) Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor)
[00:37:10] <mutante>	 well, this looks like it's fixing itself
[00:37:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1170 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:37:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:37:18] <Amir1>	 https://www.irccloud.com/pastebin/UUbLu1uz/
[00:37:20] <mutante>	 puppetserver1001 was busy but not down
[00:37:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: rsyslog-imfile-remedy.service on parse2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:29] <mutante>	 the other alerts were all just about syncing to 1001
[00:37:45] <wikibugs>	 (CR) Krinkle: logging: Fix local variables leaking into global scope (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716 (owner: Bartosz Dziewoński)
[00:37:47] <Amir1>	 I pasted what happened
[00:38:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:38:11] <Amir1>	 if it's fixing itself, I have no complaints :D 
[00:38:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:39:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:39:16] <wikibugs>	 (CR) Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor)
[00:39:28] <wikibugs>	 (PS1) Andrea Denisse: alert: Resolve alerts DNS queries to alert2002 [dns] - https://gerrit.wikimedia.org/r/1072326 (https://phabricator.wikimedia.org/T372418)
[00:39:37] <wikibugs>	 (PS2) Andrea Denisse: alert: Resolve alerts DNS queries to alert2002 [dns] - https://gerrit.wikimedia.org/r/1072326 (https://phabricator.wikimedia.org/T372418)
[00:40:12] <mutante>	 Amir1: puppet works on puppetserver1001, wfm
[00:40:13] <Amir1>	 https://wikitech.wikimedia.org/wiki/Puppet#puppet-merge_fails_to_sync_on_secondary
[00:40:14] <mutante>	 now
[00:40:26] <Amir1>	 funnily I wanted to try this, it didn't let me
[00:41:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1173 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:41:06] <Amir1>	 anywayyy
[00:41:12] <Amir1>	 I call it a "day"
[00:41:38] <mutante>	 yea, sounds like night :P
[00:42:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:43:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P68994 and previous config saved to /var/cache/conftool/dbconfig/20240912-004301-ladsgroup.json
[00:43:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1170 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:43:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:44:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1173 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:45:12] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:45:46] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1143 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:47:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:51:10] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:51:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:51:58] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[00:53:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1143 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:54:16] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 7447 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[00:57:05] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10139467 (Papaul)  Some notes here: I checked console redirect, it was working for me and the issue i found was th...
[00:58:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T371742)', diff saved to https://phabricator.wikimedia.org/P68995 and previous config saved to /var/cache/conftool/dbconfig/20240912-005808-ladsgroup.json
[00:58:10] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance
[00:58:12] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[00:58:23] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance
[00:58:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T371742)', diff saved to https://phabricator.wikimedia.org/P68996 and previous config saved to /var/cache/conftool/dbconfig/20240912-005830-ladsgroup.json
[00:59:55] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10139485 (Papaul) @elukey for console redirect to work on sretest2001 below are the settings. Thanks let me know if you have any questions  {F57501123} {F57501126}
[01:00:05] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139490 (phaultfinder)
[01:01:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1146 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:02:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:03:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:18:43] <wikibugs>	 (PS1) Bartosz Dziewoński: logging: Fix WikimediaDebug "Verbose logging" option [mediawiki-config] - https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583)
[01:19:04] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:19:30] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583) (owner: Bartosz Dziewoński)
[01:27:04] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:34:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139515 (phaultfinder)
[01:39:04] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:39:46] <icinga-wm>	 PROBLEM - dump of s8 in eqiad on backupmon1001 is CRITICAL: dump for s8 at eqiad (db1171) taken more than a week ago: Most recent backup 2024-09-03 01:27:35 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:52:06] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:54:36] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:00:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T371742)', diff saved to https://phabricator.wikimedia.org/P68998 and previous config saved to /var/cache/conftool/dbconfig/20240912-020050-ladsgroup.json
[02:00:58] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[02:15:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P68999 and previous config saved to /var/cache/conftool/dbconfig/20240912-021557-ladsgroup.json
[02:19:36] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:20:58] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:23:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:23:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1122 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:24:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:29:08] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:29:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:29:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139540 (phaultfinder)
[02:30:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:30:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:30:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:31:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P69000 and previous config saved to /var/cache/conftool/dbconfig/20240912-023105-ladsgroup.json
[02:31:08] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:32:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:35:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[02:36:14] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:30] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1122 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:36:30] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:36:46] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:37:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:38:30] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1132 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:38:30] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:39:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:40:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[02:42:12] <jinxer-wm>	 FIRING: [3x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[02:42:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:43:10] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:44:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:46:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T371742)', diff saved to https://phabricator.wikimedia.org/P69001 and previous config saved to /var/cache/conftool/dbconfig/20240912-024612-ladsgroup.json
[02:46:14] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance
[02:46:16] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[02:46:28] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance
[02:46:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T371742)', diff saved to https://phabricator.wikimedia.org/P69002 and previous config saved to /var/cache/conftool/dbconfig/20240912-024635-ladsgroup.json
[02:50:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:53:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:54:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:54:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:00:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:10] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:02:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: rsyslog-imfile-remedy.service on parse2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:14:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:16:10] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:31:46] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:40:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1129 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:42:31] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10139563 (Papaul) >>! In T371434#10120335, @cmooney wrote: >>>! In T371434#10119784, @Papaul wrote: >> The diagram below will outline the cabling of...
[03:51:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T371742)', diff saved to https://phabricator.wikimedia.org/P69003 and previous config saved to /var/cache/conftool/dbconfig/20240912-035105-ladsgroup.json
[03:51:09] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[03:55:47] <wikibugs>	 ops-codfw, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, netops: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587 (Papaul) NEW
[03:55:55] <wikibugs>	 ops-codfw, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, netops: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10139580 (Papaul) p:Triage→Medium
[03:56:49] <wikibugs>	 ops-codfw, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, netops: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10139581 (Papaul)
[04:06:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P69004 and previous config saved to /var/cache/conftool/dbconfig/20240912-040613-ladsgroup.json
[04:08:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[04:13:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[04:13:37] <wikibugs>	 (CR) Pppery: "The log action for marking a page for translation is "pagetranslation", not "translationreview"." [mediawiki-config] - https://gerrit.wikimedia.org/r/1072315 (owner: Jforrester)
[04:15:17] <wikibugs>	 (CR) Pppery: "Like the idea, though - translation administration is a tedious, oftentimes underappreciated task that can easily get very backlogged." [mediawiki-config] - https://gerrit.wikimedia.org/r/1072315 (owner: Jforrester)
[04:21:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P69005 and previous config saved to /var/cache/conftool/dbconfig/20240912-042121-ladsgroup.json
[04:24:12] <wikibugs>	 (PS17) Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380)
[04:24:45] <wikibugs>	 (CR) Ebrahim: "Added MediaWiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[04:27:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:27:34] <wikibugs>	 (CR) Ebrahim: "Added MediaWiki wiki back, please review the change if possible, thank you very much" [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[04:36:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T371742)', diff saved to https://phabricator.wikimedia.org/P69006 and previous config saved to /var/cache/conftool/dbconfig/20240912-043628-ladsgroup.json
[04:36:31] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance
[04:36:34] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[04:36:55] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance
[04:37:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T371742)', diff saved to https://phabricator.wikimedia.org/P69007 and previous config saved to /var/cache/conftool/dbconfig/20240912-043701-ladsgroup.json
[04:40:04] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:40:52] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 217, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:43:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:51:58] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[04:55:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1174 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:59:04] <wikibugs>	 (CR) Ebrahim: Enable the dark mode in Portal namespace (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[05:23:49] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: productionize db2229 [puppet] - https://gerrit.wikimedia.org/r/1072216 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[05:28:42] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10139647 (ABran-WMF) Thanks for the dig!  Indeed hardware error was misleading, will reimage the server and will let you know soon.
[05:31:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T371742)', diff saved to https://phabricator.wikimedia.org/P69008 and previous config saved to /var/cache/conftool/dbconfig/20240912-053116-ladsgroup.json
[05:31:24] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[05:44:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm
[05:46:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P69009 and previous config saved to /var/cache/conftool/dbconfig/20240912-054624-ladsgroup.json
[05:49:35] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10139703 (ABran-WMF) a:VRiley-WMF→ABran-WMF
[05:50:46] <icinga-wm>	 RECOVERY - SSH on db1246 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:52:12] <jinxer-wm>	 FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[05:54:52] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote es2038 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1072337 (https://phabricator.wikimedia.org/T374592)
[05:58:21] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es7 T374592
[05:58:25] <stashbot>	 T374592: Switchover es7 master (es2039 -> es2038) - https://phabricator.wikimedia.org/T374592
[05:58:38] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T374592
[05:59:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set es2038 with weight 0 T374592', diff saved to https://phabricator.wikimedia.org/P69010 and previous config saved to /var/cache/conftool/dbconfig/20240912-055903-arnaudb.json
[06:00:19] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: Promote es2038 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1072337 (https://phabricator.wikimedia.org/T374592) (owner: Gerrit maintenance bot)
[06:01:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P69011 and previous config saved to /var/cache/conftool/dbconfig/20240912-060131-ladsgroup.json
[06:02:25] <arnaudb>	 !log Starting es7 codfw failover from es2039 to es2038 - T374592
[06:02:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote es2038 to es7 primary and set section read-write T374592', diff saved to https://phabricator.wikimedia.org/P69012 and previous config saved to /var/cache/conftool/dbconfig/20240912-060308-arnaudb.json
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T374592', diff saved to https://phabricator.wikimedia.org/P69013 and previous config saved to /var/cache/conftool/dbconfig/20240912-060550-arnaudb.json
[06:05:54] <stashbot>	 T374592: Switchover es7 master (es2039 -> es2038) - https://phabricator.wikimedia.org/T374592
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:11:35] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10139747 (ABran-WMF) ES replication source in the path has been moved (T374592), all remaining hosts are depoolable
[06:16:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T371742)', diff saved to https://phabricator.wikimedia.org/P69014 and previous config saved to /var/cache/conftool/dbconfig/20240912-061639-ladsgroup.json
[06:16:41] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2202.codfw.wmnet with reason: Maintenance
[06:16:43] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[06:16:54] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2202.codfw.wmnet with reason: Maintenance
[06:19:13] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s3 T374421
[06:19:16] <stashbot>	 T374421: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T374421
[06:19:35] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T374421
[06:20:58] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:27:58] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1072311 (https://phabricator.wikimedia.org/T374386) (owner: Ladsgroup)
[06:33:19] <jayme>	 !log evacuating leadership for all partitions assigned to broker id 2004 on kafka-main-codfw - T363210
[06:33:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:23] <stashbot>	 T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210
[06:34:18] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[2004,2009].codfw.wmnet with reason: Hardware refresh
[06:34:33] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[2004,2009].codfw.wmnet with reason: Hardware refresh
[06:37:07] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:15] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:21] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1072313 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[06:44:00] <wikibugs>	 (CR) JMeybohm: [C:+2] Decom kafka-main2003 [puppet] - https://gerrit.wikimedia.org/r/1072219 (https://phabricator.wikimedia.org/T374542) (owner: JMeybohm)
[06:45:28] <moritzm>	 !log installing glibc bugfix updates from bookworm 12.7 point release
[06:48:43] <wikibugs>	 (PS1) JMeybohm: kafka-main: Replace kafka-main2004 with kafka-main2009 [puppet] - https://gerrit.wikimedia.org/r/1072441 (https://phabricator.wikimedia.org/T363210)
[06:55:22] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: provisioning db2229.codfw.wmnet - T373579
[06:55:26] <stashbot>	 T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
[06:55:36] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: provisioning db2229.codfw.wmnet - T373579
[06:55:39] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2229.codfw.wmnet with reason: provisioning db2229.codfw.wmnet - T373579
[06:55:52] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2229.codfw.wmnet with reason: provisioning db2229.codfw.wmnet - T373579
[06:56:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2129 in db2229 for T373579', diff saved to https://phabricator.wikimedia.org/P69015 and previous config saved to /var/cache/conftool/dbconfig/20240912-065641-arnaudb.json
[06:58:48] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2129.codfw.wmnet onto db2229.codfw.wmnet
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T0700).
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:02:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:03:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:04:14] <wikibugs>	 (CR) JMeybohm: [C:+2] kafka-main: Replace kafka-main2004 with kafka-main2009 [puppet] - https://gerrit.wikimedia.org/r/1072441 (https://phabricator.wikimedia.org/T363210) (owner: JMeybohm)
[07:07:57] <wikibugs>	 (PS1) Slyngshede: data.yaml: Offboarding sandeeps [puppet] - https://gerrit.wikimedia.org/r/1072443
[07:08:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:09:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw
[07:10:14] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2212.codfw.wmnet with reason: Maintenance
[07:10:28] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2212.codfw.wmnet with reason: Maintenance
[07:10:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T371742)', diff saved to https://phabricator.wikimedia.org/P69016 and previous config saved to /var/cache/conftool/dbconfig/20240912-071034-ladsgroup.json
[07:10:39] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[07:11:11] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1072443 (owner: Slyngshede)
[07:13:34] <wikibugs>	 (CR) Slyngshede: [C:+2] data.yaml: Offboarding sandeeps [puppet] - https://gerrit.wikimedia.org/r/1072443 (owner: Slyngshede)
[07:18:08] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.idm.logout Logging Sandeeps out of all services on: 2298 hosts
[07:18:43] <wikibugs>	 (PS1) JMeybohm: Replace kafka-main2004 with kafka-main2009 [deployment-charts] - https://gerrit.wikimedia.org/r/1072472 (https://phabricator.wikimedia.org/T363210)
[07:18:53] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Sandeeps out of all services on: 2298 hosts
[07:19:14] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[07:19:21] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[07:19:23] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[07:19:46] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[07:19:47] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[07:19:59] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[07:20:00] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[07:20:31] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[07:20:33] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[07:20:46] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[07:20:48] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[07:21:22] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[07:21:23] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[07:21:57] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[07:21:58] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:22:07] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware, serviceops: decommission kafka-main2004.codfw.wmnet - https://phabricator.wikimedia.org/T374594#10139799 (JMeybohm)
[07:22:11] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:22:12] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:22:22] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:22:37] <wikibugs>	 (PS1) Slyngshede: data.yaml: Offboarding MNadrofsky [puppet] - https://gerrit.wikimedia.org/r/1072478
[07:24:51] <wikibugs>	 (CR) Slyngshede: "User only appears as a name and is nowhere to be found in LDAP." [puppet] - https://gerrit.wikimedia.org/r/1072478 (owner: Slyngshede)
[07:26:21] <wikibugs>	 (PS1) JMeybohm: Decom kafka-main2004 [puppet] - https://gerrit.wikimedia.org/r/1072479 (https://phabricator.wikimedia.org/T374594)
[07:28:48] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw
[07:31:04] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Ack. He had access to some procurement ACL in Phab, I had removed that earlier the morning." [puppet] - https://gerrit.wikimedia.org/r/1072478 (owner: Slyngshede)
[07:34:47] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2129.codfw.wmnet onto db2229.codfw.wmnet
[07:34:49] <wikibugs>	 (CR) Slyngshede: [C:+2] data.yaml: Offboarding MNadrofsky [puppet] - https://gerrit.wikimedia.org/r/1072478 (owner: Slyngshede)
[07:36:51] <wikibugs>	 (PS1) Muehlenhoff: lists: Enable profile::auto_restarts::service for spamassassin [puppet] - https://gerrit.wikimedia.org/r/1072482 (https://phabricator.wikimedia.org/T135991)
[07:37:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 1%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69017 and previous config saved to /var/cache/conftool/dbconfig/20240912-073744-arnaudb.json
[07:38:56] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[07:39:00] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[07:46:39] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[07:46:42] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[07:52:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 2%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69018 and previous config saved to /var/cache/conftool/dbconfig/20240912-075250-arnaudb.json
[07:58:30] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[07:58:35] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:02:45] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1060812 (owner: Slyngshede)
[08:04:39] <wikibugs>	 (CR) Ebrahim: Enable the dark mode in Portal namespace (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[08:06:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T371742)', diff saved to https://phabricator.wikimedia.org/P69019 and previous config saved to /var/cache/conftool/dbconfig/20240912-080647-ladsgroup.json
[08:06:48] <wikibugs>	 (PS18) Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380)
[08:06:51] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[08:07:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 3%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69020 and previous config saved to /var/cache/conftool/dbconfig/20240912-080756-arnaudb.json
[08:09:50] <wikibugs>	 (PS19) Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380)
[08:15:47] <wikibugs>	 (PS1) Fabfur: hiera: continue haproxykafka tests on cp4037 [puppet] - https://gerrit.wikimedia.org/r/1072484 (https://phabricator.wikimedia.org/T370668)
[08:16:13] <wikibugs>	 (PS1) Gmodena: mw-page-content-change-enrich: fix kafka values. [deployment-charts] - https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210)
[08:21:43] <wikibugs>	 (PS1) Ebrahim: Make LiquidThreads related dark mode namespace exceptions explicit [mediawiki-config] - https://gerrit.wikimedia.org/r/1072487
[08:21:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P69021 and previous config saved to /var/cache/conftool/dbconfig/20240912-082154-ladsgroup.json
[08:23:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 4%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69022 and previous config saved to /var/cache/conftool/dbconfig/20240912-082301-arnaudb.json
[08:23:33] <wikibugs>	 (PS2) Ebrahim: Make LQT dark mode exceptions explicit [mediawiki-config] - https://gerrit.wikimedia.org/r/1072487
[08:25:25] <wikibugs>	 (PS6) Slyngshede: PermissionRequest validation. [software/bitu] - https://gerrit.wikimedia.org/r/1060812
[08:27:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:50] <wikibugs>	 (CR) Slyngshede: PermissionRequest validation. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1060812 (owner: Slyngshede)
[08:27:58] <wikibugs>	 (CR) Slyngshede: [C:+2] PermissionRequest validation. [software/bitu] - https://gerrit.wikimedia.org/r/1060812 (owner: Slyngshede)
[08:29:47] <wikibugs>	 (PS1) Ebrahim: Fix night mode excepted Wikidata namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/1072488
[08:30:15] <wikibugs>	 (Merged) jenkins-bot: PermissionRequest validation. [software/bitu] - https://gerrit.wikimedia.org/r/1060812 (owner: Slyngshede)
[08:31:40] <wikibugs>	 (PS20) Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380)
[08:33:22] <wikibugs>	 (PS3) Ebrahim: Make LQT dark mode exceptions explicit [mediawiki-config] - https://gerrit.wikimedia.org/r/1072487
[08:33:57] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Puppet frontends: Remove obsolete manage_puppet_ca_file code [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff)
[08:34:14] <wikibugs>	 (PS4) Ebrahim: Make LQT night mode exceptions explicit [mediawiki-config] - https://gerrit.wikimedia.org/r/1072487
[08:34:46] <elukey>	 Amir1, mutante - o/ re:puppetserver1001, we had a little outage while Amir merged (see https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=puppetserver1001&var-datasource=thanos&var-cluster=misc&from=1726091490618&to=1726107388975) - related to https://phabricator.wikimedia.org/T373527, I'll update the task
[08:35:41] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:35:53] <wikibugs>	 (03PS2) 10Slyngshede: Redesign menu. [software/bitu] - 10https://gerrit.wikimedia.org/r/1038608
[08:36:37] <wikibugs>	 (03PS2) 10JMeybohm: mw-page-content-change-enrich: fix kafka values. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210) (owner: 10Gmodena)
[08:37:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P69023 and previous config saved to /var/cache/conftool/dbconfig/20240912-083701-ladsgroup.json
[08:37:06] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] mw-page-content-change-enrich: fix kafka values. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210) (owner: 10Gmodena)
[08:38:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 5%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69024 and previous config saved to /var/cache/conftool/dbconfig/20240912-083807-arnaudb.json
[08:40:43] <wikibugs>	 (03PS3) 10Slyngshede: P:idp_test: Enable permission requests on testing. [puppet] - 10https://gerrit.wikimedia.org/r/1072107
[08:42:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:43:53] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:idp_test: Enable permission requests on testing. [puppet] - 10https://gerrit.wikimedia.org/r/1072107 (owner: 10Slyngshede)
[08:44:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:45:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] logging: add script to query for orphan traces [puppet] - 10https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) (owner: 10Filippo Giunchedi)
[08:45:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139873 (10phaultfinder)
[08:46:15] <wikibugs>	 (03PS3) 10JMeybohm: mw-page-content-change-enrich: fix kafka values. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210) (owner: 10Gmodena)
[08:46:15] <wikibugs>	 (03PS2) 10JMeybohm: Replace kafka-main2004 with kafka-main2009 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072472 (https://phabricator.wikimedia.org/T363210)
[08:47:02] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main2009.codfw.wmnet
[08:47:03] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2009.codfw.wmnet
[08:48:10] <wikibugs>	 (03PS3) 10Filippo Giunchedi: logging: add script to query for orphan traces [puppet] - 10https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411)
[08:48:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: logging: add script to query for orphan traces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) (owner: 10Filippo Giunchedi)
[08:48:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] logging: add script to query for orphan traces [puppet] - 10https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) (owner: 10Filippo Giunchedi)
[08:49:25] <jayme>	 !log restoring leadership for all partitions assigned to broker id 2004 on kafka-main-codfw - T363210
[08:49:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:28] <stashbot>	 T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210
[08:50:49] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:51:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10139881 (10elukey) Thanks! I created a diff from the settings dumped before your fix(es) and after, from the Redfish point of view.  ` Diff for BootModeSelect: before L...
[08:51:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: jaeger: fix typo ensure vs require [puppet] - 10https://gerrit.wikimedia.org/r/1072492
[08:51:58] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:52:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] jaeger: fix typo ensure vs require [puppet] - 10https://gerrit.wikimedia.org/r/1072492 (owner: 10Filippo Giunchedi)
[08:52:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T371742)', diff saved to https://phabricator.wikimedia.org/P69025 and previous config saved to /var/cache/conftool/dbconfig/20240912-085209-ladsgroup.json
[08:52:12] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2216.codfw.wmnet with reason: Maintenance
[08:52:13] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[08:52:20] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1071811 (https://phabricator.wikimedia.org/T374421) (owner: 10Gerrit maintenance bot)
[08:52:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[08:52:25] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2216.codfw.wmnet with reason: Maintenance
[08:52:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T371742)', diff saved to https://phabricator.wikimedia.org/P69026 and previous config saved to /var/cache/conftool/dbconfig/20240912-085232-ladsgroup.json
[08:53:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69027 and previous config saved to /var/cache/conftool/dbconfig/20240912-085312-arnaudb.json
[08:54:18] <arnaudb>	 !log Starting s3 codfw failover from db2209 to db2205 - T374421
[08:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:21] <stashbot>	 T374421: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T374421
[08:56:01] <wikibugs>	 (03PS1) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443)
[08:57:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:58:09] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072472 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm)
[08:58:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[08:58:48] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Replace kafka-main2004 with kafka-main2009 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072472 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm)
[08:58:52] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] mw-page-content-change-enrich: fix kafka values. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210) (owner: 10Gmodena)
[08:59:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2205 to s3 primary T374421', diff saved to https://phabricator.wikimedia.org/P69028 and previous config saved to /var/cache/conftool/dbconfig/20240912-085859-arnaudb.json
[08:59:41] <wikibugs>	 (03PS3) 10Elukey: sre.hosts.provision: improve Supermicro's bios settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372)
[08:59:47] <wikibugs>	 (03Merged) 10jenkins-bot: mw-page-content-change-enrich: fix kafka values. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210) (owner: 10Gmodena)
[09:00:25] <wikibugs>	 (03Merged) 10jenkins-bot: Replace kafka-main2004 with kafka-main2009 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072472 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm)
[09:00:56] <jinxer-wm>	 FIRING: RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:00:57] <wikibugs>	 (03CR) 10Elukey: sre.hosts.provision: improve Supermicro's bios settings (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[09:01:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T374421', diff saved to https://phabricator.wikimedia.org/P69029 and previous config saved to /var/cache/conftool/dbconfig/20240912-090157-arnaudb.json
[09:02:00] <stashbot>	 T374421: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T374421
[09:03:59] <wikibugs>	 (03PS2) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443)
[09:04:59] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10139909 (10ABran-WMF) >>! In T374523#10136865, @cmooney wrote: >>>! In T374523#10136856, @ABran-WMF wrote: >> I'll get to T374425 to get to T374421 and unblo...
[09:06:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[09:06:52] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:07:28] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:08:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 15%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69030 and previous config saved to /var/cache/conftool/dbconfig/20240912-090818-arnaudb.json
[09:12:12] <wikibugs>	 (03PS4) 10Elukey: sre.hosts.provision: improve Supermicro's bios settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372)
[09:12:46] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:13:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] alert: Failover from alert1001 to alert2002 [puppet] - 10https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[09:14:01] <wikibugs>	 (03PS3) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443)
[09:14:02] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10139921 (10cmooney) >>! In T374523#10139909, @ABran-WMF wrote: > We can add it to today's maintenance if you're up to it. Let me know so I can add it to the...
[09:14:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] alert: Resolve alerts DNS queries to alert2002 [dns] - 10https://gerrit.wikimedia.org/r/1072326 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[09:14:08] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kafka-main2004.codfw.wmnet
[09:14:33] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10139922 (10ABran-WMF) ack, adding it to the pile
[09:15:43] <wikibugs>	 (03CR) 10Effie Mouzeli: "wait for it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149 (owner: 10Effie Mouzeli)
[09:16:07] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:16:24] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1072479 (https://phabricator.wikimedia.org/T374594) (owner: 10JMeybohm)
[09:16:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10139923 (10elukey) It happened again, this time to puppetserver1001. Amir was in the middle of a puppet-merge and it got stuck. OOM killer acting on the puppetser...
[09:16:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[09:17:21] <wikibugs>	 (03CR) 10Elukey: Swap poolcounter2003 with poolcounter2005 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[09:19:05] <wikibugs>	 (03PS4) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443)
[09:19:44] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[09:21:22] <wikibugs>	 (03PS1) 10Effie Mouzeli: app.job: update to job 3.0.0 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072500
[09:22:41] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[09:22:56] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[09:23:19] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[09:23:19] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:23:19] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-main2004.codfw.wmnet
[09:23:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69031 and previous config saved to /var/cache/conftool/dbconfig/20240912-092324-arnaudb.json
[09:23:27] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Decom kafka-main2004 [puppet] - 10https://gerrit.wikimedia.org/r/1072479 (https://phabricator.wikimedia.org/T374594) (owner: 10JMeybohm)
[09:24:57] <wikibugs>	 (03PS1) 10Elukey: services: switch thumbor in codfw to poolcounter2005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015)
[09:25:27] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission kafka-main2004.codfw.wmnet - https://phabricator.wikimedia.org/T374594#10139958 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kafka-main2004.codfw.wmnet` - kafka-main2004.codf...
[09:26:05] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1072482 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:26:30] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission kafka-main2004.codfw.wmnet - https://phabricator.wikimedia.org/T374594#10139963 (10JMeybohm) a:05JMeybohm→03None
[09:27:29] <wikibugs>	 (03CR) 10Elukey: [C:04-1] "Needs to be tested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[09:28:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[09:28:59] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] lists: Enable profile::auto_restarts::service for spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1072482 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:31:16] <wikibugs>	 (03PS2) 10Elukey: services: switch thumbor in codfw to poolcounter2005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015)
[09:31:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:04-1] "Change LGTM, though there are more users of check_ntp_peer" [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh)
[09:31:34] <wikibugs>	 (03CR) 10Elukey: "Hugh: Lemme know what you think about it :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[09:32:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] mysql: replication lag monitoring threshold and severity change [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[09:32:37] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[09:32:59] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mysql: replication lag monitoring threshold and severity change [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[09:33:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[09:34:37] <wikibugs>	 (03Merged) 10jenkins-bot: mysql: replication lag monitoring threshold and severity change [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[09:35:18] <wikibugs>	 (03PS1) 10Effie Mouzeli: app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502
[09:35:27] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[09:36:38] <wikibugs>	 (03PS2) 10Effie Mouzeli: app.job: update to job 3.0.0 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072500
[09:36:46] <wikibugs>	 (03PS2) 10Effie Mouzeli: app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502
[09:37:02] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10140011 (10elukey) Updated https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1071553 and tested, it seems working. I kicked off a reimage of sretest2001, and I en...
[09:38:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] lists: Enable profile::auto_restarts::service for spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1072482 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:38:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[09:38:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69032 and previous config saved to /var/cache/conftool/dbconfig/20240912-093829-arnaudb.json
[09:38:57] <jinxer-wm>	 FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:39:07] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "Sounds good to me!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey)
[09:39:10] <vgutierrez>	 !incidents
[09:39:10] <sirenbot>	 5160 (UNACKED)  ProbeDown sre (10.2.2.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 eqiad)
[09:39:11] <sirenbot>	 5158 (RESOLVED)  NELHigh sre (thanos-rule tcp.address_unreachable)
[09:39:13] <vgutierrez>	 !ack 5160
[09:39:14] <sirenbot>	 5160 (ACKED)  ProbeDown sre (10.2.2.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 eqiad)
[09:39:27] * vgutierrez looking
[09:39:31] <jayme>	 that one again
[09:39:48] <claime>	 worker crunch in eqiad
[09:40:11] <claime>	 p99 at 5 minutes, awesome
[09:40:14] <vgutierrez>	 another crawler?
[09:40:18] <jayme>	 keeps on giving
[09:40:22] <jynus>	 high memcache errors, or is that just a consequence?
[09:40:37] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC error seems like a temporal glitch in the matrix" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[09:40:42] <claime>	 vgutierrez: no, I bet it's just taking ages to respond, there's like 4rps
[09:41:07] <vgutierrez>	 https://grafana.wikimedia.org/goto/qmh61Y6SR?orgId=1
[09:41:10] <vgutierrez>	 yeah.. no traffic at all
[09:41:31] <claime>	 weird, the executor isn't loaded though
[09:41:45] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[09:41:53] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:43:16] <vgutierrez>	 pybal is keeping unhealthy realservers pooled
[09:43:34] <claime>	 vgutierrez: there's only 2 replicas
[09:43:39] <claime>	 so yeah not surprising
[09:43:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:44:13] <vgutierrez>	 claime: well.. pybal has 210 realservers configured for wikifunctions
[09:44:26] <vgutierrez>	 k8s magic :)
[09:44:33] <claime>	 vgutierrez: yeah, k8s x)
[09:45:56] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:46:56] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:46:57] <wikibugs>	 (03PS1) 10Brouberol: cloudnative-pg-cluster: enable wal upload / backups to s3 by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281)
[09:47:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[09:47:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cloudnative-pg-cluster: enable wal upload / backups to s3 by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: 10Brouberol)
[09:47:49] <vgutierrez>	 I'm guessing that's not related at all to wikifunctions
[09:48:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Only run puppetserver spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1072505
[09:48:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Only run puppetserver spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1072505 (owner: 10Muehlenhoff)
[09:49:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T371742)', diff saved to https://phabricator.wikimedia.org/P69033 and previous config saved to /var/cache/conftool/dbconfig/20240912-094912-ladsgroup.json
[09:49:16] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[09:50:56] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:51:56] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:52:12] <jinxer-wm>	 FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[09:52:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[09:52:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:53:00] <wikibugs>	 (03PS2) 10Muehlenhoff: Only run puppetserver spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1072505
[09:53:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[09:53:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69034 and previous config saved to /var/cache/conftool/dbconfig/20240912-095335-arnaudb.json
[09:53:40] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1246.eqiad.wmnet with OS bookworm
[09:55:56] <wikibugs>	 (03PS1) 10Clément Goubert: mw-wikifunctions: Raise replicas to 6 per DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072508
[09:57:21] <topranks>	 Folks, Ben and I are running a test on cephosd1001 to check failover for the Anycast BGP service on it
[09:57:32] <topranks>	 we are not downtiming the host so we can observe what alerts trigger - please ignore
[09:57:45] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[09:57:59] <wikibugs>	 (CR) Clément Goubert: [C:+1] "LGTM, thanks" [cookbooks] - https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: Scott French)
[09:58:28] <wikibugs>	 (CR) JMeybohm: [C:+1] mw-wikifunctions: Raise replicas to 6 per DC [deployment-charts] - https://gerrit.wikimedia.org/r/1072508 (owner: Clément Goubert)
[09:58:48] <wikibugs>	 (CR) Hnowlan: [C:+1] mw-wikifunctions: Raise replicas to 6 per DC [deployment-charts] - https://gerrit.wikimedia.org/r/1072508 (owner: Clément Goubert)
[09:58:54] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw-wikifunctions: Raise replicas to 6 per DC [deployment-charts] - https://gerrit.wikimedia.org/r/1072508 (owner: Clément Goubert)
[09:59:04] <btullis>	 !log stopping envoyproxy on cephosd1001
[09:59:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:46] <wikibugs>	 (Merged) jenkins-bot: mw-wikifunctions: Raise replicas to 6 per DC [deployment-charts] - https://gerrit.wikimedia.org/r/1072508 (owner: Clément Goubert)
[09:59:53] <wikibugs>	 (CR) Hnowlan: [C:+1] aptrepo: ffmpeg bullseye component [puppet] - https://gerrit.wikimedia.org/r/1072282 (https://phabricator.wikimedia.org/T374502) (owner: Scott French)
[10:00:12] <claime>	 !log Increasing mw-wikifunctions replicas to 6
[10:00:23] <btullis>	 !log restarted envoyproxy on cephosd1001
[10:00:31] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[10:00:46] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[10:01:15] <wikibugs>	 (PS5) Brouberol: cloudnative-pg-cluster: enable wal upload / backups to s3 by default [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281)
[10:01:37] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[10:04:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P69035 and previous config saved to /var/cache/conftool/dbconfig/20240912-100419-ladsgroup.json
[10:04:38] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[10:04:47] <wikibugs>	 (PS2) EoghanGaffney: lists: Add ATS map for lists.wikimedia.org -> lists1004 [puppet] - https://gerrit.wikimedia.org/r/1072247
[10:07:18] <wikibugs>	 (CR) Hashar: [C:+1] logging: Fix WikimediaDebug "Verbose logging" option [mediawiki-config] - https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583) (owner: Bartosz Dziewoński)
[10:07:37] <wikibugs>	 (CR) Elukey: [C:+2] services: switch thumbor in codfw to poolcounter2005 [deployment-charts] - https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[10:07:38] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[10:07:51] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[10:08:04] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[10:08:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T367856)', diff saved to https://phabricator.wikimedia.org/P69036 and previous config saved to /var/cache/conftool/dbconfig/20240912-100811-ladsgroup.json
[10:08:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69037 and previous config saved to /var/cache/conftool/dbconfig/20240912-100841-arnaudb.json
[10:11:00] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync
[10:11:05] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[10:19:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P69038 and previous config saved to /var/cache/conftool/dbconfig/20240912-101927-ladsgroup.json
[10:20:58] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[10:22:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:25:28] <wikibugs>	 (PS3) Effie Mouzeli: app.job: update to job 3.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1072502
[10:25:35] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for et-0-0-31-100.ssw1-f1-eqiad.eqiad.wmnet - cmooney@cumin1002"
[10:25:40] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for et-0-0-31-100.ssw1-f1-eqiad.eqiad.wmnet - cmooney@cumin1002"
[10:25:40] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:25:43] <btullis>	 !log stopping envoyproxy on cephosd1001
[10:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:26] <wikibugs>	 (CR) CI reject: [V:-1] app.job: update to job 3.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1072502 (owner: Effie Mouzeli)
[10:31:17] <wikibugs>	 (PS4) Effie Mouzeli: app.job: update to job 3.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1072502
[10:32:13] <wikibugs>	 (CR) CI reject: [V:-1] app.job: update to job 3.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1072502 (owner: Effie Mouzeli)
[10:32:58] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache 2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.0.e.f.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa. on all recursors
[10:33:02] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.0.e.f.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa. on all recursors
[10:34:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T371742)', diff saved to https://phabricator.wikimedia.org/P69039 and previous config saved to /var/cache/conftool/dbconfig/20240912-103434-ladsgroup.json
[10:34:38] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[10:42:09] <wikibugs>	 (PS5) Effie Mouzeli: app.job: update to job 3.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1072502
[10:45:06] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] shellbox: bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1072246 (https://phabricator.wikimedia.org/T342213) (owner: Hnowlan)
[10:46:00] <wikibugs>	 (CR) Clément Goubert: [C:+1] shellbox: bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1072246 (https://phabricator.wikimedia.org/T342213) (owner: Hnowlan)
[10:49:04] <wikibugs>	 (CR) Clément Goubert: [C:+1] service: add basic configuration for mwdebug-next [puppet] - https://gerrit.wikimedia.org/r/1071933 (https://phabricator.wikimedia.org/T372604) (owner: Scott French)
[10:50:26] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] keystone: hooks: create security group rule for additional instance CIDRs [puppet] - https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) (owner: Arturo Borrero Gonzalez)
[10:54:31] <wikibugs>	 (CR) Hnowlan: [C:+2] shellbox: bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1072246 (https://phabricator.wikimedia.org/T342213) (owner: Hnowlan)
[10:55:56] <wikibugs>	 (Merged) jenkins-bot: shellbox: bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1072246 (https://phabricator.wikimedia.org/T342213) (owner: Hnowlan)
[10:57:51] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: keystone: eqiad1: fix instances_ip_ranges parameter [puppet] - https://gerrit.wikimedia.org/r/1072513 (https://phabricator.wikimedia.org/T374020)
[10:57:58] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072513 (https://phabricator.wikimedia.org/T374020) (owner: Arturo Borrero Gonzalez)
[10:59:15] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[10:59:41] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[11:00:44] <wikibugs>	 (Abandoned) Effie Mouzeli: app.job: update to job 2.0.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1072149 (owner: Effie Mouzeli)
[11:02:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:03:05] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[11:03:53] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[11:04:13] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: openstack: keystone: eqiad1: fix instances_ip_ranges parameter [puppet] - https://gerrit.wikimedia.org/r/1072513 (https://phabricator.wikimedia.org/T374020)
[11:04:19] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072513 (https://phabricator.wikimedia.org/T374020) (owner: Arturo Borrero Gonzalez)
[11:05:09] <wikibugs>	 ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10140285 (Vgutierrez) @RobH / @wiki_willy could we get this task prioritized on your side?
[11:05:31] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2093.codfw.wmnet
[11:05:33] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2093.codfw.wmnet
[11:06:05] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2029.codfw.wmnet
[11:06:07] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2029.codfw.wmnet
[11:07:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[11:07:40] <hashar>	 jouncebot: nowandnext
[11:07:41] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 52 minute(s)
[11:07:41] <jouncebot>	 In 0 hour(s) and 52 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1200)
[11:12:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[11:12:10] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] openstack: keystone: eqiad1: fix instances_ip_ranges parameter [puppet] - https://gerrit.wikimedia.org/r/1072513 (https://phabricator.wikimedia.org/T374020) (owner: Arturo Borrero Gonzalez)
[11:13:24] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[11:14:10] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[11:17:56] <wikibugs>	 SRE: Arelion transport to eqsin from codfw maxing out - Sept 12 2024 - https://phabricator.wikimedia.org/T374608 (cmooney) NEW p:Triage→High
[11:29:18] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:29:47] <wikibugs>	 (PS1) Jcrespo: mariadb: Increase buffer pool for db1171:s8, which is lagging [puppet] - https://gerrit.wikimedia.org/r/1072515 (https://phabricator.wikimedia.org/T374610)
[11:30:09] <wikibugs>	 SRE: Arelion transport to eqsin from codfw maxing out - Sept 12 2024 - https://phabricator.wikimedia.org/T374608#10140350 (cmooney) Nothing in Superset is jumping out at me.  From the netflow's I suspect it may be AS138341 / SHOPEE SINGAPORE PRIVATE LIMITED.  The spike in traffic starting yesterday afternoon...
[11:31:25] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1038608 (owner: Slyngshede)
[11:32:16] <logmsgbot>	 jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[11:33:48] <wikibugs>	 (PS2) Urbanecm: Babel: Set BabelUseCommunityConfiguration to false [mediawiki-config] - https://gerrit.wikimedia.org/r/1071916 (https://phabricator.wikimedia.org/T374611)
[11:33:51] <wikibugs>	 (PS2) Urbanecm: [beta] Babel: Use CommunityConfiguration in cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071917 (https://phabricator.wikimedia.org/T374611)
[11:33:57] <urbanecm>	 jouncebot: nowandnext
[11:33:58] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 26 minute(s)
[11:33:58] <jouncebot>	 In 0 hour(s) and 26 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1200)
[11:34:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071916 (https://phabricator.wikimedia.org/T374611) (owner: Urbanecm)
[11:35:01] <wikibugs>	 (Merged) jenkins-bot: Babel: Set BabelUseCommunityConfiguration to false [mediawiki-config] - https://gerrit.wikimedia.org/r/1071916 (https://phabricator.wikimedia.org/T374611) (owner: Urbanecm)
[11:35:28] <wikibugs>	 (PS3) Urbanecm: [beta] Babel: Use CommunityConfiguration in cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071917 (https://phabricator.wikimedia.org/T374611)
[11:35:34] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1071916|Babel: Set BabelUseCommunityConfiguration to false (T374611)]]
[11:35:38] <stashbot>	 T374611: Switch BabelUseCommunityConfiguration to true on Beta cluster - https://phabricator.wikimedia.org/T374611
[11:36:31] <wikibugs>	 (PS1) Urbanecm: [beta] Babel: Use CommunityConfiguration [mediawiki-config] - https://gerrit.wikimedia.org/r/1072517 (https://phabricator.wikimedia.org/T374611)
[11:38:05] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[11:38:55] <wikibugs>	 (CR) Jcrespo: [C:+2] mariadb: Increase buffer pool for db1171:s8, which is lagging [puppet] - https://gerrit.wikimedia.org/r/1072515 (https://phabricator.wikimedia.org/T374610) (owner: Jcrespo)
[11:42:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:59] <jynus>	 !log restarting db1171:s7 mysql process T374610
[11:47:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:03] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071916|Babel: Set BabelUseCommunityConfiguration to false (T374611)]] (duration: 11m 28s)
[11:47:03] <stashbot>	 T374610: db1171:s8 is having performance issues and lagging - https://phabricator.wikimedia.org/T374610
[11:47:07] <stashbot>	 T374611: Switch BabelUseCommunityConfiguration to true on Beta cluster - https://phabricator.wikimedia.org/T374611
[11:47:22] <wikibugs>	 (CR) Urbanecm: [C:+2] [beta] Babel: Use CommunityConfiguration in cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071917 (https://phabricator.wikimedia.org/T374611) (owner: Urbanecm)
[11:47:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:48:03] <wikibugs>	 (Merged) jenkins-bot: [beta] Babel: Use CommunityConfiguration in cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071917 (https://phabricator.wikimedia.org/T374611) (owner: Urbanecm)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1200)
[12:01:14] <wikibugs>	 (CR) Ladsgroup: "I'd say let's finish prod dbs (and decommission old ones) and then start working on dbproxies. So many in progress stuff is hard for me to" [puppet] - https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) (owner: Arnaudb)
[12:04:12] <wikibugs>	 (CR) Ladsgroup: "Now we have two candidate masters for s6 in codfw which would break switchmaster tool if you try to use it for s6. We should do something " [puppet] - https://gerrit.wikimedia.org/r/1072216 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[12:07:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[12:11:18] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2024.09.06 - 2024.09.27): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10140474 (BTullis) Listed the logical devices: ` btullis@an-worker1085:~$ sudo megacli -LDInfo -Lall -a0|grep Drive: Virtual Drive: 0 (Target Id: 0) Virtual Drive: 1...
[12:11:59] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1085.eqiad.wmnet
[12:12:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[12:13:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[12:15:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[12:18:10] <wikibugs>	 (CR) Slyngshede: [C:+2] Redesign menu. [software/bitu] - https://gerrit.wikimedia.org/r/1038608 (owner: Slyngshede)
[12:18:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[12:19:54] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1085.eqiad.wmnet
[12:20:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[12:20:29] <wikibugs>	 (CR) Elukey: "One nit and we are good to go!" [puppet] - https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: Muehlenhoff)
[12:20:30] <wikibugs>	 (Merged) jenkins-bot: Redesign menu. [software/bitu] - https://gerrit.wikimedia.org/r/1038608 (owner: Slyngshede)
[12:21:48] <wikibugs>	 (PS1) Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - https://gerrit.wikimedia.org/r/1072528
[12:22:18] <wikibugs>	 (PS2) Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - https://gerrit.wikimedia.org/r/1072528
[12:22:20] <wikibugs>	 (PS1) Btullis: Enable the performace CPU governor on Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/1072529 (https://phabricator.wikimedia.org/T365878)
[12:23:03] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2024.09.06 - 2024.09.27): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10140492 (BTullis) Open→Resolved
[12:23:35] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3964/co" [puppet] - https://gerrit.wikimedia.org/r/1072529 (https://phabricator.wikimedia.org/T365878) (owner: Btullis)
[12:24:40] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2390.codfw.wmnet
[12:25:18] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2390.codfw.wmnet
[12:25:23] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2394.codfw.wmnet
[12:25:38] <wikibugs>	 SRE: Arelion transport to eqsin from codfw maxing out - Sept 12 2024 - https://phabricator.wikimedia.org/T374608#10140496 (cmooney) We added a requestctl rule for IP range 147.136.175.0/24 which has brought usage back within acceptable levels and we no longer see dropped packets on the link.  {F57502814 widt...
[12:26:00] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2394.codfw.wmnet
[12:26:05] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2395.codfw.wmnet
[12:26:38] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2395.codfw.wmnet
[12:26:43] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2396.codfw.wmnet
[12:27:20] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2396.codfw.wmnet
[12:27:25] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2397.codfw.wmnet
[12:27:59] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2397.codfw.wmnet
[12:28:04] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2398.codfw.wmnet
[12:28:38] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2398.codfw.wmnet
[12:28:43] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2399.codfw.wmnet
[12:29:04] <urbanecm>	 jouncebot: nowandnext
[12:29:04] <jouncebot>	 For the next 0 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1200)
[12:29:04] <jouncebot>	 In 0 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1300)
[12:29:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host deploy2002.codfw.wmnet
[12:29:16] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2399.codfw.wmnet
[12:29:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10140498 (phaultfinder)
[12:29:46] <wikibugs>	 (PS2) Jforrester: On wikis with the Translate extension, allow thanking of translationreview log actions [mediawiki-config] - https://gerrit.wikimedia.org/r/1072315
[12:29:51] <wikibugs>	 (CR) Jforrester: "Aha, right, the log group is translationreview but the action is pagetranslation. Meh." [mediawiki-config] - https://gerrit.wikimedia.org/r/1072315 (owner: Jforrester)
[12:30:48] <wikibugs>	 (CR) Urbanecm: [C:+2] "beta only" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072517 (https://phabricator.wikimedia.org/T374611) (owner: Urbanecm)
[12:30:52] <wikibugs>	 (PS1) Alexandros Kosiaris: Rename mw239[0456789] to wikikube-worker21[07-13] [puppet] - https://gerrit.wikimedia.org/r/1072532 (https://phabricator.wikimedia.org/T372878)
[12:31:14] <akosiaris>	 !log depool mw239[0456789] for re-numbering, renaming and reimaging.
[12:31:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:30] <wikibugs>	 (Merged) jenkins-bot: [beta] Babel: Use CommunityConfiguration [mediawiki-config] - https://gerrit.wikimedia.org/r/1072517 (https://phabricator.wikimedia.org/T374611) (owner: Urbanecm)
[12:31:33] <wikibugs>	 (PS1) Muehlenhoff: Switch deploy2002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1072533 (https://phabricator.wikimedia.org/T349619)
[12:34:08] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch deploy2002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1072533 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:34:45] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] Rename mw239[0456789] to wikikube-worker21[07-13] [puppet] - https://gerrit.wikimedia.org/r/1072532 (https://phabricator.wikimedia.org/T372878) (owner: Alexandros Kosiaris)
[12:35:13] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: sync
[12:35:18] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv
[12:35:18] <icinga-wm>	 e - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:35:30] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv
[12:35:30] <icinga-wm>	 e - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:36:11] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2123.codfw.wmnet with reason: Maintenance
[12:36:24] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2123.codfw.wmnet with reason: Maintenance
[12:36:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T367781)', diff saved to https://phabricator.wikimedia.org/P69040 and previous config saved to /var/cache/conftool/dbconfig/20240912-123631-arnaudb.json
[12:36:32] <wikibugs>	 (PS1) Aklapper: Weekly Phabricator data for Tech News: Make output MediaWiki pastable [puppet] - https://gerrit.wikimedia.org/r/1072535 (https://phabricator.wikimedia.org/T373952)
[12:36:37] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[12:37:39] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] logging: Fix WikimediaDebug "Verbose logging" option [mediawiki-config] - https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583) (owner: Bartosz Dziewoński)
[12:38:23] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[12:40:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614 (cmooney) NEW p:Triage→Medium
[12:41:26] <elukey>	 !log thumbor codfw on wikikube moved to poolcounter2005 - T332015
[12:41:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:29] <stashbot>	 T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015
[12:42:40] <jinxer-wm>	 FIRING: [3x] KubernetesRsyslogDown: rsyslog on mw2394:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:44:17] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2390 to wikikube-worker2107
[12:44:18] <wikibugs>	 (PS1) Aklapper: Weekly Phabricator data for Tech News: Add Auto-Submitted [puppet] - https://gerrit.wikimedia.org/r/1072536
[12:44:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T367781)', diff saved to https://phabricator.wikimedia.org/P69041 and previous config saved to /var/cache/conftool/dbconfig/20240912-124421-arnaudb.json
[12:44:26] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[12:44:37] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[12:45:51] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2209.codfw.wmnet with reason: Maintenance
[12:45:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[12:46:01] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: codfw: 2 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10140598 (elukey) Resolved→Open Using this task to create another VM, poolcounter2006.
[12:46:05] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2209.codfw.wmnet with reason: Maintenance
[12:46:06] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2139.codfw.wmnet with reason: Maintenance
[12:46:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 2 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10140601 (10elukey)
[12:46:19] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2139.codfw.wmnet with reason: Maintenance
[12:46:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T370903)', diff saved to https://phabricator.wikimedia.org/P69042 and previous config saved to /var/cache/conftool/dbconfig/20240912-124626-ladsgroup.json
[12:46:30] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[12:47:40] <jinxer-wm>	 FIRING: [6x] KubernetesRsyslogDown: rsyslog on mw2394:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:47:48] <wikibugs>	 (03PS1) 10Elukey: Add configuration for poolcounter2006 [puppet] - 10https://gerrit.wikimedia.org/r/1072537 (https://phabricator.wikimedia.org/T374520)
[12:48:03] <cscott>	 MatmaRex, Lucas_WMDE i'd like to try the final bit of the MOS namespace, enwiki, today
[12:48:13] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: enable new vxlan-based subnet CIDR in cloudgw and keystone [puppet] - 10https://gerrit.wikimedia.org/r/1072538 (https://phabricator.wikimedia.org/T374020)
[12:48:29] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2390 to wikikube-worker2107 - akosiaris@cumin1002"
[12:48:37] <wikibugs>	 (03PS2) 10Elukey: Add configuration for poolcounter2006 [puppet] - 10https://gerrit.wikimedia.org/r/1072537 (https://phabricator.wikimedia.org/T374520)
[12:49:02] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2390 to wikikube-worker2107 - akosiaris@cumin1002"
[12:49:02] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:49:03] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2107
[12:49:16] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2107
[12:49:54] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2390 to wikikube-worker2107
[12:50:09] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: codfw1dev: enable new vxlan-based subnet CIDR in cloudgw and keystone [puppet] - 10https://gerrit.wikimedia.org/r/1072538 (https://phabricator.wikimedia.org/T374020)
[12:50:09] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2390 to wikikube-worker2107 completed: - mw2390 (**...
[12:50:15] <wikibugs>	 (03PS4) 10C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538)
[12:50:18] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2394 to wikikube-worker2108
[12:50:39] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[12:50:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM. Let's also create it in row B, like poolcounter2004." [puppet] - 10https://gerrit.wikimedia.org/r/1072537 (https://phabricator.wikimedia.org/T374520) (owner: 10Elukey)
[12:50:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host deploy2002.codfw.wmnet
[12:51:28] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072538 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez)
[12:51:37] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Add configuration for poolcounter2006 [puppet] - 10https://gerrit.wikimedia.org/r/1072537 (https://phabricator.wikimedia.org/T374520) (owner: 10Elukey)
[12:51:58] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[12:52:39] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: codfw1dev: enable new vxlan-based subnet CIDR in cloudgw and keystone [puppet] - 10https://gerrit.wikimedia.org/r/1072538 (https://phabricator.wikimedia.org/T374020)
[12:52:40] <wikibugs>	 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 13Patch-For-Review, 10Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#10140638 (10fnegri)
[12:53:03] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072538 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez)
[12:53:20] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host poolcounter2006.codfw.wmnet
[12:53:54] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2394 to wikikube-worker2108 - akosiaris@cumin1002"
[12:54:11] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2394 to wikikube-worker2108 - akosiaris@cumin1002"
[12:54:11] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:54:12] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.netbox
[12:54:12] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2108
[12:54:24] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2108
[12:55:02] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2394 to wikikube-worker2108
[12:55:13] <wikibugs>	 (03CR) 10C. Scott Ananian: "Post-deploy maintenance script commands listed at T363538#10140642" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian)
[12:55:14] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140645 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2394 to wikikube-worker2108 completed: - mw2394 (**...
[12:56:55] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, I trust your testing :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[12:57:09] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2395 to wikikube-worker2109
[12:57:34] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter2006.codfw.wmnet - elukey@cumin1002"
[12:58:15] <wikibugs>	 (03PS1) 10Arnaudb: bashrc: add 2 helper function [puppet] - 10https://gerrit.wikimedia.org/r/1072539
[12:58:17] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] bashrc: add 2 helper function [puppet] - 10https://gerrit.wikimedia.org/r/1072539 (owner: 10Arnaudb)
[12:58:44] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter2006.codfw.wmnet - elukey@cumin1002"
[12:58:45] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:58:45] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache poolcounter2006.codfw.wmnet on all recursors
[12:58:45] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[12:58:48] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) poolcounter2006.codfw.wmnet on all recursors
[12:59:11] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "Nice catch, thank you! I am going to leave those alone since they are used for the anycast checks and add a new custom command." [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh)
[12:59:15] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter2006.codfw.wmnet - elukey@cumin1002"
[12:59:20] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter2006.codfw.wmnet - elukey@cumin1002"
[12:59:23] <wikibugs>	 (03PS2) 10Ssingh: P:ntp and nagios_core: update check_ntp_peer to include stratum checks [puppet] - 10https://gerrit.wikimedia.org/r/1072276
[12:59:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P69043 and previous config saved to /var/cache/conftool/dbconfig/20240912-125928-arnaudb.json
[12:59:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:ntp and nagios_core: update check_ntp_peer to include stratum checks [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1300).
[13:00:05] <jouncebot>	 MatmaRex and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:17] <MatmaRex>	 hi. and hi cscott
[13:00:27] * cscott waves
[13:00:36] <cscott>	 i'm here ready to compete for the t-shirt
[13:00:57] <wikibugs>	 (03PS3) 10Ssingh: P:ntp and nagios_core: update check_ntp_peer to include stratum checks [puppet] - 10https://gerrit.wikimedia.org/r/1072276
[13:01:52] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3966/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh)
[13:02:03] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host poolcounter2006.codfw.wmnet with OS bookworm
[13:03:12] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10140660 (10cmooney)
[13:03:40] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[13:03:53] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[13:03:58] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2395 to wikikube-worker2109 - akosiaris@cumin1002"
[13:04:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T367781)', diff saved to https://phabricator.wikimedia.org/P69044 and previous config saved to /var/cache/conftool/dbconfig/20240912-130400-arnaudb.json
[13:04:02] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2395 to wikikube-worker2109 - akosiaris@cumin1002"
[13:04:03] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:04:03] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2109
[13:04:04] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[13:04:09] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2107.codfw.wmnet
[13:04:19] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140667 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by akosiaris@cumin1002 Renumbering for host wikikube-worker2107.codfw.wm...
[13:04:28] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2107.codfw.wmnet with OS bullseye
[13:04:38] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140668 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2107.codfw.wmnet with OS bull...
[13:04:38] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2107
[13:04:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T370903)', diff saved to https://phabricator.wikimedia.org/P69045 and previous config saved to /var/cache/conftool/dbconfig/20240912-130441-ladsgroup.json
[13:04:43] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[13:04:45] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[13:05:36] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2109
[13:06:03] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2108.codfw.wmnet
[13:06:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T367781)', diff saved to https://phabricator.wikimedia.org/P69046 and previous config saved to /var/cache/conftool/dbconfig/20240912-130608-arnaudb.json
[13:06:14] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2395 to wikikube-worker2109
[13:06:15] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140673 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by akosiaris@cumin1002 Renumbering for host wikikube-worker2108.codfw.wm...
[13:06:19] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2108.codfw.wmnet with OS bullseye
[13:06:27] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140678 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2395 to wikikube-worker2109 completed: - mw2395 (**...
[13:06:30] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140680 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2108.codfw.wmnet with OS bull...
[13:06:34] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2396 to wikikube-worker2110
[13:07:21] <MatmaRex>	 anyone deploying?
[13:07:43] <cscott>	 i was kinda hoping Lucas_WMDE was the deployer today, since he did the MOS namespace stuff on Tuesday
[13:08:02] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2107 - akosiaris@cumin1002"
[13:08:18] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2107 - akosiaris@cumin1002"
[13:08:18] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:08:18] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2107.codfw.wmnet 53.0.192.10.in-addr.arpa 3.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:08:21] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2107.codfw.wmnet 53.0.192.10.in-addr.arpa 3.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:08:22] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2107
[13:08:32] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2107
[13:08:32] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2107
[13:08:40] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[13:08:49] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2108
[13:08:50] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:08:58] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1122 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:09:01] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2015.codfw.wmnet
[13:09:36] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:09:38] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2109.codfw.wmnet
[13:09:46] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140703 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by akosiaris@cumin1002 Renumbering for host wikikube-worker2109.codfw.wm...
[13:09:51] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2109.codfw.wmnet with OS bullseye
[13:10:02] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140704 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2109.codfw.wmnet with OS bull...
[13:10:40] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2015.codfw.wmnet
[13:11:24] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:11:53] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2396 to wikikube-worker2110 - akosiaris@cumin1002"
[13:11:58] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2396 to wikikube-worker2110 - akosiaris@cumin1002"
[13:11:58] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:11:59] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2110
[13:12:20] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2110
[13:12:41] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[13:12:59] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2396 to wikikube-worker2110
[13:13:11] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140707 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2396 to wikikube-worker2110 completed: - mw2396 (**...
[13:13:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:13:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:14:35] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2397 to wikikube-worker2111
[13:14:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P69047 and previous config saved to /var/cache/conftool/dbconfig/20240912-131436-arnaudb.json
[13:14:58] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:14:58] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2108.codfw.wmnet 58.0.192.10.in-addr.arpa 8.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:15:00] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[13:15:02] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2108.codfw.wmnet 58.0.192.10.in-addr.arpa 8.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:15:02] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2108
[13:15:13] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2108
[13:15:13] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2108
[13:15:31] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2109
[13:16:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:17:05] <MatmaRex>	 i guess there's no backport window then
[13:18:09] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on poolcounter2006.codfw.wmnet with reason: host reimage
[13:18:21] <wikibugs>	 (03PS1) 10Muehlenhoff: puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443)
[13:18:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10140713 (10elukey) Something not really great: on sretest2001 one of the 10G interfaces has a link up, that I can confirm via BIOS, but not via Redfish.  {F57502926}  `...
[13:19:01] <hashar>	 MatmaRex: I am here
[13:19:02] <hashar>	 sorry
[13:19:03] <hashar>	 :)
[13:19:25] <cscott>	 hurray!
[13:19:32] <MatmaRex>	 oh! thanks
[13:19:45] <hashar>	 we should elevate you both to the rank of deployer
[13:19:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P69048 and previous config saved to /var/cache/conftool/dbconfig/20240912-131948-ladsgroup.json
[13:19:50] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:19:59] <cscott>	 so much responsibility :(
[13:20:14] <wikibugs>	 (03PS2) 10Ladsgroup: admin: Add echukwukere to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1072311 (https://phabricator.wikimedia.org/T374386)
[13:20:15] <hashar>	 it sounds scarier than it really is
[13:20:19] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] admin: Add echukwukere to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1072311 (https://phabricator.wikimedia.org/T374386) (owner: 10Ladsgroup)
[13:20:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583) (owner: 10Bartosz Dziewoński)
[13:20:37] <hashar>	 let's fix WikimediaDebug
[13:20:41] <James_F>	 <3
[13:20:42] <cscott>	 i think i might actually still technically have the permission bits, from when parsoid was a service, but i haven't deployed in like 5 years.
[13:20:45] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: continue haproxykafka tests on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1072484 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur)
[13:20:46] <hashar>	 sorry I did not spot the missing `$`
[13:20:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[13:20:47] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[13:20:54] <James_F>	 I ran into that last night and worried I was doing it wrongly.
[13:21:11] <wikibugs>	 (03Merged) 10jenkins-bot: logging: Fix WikimediaDebug "Verbose logging" option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583) (owner: 10Bartosz Dziewoński)
[13:21:15] <James_F>	 Thanks for fixing so promptly!
[13:21:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P69049 and previous config saved to /var/cache/conftool/dbconfig/20240912-132116-arnaudb.json
[13:21:18] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on poolcounter2006.codfw.wmnet with reason: host reimage
[13:21:31] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072330|logging: Fix WikimediaDebug "Verbose logging" option (T374583)]]
[13:21:35] <stashbot>	 T374583: Uncaught UnexpectedValueException: Udp transport "udp:///XWikimediaDebug" must specify a host - https://phabricator.wikimedia.org/T374583
[13:21:41] <hashar>	 cscott: sounds like you are all set then! :]   Next thing is using `scap backport 1071067`, knowing how to use the WikimediaDebug extension, and that is pretty much it
[13:21:51] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1079 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:21:53] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: continue haproxykafka tests on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1072484 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur)
[13:22:00] <wikibugs>	 (03PS3) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528
[13:22:07] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10140725 (10elukey) Nasty issue found for sretest2001: T365167#10140713  In the provision cookbook we loop through t...
[13:22:25] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[13:22:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:22:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619 (10cmooney) 03NEW p:05Triage→03Low
[13:22:51] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:22:59] <cscott>	 yeah, except for the "knowing what to do if things go wrong" part
[13:23:09] <hashar>	 that is where releng is useful :D
[13:23:14] <hashar>	 we need a panic button really
[13:23:31] <hashar>	 that screams loudly: RELENG COME ASSIST PLEASE
[13:23:31] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2397 to wikikube-worker2111 - akosiaris@cumin1002"
[13:23:39] <logmsgbot>	 !log hashar@deploy1003 matmarex, hashar: Backport for [[gerrit:1072330|logging: Fix WikimediaDebug "Verbose logging" option (T374583)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:23:58] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2397 to wikikube-worker2111 - akosiaris@cumin1002"
[13:23:58] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:23:58] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2111
[13:24:03] <hashar>	 tested, I can't see errors anymore
[13:24:05] <logmsgbot>	 !log hashar@deploy1003 matmarex, hashar: Continuing with sync
[13:24:09] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2111
[13:24:21] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2109 - akosiaris@cumin1002"
[13:24:25] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2109 - akosiaris@cumin1002"
[13:24:25] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:24:25] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2109.codfw.wmnet 59.0.192.10.in-addr.arpa 9.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:24:28] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2109.codfw.wmnet 59.0.192.10.in-addr.arpa 9.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:24:29] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2109
[13:24:31] <hashar>	 damn
[13:24:39] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2109
[13:24:39] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2109
[13:24:45] <hashar>	 that `!log` entry spam on irc is really verbose
[13:24:47] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2397 to wikikube-worker2111
[13:24:56] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140757 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2397 to wikikube-worker2111 completed: - mw2397 (**...
[13:25:02] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2107.codfw.wmnet with reason: host reimage
[13:25:14] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2110.codfw.wmnet
[13:25:25] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140759 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by akosiaris@cumin1002 Renumbering for host wikikube-worker2110.codfw.wm...
[13:25:29] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2110.codfw.wmnet with OS bullseye
[13:25:39] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2110
[13:25:41] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140760 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2110.codfw.wmnet with OS bull...
[13:25:53] <wikibugs>	 (03CR) 10Ladsgroup: "I confirmed it oob." [puppet] - 10https://gerrit.wikimedia.org/r/1072308 (https://phabricator.wikimedia.org/T374008) (owner: 10Ladsgroup)
[13:25:54] <wikibugs>	 (03PS2) 10Muehlenhoff: puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443)
[13:25:57] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2398 to wikikube-worker2112
[13:26:12] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[13:26:24] <hashar>	  /ignore logmsgbot
[13:26:26] <hashar>	  /ignore wikibugs
[13:26:29] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2111.codfw.wmnet
[13:26:41] <hashar>	 yeah that is quieter
[13:26:42] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2111.codfw.wmnet with OS bullseye
[13:26:43] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140764 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by akosiaris@cumin1002 Renumbering for host wikikube-worker2111.codfw.wm...
[13:26:51] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:26:52] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140765 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2111.codfw.wmnet with OS bull...
[13:28:10] <hashar>	 and my guess is the deployment is going to fail as the kubernetes workers are being reimaged
[13:28:12] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2107.codfw.wmnet with reason: host reimage
[13:28:15] <claime>	 hashar: nope
[13:28:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[13:28:26] <claime>	 workers are depooled
[13:28:32] <hashar>	 awesome! :-]
[13:28:37] <logmsgbot>	 !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072330|logging: Fix WikimediaDebug "Verbose logging" option (T374583)]] (duration: 07m 06s)
[13:28:40] <stashbot>	 T374583: Uncaught UnexpectedValueException: Udp transport "udp:///XWikimediaDebug" must specify a host - https://phabricator.wikimedia.org/T374583
[13:28:47] <hashar>	 claime: I am quite happy that got fixed!
[13:28:49] <claime>	 I've removed the SRE tag from the task for IP renumbering that was spamming -operations as well
[13:28:50] <wikibugs>	 (03PS1) 10Brouberol: cloudnative-pg-cluster: setup good defaults allowing a cluster to be restored [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072546 (https://phabricator.wikimedia.org/T372281)
[13:28:59] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
[13:29:06] <claime>	 so it should be a little quieter
[13:29:09] <hashar>	 thanks
[13:29:28] <hashar>	 for the `!log` spam, I am merely relaying a complaint I saw yesterday or earlier about the same
[13:29:33] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2110 - akosiaris@cumin1002"
[13:29:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T367781)', diff saved to https://phabricator.wikimedia.org/P69050 and previous config saved to /var/cache/conftool/dbconfig/20240912-132943-arnaudb.json
[13:29:47] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[13:29:47] <hashar>	 though that was for helm, which is in some cases very verbose
[13:29:48] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2110 - akosiaris@cumin1002"
[13:29:48] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:29:49] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2110.codfw.wmnet 60.0.192.10.in-addr.arpa 0.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:29:52] <hashar>	 anyway, let's do cscott's patch
[13:29:52] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2110.codfw.wmnet 60.0.192.10.in-addr.arpa 0.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:29:53] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2110
[13:30:02] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[13:30:04] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2110
[13:30:04] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2110
[13:30:09] <cscott>	 MatmaRex might be quicker?
[13:30:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian)
[13:30:15] <cscott>	 the cleanup titles on enwiki will take ~30min
[13:30:25] <claime>	 hashar: that is difficult to address as we do want the cookbooks to log 
[13:30:27] <hashar>	 ah
[13:30:32] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
[13:30:34] <hashar>	 cscott: MatmaRex well I can deploy both patches
[13:30:36] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2111
[13:30:57] <MatmaRex>	 my other thing is just some maintenance script runs, no related patch
[13:30:59] <hashar>	 hmm no, MatmaRex's one is just about running the commands, as I understand it
[13:31:03] <cscott>	 i don't know how long DeleteTag takes on commons
[13:31:10] <wikibugs>	 (03PS4) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528
[13:31:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:31:14] <wikibugs>	 (03Merged) 10jenkins-bot: Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian)
[13:31:22] <hashar>	 well I imagine you can do them in parallel from the mwmaint hosts?
[13:31:26] <MatmaRex>	 a few seconds, there's only a couple of these tags
[13:31:34] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1071067|Elevate pseudo-namespace MOS to a real namespace on enwiki (T363538)]]
[13:31:39] <stashbot>	 T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538
[13:31:44] <cscott>	 yeah, i figured that would be quick(er than mine)
[13:32:08] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2108.codfw.wmnet with reason: host reimage
[13:32:23] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:32:24] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2112
[13:32:36] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:32:45] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[13:32:50] <hashar>	 and I can't babysit it for half an hour because I have an appointment :/
[13:32:58] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2112
[13:33:34] <logmsgbot>	 !log hashar@deploy1003 cscott, hashar: Backport for [[gerrit:1071067|Elevate pseudo-namespace MOS to a real namespace on enwiki (T363538)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:33:36] <hashar>	 MatmaRex: should I run the first deleteTag?
[13:33:37] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2398 to wikikube-worker2112
[13:33:45] <logmsgbot>	 !log hashar@deploy1003 cscott, hashar: Continuing with sync
[13:33:55] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2399 to wikikube-worker2113
[13:33:56] <cscott>	 i can babysit, maybe I can sit in the screen?  a chance to see if i actually have the right permission bits i guess.
[13:34:25] <hashar>	 cscott: if you are in the deployer group, that should work
[13:34:26] <wikibugs>	 (03PS1) 10Slyngshede: Menu: Add menu entry for managers to view pending permission requests. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072547
[13:34:34] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2112.codfw.wmnet
[13:34:36] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:34:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#10140791 (10cmooney)
[13:34:48] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2112.codfw.wmnet with OS bullseye
[13:34:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P69051 and previous config saved to /var/cache/conftool/dbconfig/20240912-133456-ladsgroup.json
[13:35:11] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[13:35:15] <MatmaRex>	 hashar: you could if you want. this can wait for another day though if you're already doing the other thing
[13:35:31] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2108.codfw.wmnet with reason: host reimage
[13:35:43] <hashar>	 Script '/srv/mediawiki-staging/php-1.43.0-wmf.22/maintenance/DeleteTag' not found 
[13:35:44] <hashar>	 hehe
[13:35:48] <hashar>	 I guess it is from an extension?
[13:35:53] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2111 - akosiaris@cumin1002"
[13:35:59] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1122 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:36:06] <wikibugs>	 (03PS5) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443)
[13:36:13] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2111 - akosiaris@cumin1002"
[13:36:13] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:36:13] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2111.codfw.wmnet 61.0.192.10.in-addr.arpa 1.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:36:13] <MatmaRex>	 hmm, maybe mwscript is dumber than i thought
[13:36:17] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2111.codfw.wmnet 61.0.192.10.in-addr.arpa 1.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:36:17] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2111
[13:36:20] <cscott>	 deleteTag
[13:36:21] <wikibugs>	 (03CR) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[13:36:22] <cscott>	 initial lowercase
[13:36:24] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[13:36:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P69052 and previous config saved to /var/cache/conftool/dbconfig/20240912-133623-arnaudb.json
[13:36:25] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:36:26] <wikibugs>	 14SRE-Sprint-Week-Sustainability-March2023, 06Data-Persistence-SRE, 06DBA, 13Patch-For-Review, 10Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#10140794 (10ABran-WMF)
[13:36:27] <hashar>	 tried path '/srv/mediawiki-staging/php-1.43.0-wmf.22/maintenance/DeleteTag.php' and class '/srv/mediawiki-staging/php-1\43\0-wmf\22/maintenance/DeleteTag'
[13:36:28] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2111
[13:36:28] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2111
[13:36:35] <cscott>	 MatmaRex: deleteTag.php
[13:36:35] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[13:36:44] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2112
[13:36:50] <cscott>	 hashar: deleteTag.php (sorry)
[13:36:51] <hashar>	 ah it is run.php DeleteTag
[13:36:56] <MatmaRex>	 cscott: hashar: uppercase worked with run.php for me 🤷‍♂️
[13:37:11] <MatmaRex>	 yeah, try deleteTag.php  instead
[13:37:24] <cscott>	 probably mwscript and run.php diverge
[13:37:41] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host poolcounter2006.codfw.wmnet with OS bookworm
[13:37:41] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host poolcounter2006.codfw.wmnet
[13:38:14] <logmsgbot>	 !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071067|Elevate pseudo-namespace MOS to a real namespace on enwiki (T363538)]] (duration: 06m 39s)
[13:38:18] <stashbot>	 T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538
[13:38:19] <hashar>	 MatmaRex: https://phabricator.wikimedia.org/T373700#10140801
[13:38:39] <hashar>	 cscott: your namespace patch is deployed, so I guess you can run the namespace dupe script from the mwmaint server
[13:38:43] <MatmaRex>	 hashar: nice. that looks good
[13:39:02] <MatmaRex>	 thank you!
[13:39:06] <hashar>	 MatmaRex: \o/
[13:39:14] <wikibugs>	 (03PS2) 10Ladsgroup: admin: Add Philippe Saade to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1072308 (https://phabricator.wikimedia.org/T374008)
[13:39:18] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] admin: Add Philippe Saade to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1072308 (https://phabricator.wikimedia.org/T374008) (owner: 10Ladsgroup)
[13:39:34] <hashar>	 for the error log to group0, that will have to wait until next week I think. I am too busy today :/
[13:39:39] <cscott>	 yeah, i can ssh to mwmaint1002, is that the equivalent of saying I have the required permission bits?
[13:39:46] <hashar>	 maybe!
[13:40:01] <wikibugs>	 (03PS5) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528
[13:40:17] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2109.codfw.wmnet with reason: host reimage
[13:40:31] <hashar>	 $ ./modules/admin/data/matrix.py cscott
[13:40:31] <hashar>	 groups/users	cscott
[13:40:31] <hashar>	 deployment	OK
[13:40:38] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: httpbb: add article-models namespace tests for articlequality [puppet] - 10https://gerrit.wikimedia.org/r/1063213 (https://phabricator.wikimedia.org/T360455)
[13:40:48] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for EChukwukere-WMF - https://phabricator.wikimedia.org/T374386#10140797 (10Ladsgroup) 05Open→03Resolved a:05eoghan→03Ladsgroup https://ldap.toolforge.org/user/echukwukere
[13:41:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:41:12] <cscott>	 ok, i started a tmux, wish me luck :)
[13:41:22] <hashar>	 so you can start a screen, use `script T363538.log` to keep a log file of the output, and run the namespace dupes command in it
[13:41:29] <hashar>	 or tmux :D
[13:42:25] <hashar>	 also `!log` here the command you are running :]
[13:42:25] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[13:42:32] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2399 to wikikube-worker2113 - akosiaris@cumin1002"
[13:42:37] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2399 to wikikube-worker2113 - akosiaris@cumin1002"
[13:42:37] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:42:38] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2113
[13:42:55] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2109.codfw.wmnet with reason: host reimage
[13:42:58] <hashar>	 !log Afternoon backport deployments are completed. NamespaceDupe is being run on enwiki for T363538#10140642
[13:43:00] <cscott>	 !log mwscript namespaceDupes enwiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee ~/T363538-enwiki-namespaceDupes
[13:43:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:15] <wikibugs>	 (03PS6) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528
[13:43:54] <hashar>	 2>&1 !
[13:43:56] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2113
[13:44:12] <wikibugs>	 (03PS5) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[13:44:26] <cscott>	 good call
[13:44:34] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2399 to wikikube-worker2113
[13:44:39] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:44:39] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2112.codfw.wmnet 62.0.192.10.in-addr.arpa 2.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:44:41] <wikibugs>	 (03PS7) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528
[13:44:42] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2112.codfw.wmnet 62.0.192.10.in-addr.arpa 2.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:44:43] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2112
[13:44:54] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2112
[13:44:55] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2112
[13:45:48] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2110.codfw.wmnet with reason: host reimage
[13:46:03] <wikibugs>	 (03PS6) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[13:46:17] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2113.codfw.wmnet
[13:46:34] <logmsgbot>	 !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2113.codfw.wmnet
[13:47:11] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2113.codfw.wmnet
[13:47:18] <hashar>	 `script` is quite nice since it records the raw terminal data with timing
[13:47:25] <hashar>	 so you can literally replay the session :)
[13:47:27] <logmsgbot>	 !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2113.codfw.wmnet
[13:47:30] <hashar>	 but yeah that is hardcore
[13:47:35] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2107.codfw.wmnet with OS bullseye
[13:47:51] <wikibugs>	 (03PS3) 10Abijeet Patro: Enable message group subscription feature for Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386)
[13:48:42] <akosiaris>	 !log homer lsw1-a3-codfw* commit 'T372878'
[13:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:45] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[13:49:18] <cscott>	 !log namespaceDupes crashed on MOS:_OVERLINKING, re-running with --add-suffix
[13:49:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:24] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2110.codfw.wmnet with reason: host reimage
[13:49:26] <cscott>	 (we saw this on Tuesday on aswiki as well)
[13:49:32] <wikibugs>	 (03CR) 10Brouberol: rdf-streaming-updater: switch to calico-based network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[13:49:48] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] codfw1dev: enable new vxlan-based subnet CIDR in cloudgw and keystone [puppet] - 10https://gerrit.wikimedia.org/r/1072538 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez)
[13:50:01] <cscott>	 !log mwscript namespaceDupes enwiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-suffix=/T363538 --fix 2>&1 | tee ~/T363538-enwiki-namespaceDupes.take2
[13:50:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T370903)', diff saved to https://phabricator.wikimedia.org/P69054 and previous config saved to /var/cache/conftool/dbconfig/20240912-135003-ladsgroup.json
[13:50:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:08] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[13:50:39] <akosiaris>	 !log homer cr*codfw* commit 'T372878'
[13:50:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:07] <wikibugs>	 (03CR) 10Hnowlan: php8.1: add php8.1-uuid to php8.1-cli and cascade (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French)
[13:51:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T367781)', diff saved to https://phabricator.wikimedia.org/P69055 and previous config saved to /var/cache/conftool/dbconfig/20240912-135131-arnaudb.json
[13:51:33] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[13:51:35] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[13:51:36] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[13:51:38] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2113.codfw.wmnet on all recursors
[13:51:42] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2113.codfw.wmnet on all recursors
[13:51:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T367781)', diff saved to https://phabricator.wikimedia.org/P69056 and previous config saved to /var/cache/conftool/dbconfig/20240912-135142-arnaudb.json
[13:51:50] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2113.codfw.wmnet
[13:52:03] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[13:52:05] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2113.codfw.wmnet with OS bullseye
[13:52:09] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2111.codfw.wmnet with reason: host reimage
[13:52:12] <jinxer-wm>	 FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[13:52:15] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2113
[13:52:30] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[13:52:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T367781)', diff saved to https://phabricator.wikimedia.org/P69057 and previous config saved to /var/cache/conftool/dbconfig/20240912-135251-arnaudb.json
[13:52:53] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw,
[13:52:53] <icinga-wm>	 /IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:55:13] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde, ldap/nda for Philippe Saade - https://phabricator.wikimedia.org/T374008#10140864 (10Ladsgroup) 05Stalled→03Resolved https://ldap.toolforge.org/user/philippesaade
[13:55:45] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2111.codfw.wmnet with reason: host reimage
[13:55:49] <akosiaris>	 the lsw1-a3 alert is because the hosts are still in reimaging
[13:56:14] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2107.codfw.wmnet
[13:56:14] <akosiaris>	 I've pushed it as much as I could in doing things in parallel, and well, there are race conditions alright
[13:56:16] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2107.codfw.wmnet
[13:56:17] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2107.codfw.wmnet
[13:56:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#10140887 (10ssingh)
[13:56:22] <cscott>	 !log mwscript cleanupTitles enwiki 2>&1 | tee ~/T363538-enwiki-cleanupTitles
[13:56:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:30] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2108.codfw.wmnet with OS bullseye
[13:56:32] <wikibugs>	 (03PS7) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[13:57:44] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 315, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:57:45] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes mw2390 and mw2394-mw2399 - https://phabricator.wikimedia.org/T374622 (10akosiaris) 03NEW
[13:57:46] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#10140885 (10ssingh) Thanks for filing this task! This is indeed something we have discussed in the past but not formally so let's use this task to do th...
[13:59:01] <wikibugs>	 (03PS4) 10Ssingh: wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045
[14:00:05] <jouncebot>	 denisse and godog: I, the Bot under the Fountain, call upon thee, The Deployer, to do Alert hosts failover to alert2002 deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1400).
[14:00:34] <sukhe>	 oh fun. gl denisse and godog!
[14:00:45] <godog>	 hehe thank you sukhe 
[14:00:48] <denisse>	 sukhe: Thanks! 🤞
[14:00:54] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2112.codfw.wmnet with reason: host reimage
[14:01:09] <cscott>	 the cleanup titles maintenance script is still running on mwmaint1002, i assume that doesn't conflict with the alert hosts stuff?
[14:01:23] <godog>	 cscott: that's right yeah, thank you for the heads up tho
[14:01:33] <denisse>	 !log Enable the alert[12]002 hosts as alertmanagers
[14:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:45] <denisse>	 !log Enable the alert[12]002 hosts as alertmanagers - T372418
[14:01:47] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] alert: Enable the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1072318 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[14:01:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:48] <stashbot>	 T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418
[14:02:35] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2109.codfw.wmnet with OS bullseye
[14:02:44] <denisse>	 !log Disable meta-monitoring for the alert hosts - T372418
[14:02:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:07] <wikibugs>	 (03PS3) 10Scott French: php8.1: add php8.1-uuid to php8.1-cli and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602)
[14:04:19] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2112.codfw.wmnet with reason: host reimage
[14:04:25] <denisse>	 !log Make alert2002 the active host - T372418
[14:04:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:37] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] alert: Failover from alert1001 to alert2002 [puppet] - 10https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[14:04:44] <wikibugs>	 (03CR) 10Scott French: php8.1: add php8.1-uuid to php8.1-cli and cascade (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French)
[14:05:15] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:07:15] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439) (owner: 10Hamish)
[14:07:25] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 397, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:07:47] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] alert: Resolve alerts DNS queries to alert2002 [dns] - 10https://gerrit.wikimedia.org/r/1072326 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[14:07:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P69058 and previous config saved to /var/cache/conftool/dbconfig/20240912-140758-arnaudb.json
[14:09:12] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2113 - akosiaris@cumin1002"
[14:09:16] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2113 - akosiaris@cumin1002"
[14:09:16] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:09:16] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2113.codfw.wmnet 63.0.192.10.in-addr.arpa 3.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:09:17] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:09:19] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2113.codfw.wmnet 63.0.192.10.in-addr.arpa 3.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:09:20] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2113
[14:09:22] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2113
[14:09:23] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2113
[14:10:04] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2110.codfw.wmnet with OS bullseye
[14:13:36] <icinga-wm>	 PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[14:13:54] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:14:34] <jynus>	 I can access wikitech static, so potentially an ongoing maintenance fallout?
[14:15:06] <sukhe>	 wfm too
[14:15:19] <godog>	 yeah likely that
[14:15:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh)
[14:15:34] <jynus>	 the other alert looks more interesting
[14:15:54] <hnowlan>	 the other is almost certainly related to the reimaging of wikikube-worker2110.codfw.wmnet 
[14:16:01] <hnowlan>	 or wikikube-worker2113 more accurately 
[14:16:17] <wikibugs>	 (03PS5) 10Ssingh: wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045
[14:16:31] <jynus>	 ah, ok then, as that looked more "real"
[14:17:58] <sukhe>	 !log sudo cumin "A:dnsbox" "rm /etc/ntp.conf": cleaning up ntpd configuration file to avoid confusion with ntpsec.conf
[14:18:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:38] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2109.codfw.wmnet
[14:18:41] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2109.codfw.wmnet
[14:18:41] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2109.codfw.wmnet
[14:20:03] <wikibugs>	 (03PS1) 10Hnowlan: shellbox-video: add process-based readiness check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517)
[14:21:50] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Only run puppetserver spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1072505 (owner: 10Muehlenhoff)
[14:23:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P69059 and previous config saved to /var/cache/conftool/dbconfig/20240912-142306-arnaudb.json
[14:23:33] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2112.codfw.wmnet with OS bullseye
[14:23:56] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:25:34] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2113.codfw.wmnet with reason: host reimage
[14:27:56] <wikibugs>	 (03PS1) 10EoghanGaffney: lists: Switch from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1072551
[14:28:16] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2113.codfw.wmnet with reason: host reimage
[14:28:24] <wikibugs>	 (03CR) 10JHathaway: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[14:28:25] <wikibugs>	 (03PS1) 10Slyngshede: Allow users to see rejected requests for permissions. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072552
[14:29:10] <wikibugs>	 (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1072551 (owner: 10EoghanGaffney)
[14:29:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh)
[14:29:37] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] php8.1: add php8.1-uuid to php8.1-cli and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French)
[14:30:07] <cscott>	 !log cleanupTitles on enwiki complete (T363538)
[14:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:13] <stashbot>	 T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538
[14:30:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main2004.codfw.wmnet - https://phabricator.wikimedia.org/T374594#10141048 (10Jhancock.wm) a:03Jhancock.wm
[14:30:43] <wikibugs>	 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10141051 (10MBinder_WMF) Thanks, both. I tried to ssh into phab1004.eqiad.wmnet with bast1003.wikimedia.org in the config file, and got the same issue....
[14:30:59] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main2004.codfw.wmnet - https://phabricator.wikimedia.org/T374594#10141054 (10Jhancock.wm) 05Open→03Resolved
[14:31:16] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542#10141040 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[14:32:02] <hnowlan>	 win 26
[14:32:10] <hnowlan>	 oops
[14:33:25] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10141070 (10Jhancock.wm) part arriving today. will update when swapped.
[14:33:42] <cscott>	 cleanup titles took just over 30min to complete on enwiki, as Lucas_WMDE predicted
[14:33:58] <wikibugs>	 (03PS6) 10Ssingh: wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045
[14:34:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374422#10141072 (10Jhancock.wm) part should be arriving some time today. we can schedule down time for the server to get it swapped when ready.
[14:37:57] <Lucas_WMDE>	 cscott: sorry, I was at a department offsite all day today. glad to hear it worked out!
[14:38:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T367781)', diff saved to https://phabricator.wikimedia.org/P69060 and previous config saved to /var/cache/conftool/dbconfig/20240912-143813-arnaudb.json
[14:38:18] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[14:38:29] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:38:45] <wikibugs>	 (03PS6) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443)
[14:38:49] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:39:09] <wikibugs>	 (03CR) 10Bking: rdf-streaming-updater: switch to calico-based network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[14:39:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:41:37] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.provision: refactor _config_dell_pxe() [cookbooks] - 10https://gerrit.wikimedia.org/r/1072553 (https://phabricator.wikimedia.org/T365372)
[14:42:41] <cscott>	 Lucas_WMDE: no worries, thanks for writing such a clean postmortem for me to follow when I had to do it myself!
[14:42:49] * cscott was very nervous
[14:43:49] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes mw2390 and mw2394-mw2399 - https://phabricator.wikimedia.org/T374622#10141130 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[14:45:01] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:45:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10141140 (10phaultfinder)
[14:45:34] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[14:47:30] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10141141 (10elukey) >>! In T365372#10140725, @elukey wrote: > Nasty issue found for sretest2001: T365167#10140713 >...
[14:47:34] <wikibugs>	 (03CR) 10Nikerabbit: [C:03+1] Enable message group subscription feature for Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro)
[14:47:46] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2113.codfw.wmnet with OS bullseye
[14:48:11] <wikibugs>	 (03CR) 10Arnaudb: [C:04-1] "temporary -1 to reduce in progress" [puppet] - 10https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) (owner: 10Arnaudb)
[14:48:19] <wikibugs>	 (03PS3) 10Muehlenhoff: puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443)
[14:50:59] <icinga-wm>	 RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 24.11 ms
[14:52:33] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:52:35] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[14:52:43] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:52:50] <wikibugs>	 (03CR) 10JHathaway: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[14:55:29] <wikibugs>	 (03PS1) 10Hamish: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072556 (https://phabricator.wikimedia.org/T374621)
[14:55:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072227 (https://phabricator.wikimedia.org/T374484) (owner: 10Superzerocool)
[14:57:22] <wikibugs>	 (03PS7) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443)
[14:57:42] <wikibugs>	 (03CR) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[14:59:15] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:59:26] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:59:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[15:00:04] <jouncebot>	 dduvall and dancy: Your horoscope predicts another Train log triage deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1500).
[15:00:29] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[15:00:32] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[15:03:30] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 2 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10141209 (10elukey) 05Open→03Resolved
[15:03:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:04:12] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:04:23] <wikibugs>	 (03PS8) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443)
[15:06:04] <wikibugs>	 (03CR) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[15:09:04] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[15:11:03] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[15:11:07] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[15:11:34] <wikibugs>	 (03PS8) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[15:12:38] <wikibugs>	 (03PS4) 10Muehlenhoff: puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443)
[15:14:39] <wikibugs>	 (03PS5) 10Muehlenhoff: puppetmaster::frontend|backend: Read the puppet-merge server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443)
[15:15:18] <wikibugs>	 (03CR) 10Scott French: "Thank you both for the reviews!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French)
[15:15:21] <wikibugs>	 (03CR) 10Scott French: [C:03+2] sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French)
[15:16:23] <wikibugs>	 (03CR) 10Bking: [C:03+1] Enable the performace CPU governor on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072529 (https://phabricator.wikimedia.org/T365878) (owner: 10Btullis)
[15:17:06] <wikibugs>	 (03CR) 10Clément Goubert: [C:04-1] shellbox-video: add process-based readiness check (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[15:18:55] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[15:20:23] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php {fawikiquote,fawikisource,fawiktionary} --skip /home/zabe/text_table_cleanup/{fawikiquote,fawikisource,fawiktionary} --dump /home/zabe/text_table_dump/{fawikiquote,fawikisource,fawiktionary} --sleep 1 # T183490
[15:20:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:27] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[15:22:58] <wikibugs>	 (03PS3) 10Hashar: tox: only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485)
[15:23:39] <wikibugs>	 (03CR) 10Hashar: "Rebased to clear a trivial conflict with I75c226a7ed1b0dc91b488ed92242ba5c7da84cac" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar)
[15:24:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good. Alternatively do it first only for 2001 via a host Hiera entry." [puppet] - 10https://gerrit.wikimedia.org/r/1072551 (owner: 10EoghanGaffney)
[15:26:21] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns2006.wikimedia.org [reason: T373102 codfw maintenance]
[15:26:25] <stashbot>	 T373102: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102
[15:27:35] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10141318 (10jcrespo) I've stopped codfw media backups.  @cmooney Would it be possible to get preferencial time  on maintenance...
[15:27:41] <wikibugs>	 (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French)
[15:29:05] <claime>	 !log Depooling kubernetes2044.codfw.wmnet kubernetes2045.codfw.wmnet - T373102
[15:29:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:47] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2044.codfw.wmnet
[15:30:20] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2044.codfw.wmnet
[15:30:25] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2045.codfw.wmnet
[15:30:49] <wikibugs>	 (03CR) 10JHathaway: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[15:33:37] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2045.codfw.wmnet
[15:33:44] <wikibugs>	 (03PS9) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443)
[15:34:18] <wikibugs>	 (03PS2) 10Hnowlan: shellbox-video: add process-based readiness check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517)
[15:37:19] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[15:37:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2128 db2151 db2170 db2171 db2211 db2212 es2033 es2034 es2039 pc2014 db2209 - T370852', diff saved to https://phabricator.wikimedia.org/P69062 and previous config saved to /var/cache/conftool/dbconfig/20240912-153720-arnaudb.json
[15:37:24] <stashbot>	 T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852
[15:37:28] <wikibugs>	 (03PS3) 10Hnowlan: shellbox-video: add process-based readiness check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517)
[15:37:41] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:45:00 on 11 hosts with reason: network maintenance T373101
[15:37:45] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[15:38:00] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on 11 hosts with reason: network maintenance T373101
[15:38:41] <wikibugs>	 (03CR) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[15:38:55] <wikibugs>	 (03CR) 10Hnowlan: shellbox-video: add process-based readiness check (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[15:39:41] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] Fix night mode excepted Wikidata namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072488 (owner: 10Ebrahim)
[15:40:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool es2034 which was perceived master for es3 - T370852', diff saved to https://phabricator.wikimedia.org/P69063 and previous config saved to /var/cache/conftool/dbconfig/20240912-154008-arnaudb.json
[15:40:50] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10141367 (10ABran-WMF) @cmooney all nodes have been depooled
[15:42:39] <urandom>	 !log depooling ms-fe2012 moss-fe2002 & thanos-fe2003 — T373102 
[15:42:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:43] <stashbot>	 T373102: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102
[15:45:07] <wikibugs>	 (03PS1) 10Jdlrobson: Dark mode: Make LiquidThreads namespace explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562
[15:45:48] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] "I think  https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1072562 makes this a lot cleaner (am mostly worried that if Liquid" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072487 (owner: 10Ebrahim)
[15:46:02] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] "Thanks. I'd overlooked $wmgLiquidThreadsFrozen :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim)
[15:47:57] <logmsgbot>	 !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None
[15:48:00] <logmsgbot>	 !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None
[15:48:32] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] puppetserver: Pass the value of puppet_merge_server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[15:48:49] <logmsgbot>	 !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None
[15:48:52] <logmsgbot>	 !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None
[15:48:59] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:49:34] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] puppetmaster::frontend|backend: Read the puppet-merge server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff)
[15:49:53] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on 21 hosts with reason: Move server uplinks codfw racks D1
[15:49:54] <logmsgbot>	 !log cmooney@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 0:20:00 on 21 hosts with reason: Move server uplinks codfw racks D1
[15:50:01] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 21 hosts with reason: Move server uplinks codfw racks D1
[15:50:13] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:50:33] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:50:37] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 21 hosts with reason: Move server uplinks codfw racks D1
[15:50:48] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10141418 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=bb570977-8737-4373-95ac-3765685f6e5e) set by cmoon...
[15:50:53] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 21 hosts with reason: Move server uplinks codfw racks D2
[15:51:30] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 21 hosts with reason: Move server uplinks codfw racks D2
[15:51:38] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10141420 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5073d83c-c18b-41a0-aa78-a6da63b209f9) set by cmoon...
[15:51:49] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:53:45] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:56:12] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2111.codfw.wmnet
[15:56:14] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2111.codfw.wmnet
[15:56:15] <logmsgbot>	 !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2111.codfw.wmnet
[15:57:11] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "LGTM, let's see what happens" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[15:57:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[15:58:05] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:58:25] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:58:33] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2112.codfw.wmnet
[15:58:36] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2112.codfw.wmnet
[15:58:37] <logmsgbot>	 !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2112.codfw.wmnet
[15:59:48] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2113.codfw.wmnet
[15:59:51] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2113.codfw.wmnet
[15:59:52] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2113.codfw.wmnet
[16:00:05] <jouncebot>	 jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:08] <logmsgbot>	 !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None
[16:00:12] <logmsgbot>	 !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None
[16:00:50] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2110.codfw.wmnet
[16:00:52] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2110.codfw.wmnet
[16:00:54] <logmsgbot>	 !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2110.codfw.wmnet
[16:01:26] <topranks>	 !log move server uplinks in codfw rack D1 from asw-d1-codfw to lsw1-d1-codfw T373102
[16:01:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:30] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] shellbox-video: add process-based readiness check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[16:01:34] <stashbot>	 T373102: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102
[16:02:58] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh)
[16:03:06] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-video: add process-based readiness check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[16:04:27] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1072553 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[16:07:09] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns2006.wikimedia.org
[16:07:10] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2006.wikimedia.org
[16:07:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[16:08:21] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[16:08:34] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[16:09:22] <jynus>	 !log restart ms-backup200[12] after maintenance and upgrade
[16:09:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:23] <icinga-wm>	 PROBLEM - Host ms-backup2001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:12:01] <icinga-wm>	 RECOVERY - Host ms-backup2001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[16:12:12] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[16:12:15] <jynus>	 that's me, apparently my downtime didn't go through
[16:12:38] <jynus>	 nothing to see
[16:12:42] <jynus>	 it was a normal reboot
[16:12:56] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[16:13:34] <arnaudb>	 jynus: it's a side effect of the way icinga works, I have the same issue :D
[16:13:52] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[16:14:20] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[16:14:35] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh)
[16:17:46] <wikibugs>	 (03PS1) 10Fabfur: Revert "hiera: continue haproxykafka tests on cp4037" [puppet] - 10https://gerrit.wikimedia.org/r/1072565
[16:18:11] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2006.wikimedia.org [reason: [end] T373102 codfw maintenance]
[16:18:14] <stashbot>	 T373102: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102
[16:18:40] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] Revert "hiera: continue haproxykafka tests on cp4037" [puppet] - 10https://gerrit.wikimedia.org/r/1072565 (owner: 10Fabfur)
[16:19:05] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10141576 (10cmooney) Everything moved successfully, all ports up on the new switch and everything responding to ping again.
[16:19:16] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[16:19:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69064 and previous config saved to /var/cache/conftool/dbconfig/20240912-161916-arnaudb.json
[16:19:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69065 and previous config saved to /var/cache/conftool/dbconfig/20240912-161922-arnaudb.json
[16:19:24] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[16:19:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69066 and previous config saved to /var/cache/conftool/dbconfig/20240912-161927-arnaudb.json
[16:19:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69067 and previous config saved to /var/cache/conftool/dbconfig/20240912-161932-arnaudb.json
[16:19:34] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[16:19:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69068 and previous config saved to /var/cache/conftool/dbconfig/20240912-161937-arnaudb.json
[16:19:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69069 and previous config saved to /var/cache/conftool/dbconfig/20240912-161942-arnaudb.json
[16:19:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69070 and previous config saved to /var/cache/conftool/dbconfig/20240912-161947-arnaudb.json
[16:19:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69071 and previous config saved to /var/cache/conftool/dbconfig/20240912-161952-arnaudb.json
[16:19:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69072 and previous config saved to /var/cache/conftool/dbconfig/20240912-161957-arnaudb.json
[16:20:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69073 and previous config saved to /var/cache/conftool/dbconfig/20240912-162007-arnaudb.json
[16:21:18] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[16:23:00] <wikibugs>	 (03PS1) 10JHathaway: haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008)
[16:23:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway)
[16:23:51] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[16:24:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10141606 (10VRiley-WMF) 05Open→03Resolved @andrea.denisse This drive has been replaced Please let us know if there are any other issues with this unit.
[16:24:07] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[16:24:22] <wikibugs>	 (03PS2) 10JHathaway: haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008)
[16:25:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10141608 (10phaultfinder)
[16:26:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway)
[16:27:24] <wikibugs>	 (03CR) 10Tacsipacsi: "Could this depend on I10e1b24eba946452ba2e18bef67d8a8205fd2e24? At the moment, it doesn’t look like there will be any backward-incompatibl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro)
[16:28:29] <wikibugs>	 (03PS1) 10Ssingh: sre.dns.admin: use set_and_verify for confctl update [cookbooks] - 10https://gerrit.wikimedia.org/r/1072569
[16:28:59] <wikibugs>	 (03PS3) 10JHathaway: haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008)
[16:30:38] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10141626 (10cmooney) Will re-schedule for Tuesday Sep 17th
[16:30:46] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] sre.dns.admin: add cookbook for GeoDNS pool/depool (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[16:30:51] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10141621 (10cmooney) 05Open→03Resolved a:03cmooney
[16:31:03] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2044.codfw.wmnet
[16:31:05] <claime>	 !log Repooling kubernetes2044.codfw.wmnet kubernetes2045.codfw.wmnet - T373102
[16:31:05] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2044.codfw.wmnet
[16:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:09] <stashbot>	 T373102: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102
[16:31:10] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2045.codfw.wmnet
[16:31:12] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2045.codfw.wmnet
[16:32:15] <urandom>	 !log pooling ms-fe2012 moss-fe2002 & thanos-fe2003 — T373102 
[16:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:09] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway)
[16:33:59] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1072569 (owner: 10Ssingh)
[16:34:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69075 and previous config saved to /var/cache/conftool/dbconfig/20240912-163422-arnaudb.json
[16:34:26] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[16:34:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69076 and previous config saved to /var/cache/conftool/dbconfig/20240912-163427-arnaudb.json
[16:34:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69077 and previous config saved to /var/cache/conftool/dbconfig/20240912-163433-arnaudb.json
[16:34:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69078 and previous config saved to /var/cache/conftool/dbconfig/20240912-163438-arnaudb.json
[16:34:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69079 and previous config saved to /var/cache/conftool/dbconfig/20240912-163443-arnaudb.json
[16:34:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69080 and previous config saved to /var/cache/conftool/dbconfig/20240912-163448-arnaudb.json
[16:34:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69081 and previous config saved to /var/cache/conftool/dbconfig/20240912-163453-arnaudb.json
[16:34:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69082 and previous config saved to /var/cache/conftool/dbconfig/20240912-163458-arnaudb.json
[16:35:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69083 and previous config saved to /var/cache/conftool/dbconfig/20240912-163503-arnaudb.json
[16:35:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69084 and previous config saved to /var/cache/conftool/dbconfig/20240912-163513-arnaudb.json
[16:36:10] <wikibugs>	 (03PS4) 10Kgraessle: Enable AutoModerator on ukwik [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823)
[16:36:22] <topranks>	 !log disable now-unused ports on asw-d1-codfw and asw-d2-codfw T373102
[16:36:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:25] <stashbot>	 T373102: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102
[16:36:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374642 (10ops-monitoring-bot) 03NEW
[16:37:52] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563#10141677 (10akosiaris)
[16:37:55] <wikibugs>	 (03CR) 10Volans: [C:03+1] "conftool has been updated in production, no more blockers" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055882 (https://phabricator.wikimedia.org/T362893) (owner: 10Giuseppe Lavagetto)
[16:39:12] <wikibugs>	 (03PS1) 10Hnowlan: shellbox-video: use correct command in process check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072571 (https://phabricator.wikimedia.org/T373517)
[16:39:48] <icinga-wm>	 RECOVERY - dump of s8 in eqiad on backupmon1001 is OK: Last dump for s8 at eqiad (db1171) taken on 2024-09-12 09:09:43 (267 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[16:42:24] <wikibugs>	 (03PS5) 10Kgraessle: Enable AutoModerator on ukwik [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823)
[16:43:22] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 07Kubernetes: "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563#10141701 (10akosiaris) We are seeing this as well on WikiKube nodes.  ` 2024-09-12T15:20:55.176734+0...
[16:43:52] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10141703 (10wiki_willy) It looks like it'll be 3 drives minimum from the latest email today, and @Jclark-ctr - you c...
[16:44:20] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:44:38] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:45:12] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52630 bytes in 0.369 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:45:30] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:45:31] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] shellbox-video: use correct command in process check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072571 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[16:46:12] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] shellbox-video: use correct command in process check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072571 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[16:47:10] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-video: use correct command in process check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072571 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[16:48:42] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:49:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69085 and previous config saved to /var/cache/conftool/dbconfig/20240912-164927-arnaudb.json
[16:49:32] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[16:49:32] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.305 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:49:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69086 and previous config saved to /var/cache/conftool/dbconfig/20240912-164933-arnaudb.json
[16:49:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69087 and previous config saved to /var/cache/conftool/dbconfig/20240912-164938-arnaudb.json
[16:49:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69088 and previous config saved to /var/cache/conftool/dbconfig/20240912-164943-arnaudb.json
[16:49:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69089 and previous config saved to /var/cache/conftool/dbconfig/20240912-164948-arnaudb.json
[16:49:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69090 and previous config saved to /var/cache/conftool/dbconfig/20240912-164953-arnaudb.json
[16:49:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69091 and previous config saved to /var/cache/conftool/dbconfig/20240912-164959-arnaudb.json
[16:50:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69092 and previous config saved to /var/cache/conftool/dbconfig/20240912-165003-arnaudb.json
[16:50:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69093 and previous config saved to /var/cache/conftool/dbconfig/20240912-165009-arnaudb.json
[16:50:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69094 and previous config saved to /var/cache/conftool/dbconfig/20240912-165018-arnaudb.json
[16:51:44] <logmsgbot>	 !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@d045bb2]: (no justification provided)
[16:52:14] <logmsgbot>	 !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@d045bb2]: (no justification provided) (duration: 00m 30s)
[16:54:56] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[16:57:21] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: Add mos to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1072573 (https://phabricator.wikimedia.org/T374641)
[16:58:03] <wikibugs>	 (03PS3) 10Bking: flink-app: create a new label for selecting Calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195)
[16:58:46] <wikibugs>	 (03CR) 10Jsn.sherman: "Some style comments inline, but otherwise this looks good to go. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle)
[16:59:58] <wikibugs>	 (03PS6) 10Kgraessle: Enable AutoModerator on ukwik [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823)
[17:00:05] <jouncebot>	 bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1700).
[17:00:05] <jouncebot>	 swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1700).
[17:00:41] <wikibugs>	 (03CR) 10Kgraessle: Enable AutoModerator on ukwik (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle)
[17:02:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[17:02:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle)
[17:02:34] <swfrench-wmf>	 here o/
[17:03:05] <swfrench-wmf>	 will start work shortly - just getting some other items into a pause-able state
[17:04:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69096 and previous config saved to /var/cache/conftool/dbconfig/20240912-170433-arnaudb.json
[17:04:39] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[17:04:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69097 and previous config saved to /var/cache/conftool/dbconfig/20240912-170439-arnaudb.json
[17:04:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69098 and previous config saved to /var/cache/conftool/dbconfig/20240912-170444-arnaudb.json
[17:04:50] <wikibugs>	 (03PS4) 10Bking: flink-app: create a new label for selecting Calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195)
[17:04:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69099 and previous config saved to /var/cache/conftool/dbconfig/20240912-170449-arnaudb.json
[17:04:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69100 and previous config saved to /var/cache/conftool/dbconfig/20240912-170453-arnaudb.json
[17:05:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69101 and previous config saved to /var/cache/conftool/dbconfig/20240912-170459-arnaudb.json
[17:05:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69102 and previous config saved to /var/cache/conftool/dbconfig/20240912-170504-arnaudb.json
[17:05:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69103 and previous config saved to /var/cache/conftool/dbconfig/20240912-170509-arnaudb.json
[17:05:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69104 and previous config saved to /var/cache/conftool/dbconfig/20240912-170514-arnaudb.json
[17:05:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'pc2014 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69105 and previous config saved to /var/cache/conftool/dbconfig/20240912-170524-arnaudb.json
[17:05:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69106 and previous config saved to /var/cache/conftool/dbconfig/20240912-170524-arnaudb.json
[17:07:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[17:08:51] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-debug: add initial "next" release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071945 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French)
[17:09:27] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] sre.dns.admin: use set_and_verify for confctl update [cookbooks] - 10https://gerrit.wikimedia.org/r/1072569 (owner: 10Ssingh)
[17:09:50] <wikibugs>	 (03Merged) 10jenkins-bot: mw-debug: add initial "next" release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071945 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French)
[17:10:03] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#10141906 (10Eevans) We've assumed that 1013 & 1014 are both impacted by the same issue (or I have, at least), but that might not be a safe assumption; I'd like to try reimaging this one as well....
[17:10:25] <bd808>	 nothing for me to deploy in my window today
[17:11:27] <wikibugs>	 (03PS1) 10Fabfur: cache:haproxy: hardcode $schema field [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668)
[17:11:32] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[17:14:23] <jynus>	 !log restarting db1171:s8 mysql process T374610
[17:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:27] <stashbot>	 T374610: db1171:s8 is having performance issues and lagging - https://phabricator.wikimedia.org/T374610
[17:15:31] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[17:16:43] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[17:16:46] <wikibugs>	 (03PS3) 10BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550)
[17:16:49] <wikibugs>	 (03PS1) 10BCornwall: varnish: Consolidate analytics subroutines [puppet] - 10https://gerrit.wikimedia.org/r/1070688 (https://phabricator.wikimedia.org/T370200)
[17:20:08] <wikibugs>	 (03PS1) 10Scott French: Revert "mw-debug: add initial "next" release" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072578 (https://phabricator.wikimedia.org/T372604)
[17:21:40] <wikibugs>	 (03CR) 10Scott French: [C:03+2] Revert "mw-debug: add initial "next" release" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072578 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French)
[17:22:15] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#10141985 (10Eevans) ` eevans@aqs1014:~$ sudo lshw -class disk   *-disk:0                          description: ATA Disk        product: HFS1T9G32FEH-BA1        physical id: 0        bus info: scs...
[17:22:45] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mw-debug: add initial "next" release" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072578 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French)
[17:25:59] <swfrench-wmf>	 all done on my end
[17:26:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10142013 (10phaultfinder)
[17:27:53] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1014.eqiad.wmnet with reason: SSD device troubleshooting
[17:28:09] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1014.eqiad.wmnet with reason: SSD device troubleshooting
[17:28:20] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#10142025 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b21f43cd-a8ba-456c-8d9c-3c6cd91457e5) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with...
[17:28:49] <wikibugs>	 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10142033 (10wiki_willy) a:03RobH
[17:33:20] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:33:25] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10142054 (10cmooney) So once we have completed the move for D4 next Tuesday I have a (hopefully) small request.  Could the sretest2002 uplinks...
[17:33:58] <wikibugs>	 (03PS2) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[17:35:32] <wikibugs>	 (03PS3) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[17:35:39] <wikibugs>	 (03CR) 10BCornwall: "Manual patches are still fine, so long as the domain exists in markmonitor. I would also like for this functionality and a report exists a" [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[17:37:58] <wikibugs>	 (03Abandoned) 10Ebrahim: Make LQT night mode exceptions explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072487 (owner: 10Ebrahim)
[17:38:22] <wikibugs>	 (03CR) 10Ebrahim: "That looks fantastic indeed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072487 (owner: 10Ebrahim)
[17:40:30] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh)
[17:42:49] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#10142066 (10VRiley-WMF) @Eevans the drives that were not listed in the group have been replaced. Please let us know if anything else is needed.
[17:46:59] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1072586
[17:49:43] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:50:09] <Amir1>	 jouncebot: nowandnext
[17:50:10] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1700)
[17:50:10] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1700)
[17:50:10] <jouncebot>	 In 0 hour(s) and 9 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1800)
[17:50:36] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:52:15] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Add mos to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1072573 (https://phabricator.wikimedia.org/T374641) (owner: 10Gerrit maintenance bot)
[17:53:55] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on aqs1014 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 12, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T374652 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[17:54:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T374652 (10ops-monitoring-bot) 03NEW
[17:54:42] <jinxer-wm>	 FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[17:55:39] <wikibugs>	 (03CR) 10Jsn.sherman: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle)
[18:00:05] <jouncebot>	 dduvall and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1800).
[18:06:51] <wikibugs>	 (03PS5) 10Bking: flink-app: create a new label for selecting Calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195)
[18:06:59] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072587 (https://phabricator.wikimedia.org/T373641)
[18:07:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072587 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot)
[18:07:36] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:07:43] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072587 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot)
[18:13:36] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:18:04] <logmsgbot>	 !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.22  refs T373641
[18:18:08] <stashbot>	 T373641: 1.43.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T373641
[18:18:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10142127 (10Jhancock.wm) yeah can do
[18:20:23] <swfrench-wmf>	 !log ran systemctl reset-failed mediawiki_job_MachineVision_prioritize_uncategorized.service on mwmaint1002 to clear failed state for turned down job - T352884
[18:20:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:27] <stashbot>	 T352884: Undeploy and archive the MachineVision extension - https://phabricator.wikimedia.org/T352884
[18:23:56] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[18:29:33] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] vrts: swap replica to new host [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth)
[18:33:45] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Migration
[18:34:01] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Migration
[18:35:48] <wikibugs>	 (03PS1) 10BCornwall: wip: Remove rsa support [puppet] - 10https://gerrit.wikimedia.org/r/1072590
[18:46:52] <wikibugs>	 (03PS2) 10BCornwall: Remove RSA certificate support [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837)
[18:50:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10142189 (10phaultfinder)
[18:54:22] <wikibugs>	 (03PS7) 10Bartosz Dziewoński: Enable AutoModerator on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle)
[18:55:08] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "(Fixed typo)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle)
[18:57:13] <wikibugs>	 (03CR) 10Scott French: [V:03+2 C:03+2] "Thanks, Hugh!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French)
[19:03:49] <swfrench-wmf>	 !log rebuilt php8.1 production images to pick up php-uuid - T372602
[19:03:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:53] <stashbot>	 T372602: Prepare PHP 8.1 production images - https://phabricator.wikimedia.org/T372602
[19:03:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:09:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) (owner: 10Bartosz Dziewoński)
[19:09:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711 (owner: 10Bartosz Dziewoński)
[19:09:59] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728 (owner: 10Bartosz Dziewoński)
[19:10:51] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Drop PSON support - https://phabricator.wikimedia.org/T372667#10142209 (10jhathaway)
[19:11:29] <Amir1>	 jouncebot: nowandnext
[19:11:29] <jouncebot>	 For the next 0 hour(s) and 48 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1800)
[19:11:29] <jouncebot>	 In 0 hour(s) and 48 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T2000)
[19:12:36] <wikibugs>	 (03CR) 10Vgutierrez: "don't forget to remove wikiworkshop's RSA certificate as well" [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall)
[19:19:07] <icinga-wm>	 RECOVERY - Host gerrit1004 is UP: PING WARNING - Packet loss = 33%, RTA = 1.32 ms
[19:23:53] <wikibugs>	 (03PS9) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[19:24:30] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "lgtm, nitpick: update topic to say that it's not the active host yet" [puppet] - 10https://gerrit.wikimedia.org/r/1072551 (owner: 10EoghanGaffney)
[19:24:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[19:25:31] <icinga-wm>	 PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100%
[19:31:56] <wikibugs>	 (03PS10) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[19:33:57] <wikibugs>	 (03PS11) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[19:41:16] <wikibugs>	 (03PS1) 10JHathaway: puppet8: ensure kerberos keytab type is binary [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667)
[19:41:36] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway)
[19:47:34] <wikibugs>	 (03PS2) 10Fabfur: cache:haproxy: hardcode $schema field [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668)
[19:48:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:52:32] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur)
[19:55:10] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim)
[19:55:28] <wikibugs>	 (03PS3) 10BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837)
[19:55:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall)
[19:56:11] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur)
[19:57:47] <wikibugs>	 (03PS3) 10Fabfur: cache:haproxy: hardcode $schema field [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668)
[19:57:54] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on vrts2002.codfw.wmnet with reason: Migration
[19:57:59] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on vrts2002.codfw.wmnet with reason: Migration
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T2000).
[20:00:05] <jouncebot>	 Hamishcz, Superzerocool, katherine_g, MatmaRex, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:52] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur)
[20:00:54] <katherine_g>	 here
[20:00:59] <Hamishcz>	 yes
[20:01:11] <cjming>	 hi - i can deploy
[20:01:26] <cjming>	 i'll do Hamishcz's patch first
[20:02:02] <wikibugs>	 (03PS3) 10Hamish: u4cwiki: create case and case_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439)
[20:02:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439) (owner: 10Hamish)
[20:03:24] <wikibugs>	 (03Merged) 10jenkins-bot: u4cwiki: create case and case_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439) (owner: 10Hamish)
[20:03:30] <MatmaRex>	 (hi)
[20:03:35] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072204|u4cwiki: create case and case_talk namespaces (T374439)]]
[20:03:37] <Hamishcz>	 cjming, For test, as I cannot access u4cwiki, I cannot do a real test but the code is wonderful IMO
[20:03:40] <stashbot>	 T374439: Create case and case_talk namespaces in u4cwiki - https://phabricator.wikimedia.org/T374439
[20:03:59] <cjming>	 Hamishcz: np - i'll sync and run the namespace dupes script on it
[20:04:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 15.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:04:25] <wikibugs>	 (03PS4) 10BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837)
[20:04:30] <Hamishcz>	 sure thanks a lot
[20:04:38] <cjming>	 np!
[20:04:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall)
[20:05:30] <Jdlrobson>	 o/
[20:06:23] <logmsgbot>	 !log cjming@deploy1003 hamishz, cjming: Backport for [[gerrit:1072204|u4cwiki: create case and case_talk namespaces (T374439)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:06:38] <logmsgbot>	 !log cjming@deploy1003 hamishz, cjming: Continuing with sync
[20:07:09] <cjming>	 Superzerocool: are you around?
[20:09:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 15.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:10:37] <cjming>	 katherine_g: i'll do yours next
[20:10:45] <wikibugs>	 (03PS1) 10Bking: rdf-streaming-updater: trigger a savepoint before firewall changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072597
[20:10:54] <katherine_g>	 sounds good
[20:10:59] <wikibugs>	 (03PS8) 10Bartosz Dziewoński: Enable AutoModerator on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle)
[20:11:11] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072204|u4cwiki: create case and case_talk namespaces (T374439)]] (duration: 07m 36s)
[20:11:15] <stashbot>	 T374439: Create case and case_talk namespaces in u4cwiki - https://phabricator.wikimedia.org/T374439
[20:11:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle)
[20:12:16] <cjming>	 Hamishcz: your change should be live
[20:12:30] <wikibugs>	 (03Merged) 10jenkins-bot: Enable AutoModerator on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle)
[20:12:44] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1071223|Enable AutoModerator on ukwiki (T373823)]]
[20:12:47] <stashbot>	 T373823: Enable AutoModerator on ukwiki - https://phabricator.wikimedia.org/T373823
[20:12:53] <MatmaRex>	 if we get to my patches today, you can do them all at once and without testing on mwdebug – they are all only removing completely unused config variables, checked in codesearch
[20:13:22] <cjming>	 MatmaRex: sounds good and will do
[20:14:23] <katherine_g>	 i'm good to sync
[20:14:28] <Hamishcz>	 cjming, yeah, I confirmed its live status in the repo, but I cannot really see it myself; I will contact someone to confirm. However, it's an easy change, so basically no problem
[20:14:38] <logmsgbot>	 !log cjming@deploy1003 kgraessle, cjming: Backport for [[gerrit:1071223|Enable AutoModerator on ukwiki (T373823)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:14:40] <Hamishcz>	 appreciate
[20:14:43] <cjming>	 np!
[20:14:47] <logmsgbot>	 !log cjming@deploy1003 kgraessle, cjming: Continuing with sync
[20:14:54] <katherine_g>	 thanks! 
[20:14:58] <cjming>	 yw!
[20:15:17] <wikibugs>	 (03PS4) 10Bartosz Dziewoński: Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406)
[20:15:22] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Remove unused $wmgPoweredByMediaWikiIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711
[20:15:30] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Remove unused settings removed in T339959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728
[20:19:45] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071223|Enable AutoModerator on ukwiki (T373823)]] (duration: 07m 01s)
[20:19:49] <stashbot>	 T373823: Enable AutoModerator on ukwiki - https://phabricator.wikimedia.org/T373823
[20:19:53] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) (owner: 10Bartosz Dziewoński)
[20:20:39] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) (owner: 10Bartosz Dziewoński)
[20:20:51] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Remove unused $wmgPoweredByMediaWikiIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711
[20:22:21] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] Remove unused $wmgPoweredByMediaWikiIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711 (owner: 10Bartosz Dziewoński)
[20:23:04] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unused $wmgPoweredByMediaWikiIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711 (owner: 10Bartosz Dziewoński)
[20:23:04] <cjming>	 katherine_g: your patch should be live!
[20:23:20] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Remove unused settings removed in T339959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728
[20:23:34] <katherine_g>	 k looks good! 
[20:24:24] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] Remove unused settings removed in T339959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728 (owner: 10Bartosz Dziewoński)
[20:25:16] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unused settings removed in T339959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728 (owner: 10Bartosz Dziewoński)
[20:25:39] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1065299|Remove unused $wgAllowRequiringEmailForResets (T242406)]], [[gerrit:1071711|Remove unused $wmgPoweredByMediaWikiIcon]], [[gerrit:1071728|Remove unused settings removed in T339959]]
[20:25:44] <stashbot>	 T242406: Remove $wgAllowRequiringEmailForResets feature flag [small] - https://phabricator.wikimedia.org/T242406
[20:25:44] <stashbot>	 T339959: Reduce CentralAuth complexity by removing unused settings - https://phabricator.wikimedia.org/T339959
[20:28:22] <logmsgbot>	 !log cjming@deploy1003 matmarex, cjming: Backport for [[gerrit:1065299|Remove unused $wgAllowRequiringEmailForResets (T242406)]], [[gerrit:1071711|Remove unused $wmgPoweredByMediaWikiIcon]], [[gerrit:1071728|Remove unused settings removed in T339959]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:28:26] <logmsgbot>	 !log cjming@deploy1003 matmarex, cjming: Continuing with sync
[20:29:45] <wikibugs>	 (03PS2) 10Bking: rdf-streaming-updater: trigger a savepoint before firewall changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072597 (https://phabricator.wikimedia.org/T373195)
[20:30:10] <wikibugs>	 (03PS1) 10Ebrahim: Remove ProofreadPage exceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600
[20:30:20] <wikibugs>	 (03PS21) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380)
[20:31:09] <wikibugs>	 (03CR) 10Ebrahim: "Just FYI that the extension is getting fixed also." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim)
[20:32:05] <Superzerocool>	 oh God, I'm so late for deploy :(
[20:32:49] <cjming>	 Superzerocool: no worries! good timing actually - i can do yours here shortly
[20:32:58] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1065299|Remove unused $wgAllowRequiringEmailForResets (T242406)]], [[gerrit:1071711|Remove unused $wmgPoweredByMediaWikiIcon]], [[gerrit:1071728|Remove unused settings removed in T339959]] (duration: 07m 19s)
[20:33:03] <stashbot>	 T242406: Remove $wgAllowRequiringEmailForResets feature flag [small] - https://phabricator.wikimedia.org/T242406
[20:33:03] <stashbot>	 T339959: Reduce CentralAuth complexity by removing unused settings - https://phabricator.wikimedia.org/T339959
[20:33:08] <cjming>	 MatmaRex: all your patches should be live!
[20:33:17] <cjming>	 Jdlrobson: i'll do yours next
[20:33:20] <MatmaRex>	 cjming: thank you!
[20:33:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim)
[20:33:34] <cjming>	 yw!
[20:33:37] <Superzerocool>	 thanks @cjming :)
[20:34:06] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim)
[20:34:19] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1063763|Enable the dark mode in Portal namespace (T366380)]]
[20:34:24] <stashbot>	 T366380: Enable portal pages in night theme - https://phabricator.wikimedia.org/T366380
[20:34:25] <Jdlrobson>	 thanks cjming 
[20:34:49] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:35:03] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:36:20] <logmsgbot>	 !log cjming@deploy1003 ebrahim, cjming: Backport for [[gerrit:1063763|Enable the dark mode in Portal namespace (T366380)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:36:24] <cjming>	 Jdlrobson: ready to test - lmk if i should sync
[20:38:01] <Jdlrobson>	 cjming: on it
[20:38:09] <wikibugs>	 (03PS4) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[20:38:34] <Jdlrobson>	 LGTM cjming please sync!
[20:38:40] <logmsgbot>	 !log cjming@deploy1003 ebrahim, cjming: Continuing with sync
[20:38:44] <wikibugs>	 (03PS12) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[20:39:08] <wikibugs>	 (03PS2) 10Superzerocool: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072227 (https://phabricator.wikimedia.org/T374484)
[20:39:44] <wikibugs>	 (03CR) 10Jdlrobson: "Nice!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim)
[20:40:03] <wikibugs>	 (03PS6) 10Bking: flink-app: customize calico label selector  Calico network policies default to matching on "app" label and chartName  value,  but the flink-kubernetes-operator  sets the app label to chartName-release instead. Ref  https://lists.apache.org/thread/dont796lp84vfqnovolryw0y0470mqsv [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195)
[20:40:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking)
[20:40:17] <wikibugs>	 (03PS7) 10Bking: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195)
[20:43:16] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063763|Enable the dark mode in Portal namespace (T366380)]] (duration: 08m 57s)
[20:43:20] <stashbot>	 T366380: Enable portal pages in night theme - https://phabricator.wikimedia.org/T366380
[20:43:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072227 (https://phabricator.wikimedia.org/T374484) (owner: 10Superzerocool)
[20:43:35] <cjming>	 Jdlrobson: should be live!
[20:44:09] <wikibugs>	 (03Merged) 10jenkins-bot: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072227 (https://phabricator.wikimedia.org/T374484) (owner: 10Superzerocool)
[20:44:19] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072227|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374484)]]
[20:44:23] <stashbot>	 T374484: Lift IP cap for 190.12.102.194 and 200.5.117.98 on 2024-10-19 - https://phabricator.wikimedia.org/T374484
[20:44:33] <_Gerges>	 Hi cjming 
[20:44:41] <cjming>	 hi !
[20:45:08] <wikibugs>	 (03CR) 10Dzahn: "PS4 changes:" [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[20:45:10] <_Gerges>	 Could this patch be edited? I don't know how I got it wrong; I think it's due to autocomplete
[20:45:10] <_Gerges>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1067433
[20:45:58] <cjming>	 Gerges: can you send up a new patch and add it to the deployment cal? i should have time to do it
[20:46:16] <logmsgbot>	 !log cjming@deploy1003 cjming, superzerocool: Backport for [[gerrit:1072227|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374484)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:46:34] <cjming>	 Superzerocool: if your patch can be tested, it's up on mwdebug
[20:47:49] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:47:59] <_Gerges>	 He left 
[20:48:03] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:48:09] <cjming>	 yeah - that's what it looks like
[20:48:26] <MatmaRex>	 heh. could've been a misclick
[20:48:42] <cjming>	 anyway Gerges: happy to do one more config patch if you can get it out the door in the next few minutes
[20:48:51] <wikibugs>	 (03PS2) 10Ebrahim: Remove ProofreadPage exceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600
[20:49:01] <cjming>	 Superzerocool: shall i sync?
[20:49:05] <MatmaRex>	 cjming: i don't think there's any way to test an IP cap lift patch anyway, so it seems fine to go ahead, if the details look correct
[20:49:14] <cjming>	 lgtm - syncing!
[20:49:16] <logmsgbot>	 !log cjming@deploy1003 cjming, superzerocool: Continuing with sync
[20:49:18] <Superzerocool>	 hi cjming yep, there is no way to test my patch...
[20:49:20] <MatmaRex>	 (i mean, not until the date it happens)
[20:50:28] <wikibugs>	 (03PS13) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[20:51:16] <Jdlrobson>	 Thanks cjming  for the help today!
[20:51:24] <cjming>	 ur welcome!
[20:51:52] <Superzerocool>	 thanks cjming :))
[20:52:01] <cjming>	 yw!
[20:52:16] <cjming>	 should be live shortly
[20:52:39] <cjming>	 Gerges: should i wait for your new patch? otherwise i'll close the backport window
[20:53:10] <Hamishcz>	 _Gerges, I could do your patch, if you want me to 
[20:53:14] <_Gerges>	 Wait five minutes 
[20:53:25] <cjming>	 sure - np
[20:53:40] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072227|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374484)]] (duration: 09m 21s)
[20:53:40] <Hamishcz>	 _Gerges, thank you for the quick response :)
[20:53:44] <stashbot>	 T374484: Lift IP cap for 190.12.102.194 and 200.5.117.98 on 2024-10-19 - https://phabricator.wikimedia.org/T374484
[20:53:56] <wikibugs>	 (03CR) 10Pppery: "(this was written based off of Patch Set 3, some of this may have been done in Patch Set 4)" [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[20:54:56] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[20:55:08] <wikibugs>	 (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[20:55:36] <wikibugs>	 (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[20:56:38] <Superzerocool>	 thanks for your time and service cjming, see you around :)
[20:56:51] <cjming>	 you're welcome!
[20:57:37] <wikibugs>	 (03PS1) 10GergesShamon: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072604
[20:58:07] <_Gerges>	 Thanks for waiting for me
[20:58:18] <cjming>	 np!
[20:58:38] <Hamishcz>	 _Gerges, r u sure you are lifting the cap for a private IP address?
[20:58:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072604 (owner: 10GergesShamon)
[20:59:22] <cjming>	 oops - i already started scap backport for it
[20:59:31] <wikibugs>	 (03Merged) 10jenkins-bot: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072604 (owner: 10GergesShamon)
[20:59:31] <_Gerges>	 @Hamishcz: What do you mean?
[20:59:44] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072604|Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata]]
[21:00:19] <Hamishcz>	 _Gerges, https://meta.wikimedia.org/wiki/Mass_account_creation#Requesting_temporary_lift_of_IP_cap
[21:01:09] <cjming>	 whoops - should i revert?
[21:01:46] <logmsgbot>	 !log cjming@deploy1003 cjming, gergesshamon: Backport for [[gerrit:1072604|Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:01:48] <cjming>	 i guess the original patch might need to be reverted too?
[21:02:11] <cjming>	 i'm thinking i should not sync this patch - please lmk
[21:03:19] <_Gerges>	 Yes
[21:03:20] <Hamishcz>	 I recommend reverting the T373468-related code and fixing the redundant dbname in L59 (currently)
[21:03:20] <stashbot>	 T373468: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T373468
[21:04:28] <cjming>	 i'm not sure Gerges what you're saying yes to but i'm going to not sync and revert - that ok?
[21:05:23] <_Gerges>	 not sync and revert 
[21:05:39] <logmsgbot>	 !log cjming@deploy1003 Sync cancelled.
[21:06:04] <_Gerges>	 We need to get a public IP, not a private IP.
[21:06:06] <wikibugs>	 (03PS1) 10TrainBranchBot: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072606
[21:06:06] <wikibugs>	 (03CR) 10TrainBranchBot: "cjming@deploy1003 created a revert of this change as Idedb69e4ddf1fb25e1733406a209d12281b57249" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072604 (owner: 10GergesShamon)
[21:06:11] <wikibugs>	 (03PS1) 10BryanDavis: toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651)
[21:06:44] <cjming>	 so Gerges: should we also revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1067433 ?
[21:06:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072606 (owner: 10TrainBranchBot)
[21:06:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651) (owner: 10BryanDavis)
[21:07:23] <wikibugs>	 (03PS1) 10JHathaway: catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 (https://phabricator.wikimedia.org/T372667)
[21:07:30] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072606 (owner: 10TrainBranchBot)
[21:07:31] <_Gerges>	 Yes
[21:07:44] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072606|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]]
[21:08:03] <wikibugs>	 (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[21:08:32] <cjming>	 Gerges: ok i will revert the patch from 8/28 and then call it a day
[21:09:11] <_Gerges>	 Sorry for the delay
[21:09:13] <wikibugs>	 (03PS1) 10Clare Ming: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072609
[21:09:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072609 (owner: 10Clare Ming)
[21:09:41] <logmsgbot>	 !log cjming@deploy1003 cjming, trainbranchbot: Backport for [[gerrit:1072606|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:10:24] <Hamishcz>	 and Gerges, I recommend you contact the author of T373468 to request a new IP, as they have an activity on 17 Sep
[21:10:25] <stashbot>	 T373468: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T373468
[21:11:16] <cjming>	 Gerges: seems like i can't do a quick revert via gerrit UI (merge conflicts) so you'll have to send up a revert patch manually -- or if you think you'll get a non-private IP soon, add a new patch to override the IP
[21:11:43] <_Gerges>	 ok
[21:11:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway)
[21:11:51] <cjming>	 my bad - i didn't think to check documentation when we merged your first patch on 8/28
[21:11:57] <Hamishcz>	 quite a busy window lol cjming
[21:12:04] <cjming>	 lol - it's true
[21:12:12] <wikibugs>	 (03PS2) 10JHathaway: catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 (https://phabricator.wikimedia.org/T372667)
[21:12:12] <logmsgbot>	 !log cjming@deploy1003 cjming, trainbranchbot: Continuing with sync
[21:12:37] <cjming>	 thanks Hamishcz for catching that - gtk
[21:13:17] <Hamishcz>	 w/ pleasure :)
[21:13:20] <wikibugs>	 (03Abandoned) 10Clare Ming: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072609 (owner: 10Clare Ming)
[21:13:35] <tzatziki>	 !log removing 1 file for legal compliance
[21:13:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:56] <cjming>	 alrighty since we're over, i'm going to close the window for now
[21:14:06] <_Gerges>	 @cjming: Should I do a patch now to revert?
[21:14:18] <cjming>	 Gerges: sure! i'll wait if you want to send it up
[21:14:29] <cjming>	 shouldn't take long
[21:14:44] <cjming>	 and it looks like there's nothing scheduled after this window so we have time
[21:16:41] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072606|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]] (duration: 08m 57s)
[21:16:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway)
[21:17:31] <wikibugs>	 (03PS3) 10JHathaway: catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 (https://phabricator.wikimedia.org/T372667)
[21:18:14] <wikibugs>	 (03PS1) 10Scott French: sre.discovery: set timeout in raw dns.query.udp [cookbooks] - 10https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047)
[21:19:03] <wikibugs>	 (03PS2) 10BryanDavis: toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651)
[21:20:11] <wikibugs>	 (03PS1) 10GergesShamon: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072614
[21:20:26] <tzatziki>	 !log removing 1 file for legal compliance
[21:20:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072614 (owner: 10GergesShamon)
[21:20:52] <cjming>	 Gerges: lgtm - i'm going to deploy your revert
[21:20:59] <_Gerges>	 Ok
[21:21:07] <cjming>	 oops - what's up with CI?
[21:21:57] <Hamishcz>	 i guess it's space or empty line related lol
[21:22:58] <cjming>	 ya
[21:23:08] <cjming>	 Gerges: can you fix?
[21:23:12] <Hamishcz>	 exactly..
[21:23:23] <cjming>	 L37 - https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-test/2841/console
[21:23:32] <wikibugs>	 (03PS3) 10BryanDavis: toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651)
[21:23:52] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway)
[21:24:42] <_Gerges>	 What?
[21:24:58] <cjming>	 Gerges: just remove empty lines from your patch
[21:25:18] <wikibugs>	 (03PS1) 10JHathaway: bump version [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072616
[21:25:41] <Hamishcz>	 specifically, remove L37, then everything would be ok
[21:25:43] <cjming>	 er: i think you can leave one empty line - CI doesn't like 2 empty lines
[21:25:49] <wikibugs>	 (03PS2) 10GergesShamon: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072614
[21:26:06] <cjming>	 what Hamishcz said
[21:27:03] <Hamishcz>	 whatever..just leave it
[21:27:14] <Hamishcz>	 not the code is ok
[21:27:17] <Hamishcz>	 now*
[21:27:49] <cjming>	 does that make sense Gerges? otherwise i can do it real quick
[21:28:48] <wikibugs>	 (03PS1) 10JHathaway: pcc: bump version on workers [puppet] - 10https://gerrit.wikimedia.org/r/1072617 (https://phabricator.wikimedia.org/T372667)
[21:29:31] <wikibugs>	 (03CR) 10BryanDavis: [C:03+2] toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651) (owner: 10BryanDavis)
[21:29:33] <cjming>	 ah you did it - ok deploying
[21:30:00] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] bump version [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072616 (owner: 10JHathaway)
[21:30:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072614 (owner: 10GergesShamon)
[21:30:08] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] pcc: bump version on workers [puppet] - 10https://gerrit.wikimedia.org/r/1072617 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway)
[21:30:11] <wikibugs>	 (03PS8) 10Bking: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195)
[21:30:34] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651) (owner: 10BryanDavis)
[21:30:42] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072614 (owner: 10GergesShamon)
[21:30:52] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072614|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]]
[21:32:52] <logmsgbot>	 !log cjming@deploy1003 cjming, gergesshamon: Backport for [[gerrit:1072614|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:33:00] <logmsgbot>	 !log cjming@deploy1003 cjming, gergesshamon: Continuing with sync
[21:33:27] <_Gerges>	 Sorry I lost the connection (I'm using IRCCloud, so it didn't appear that I was the only one connected)
[21:34:12] <wikibugs>	 (03PS14) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[21:37:29] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072614|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]] (duration: 06m 37s)
[21:37:43] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply
[21:39:12] <cjming>	 Gerges: no worries - revert is live!
[21:39:31] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway)
[21:39:34] <cjming>	 now i will close the window (hopefully there's nothing else)
[21:40:46] <cjming>	 !log end of UTC late backport window
[21:40:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:40] <_Gerges>	 Thanks
[21:41:51] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10142588 (10thcipriani) >>! In T373969#10130953, @Ladsgroup wrote: > This we...
[21:41:52] <cjming>	 ur welcome
[21:42:20] <wikibugs>	 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142586 (10MBinder_WMF) > The correct combination is phab1004.eqiad.wmnet with bast1003.wikimedia.org.  Attached is my verbose output for that combinati...
[21:43:59] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply
[21:44:04] <wikibugs>	 (03PS1) 10Ebrahim: Remove metawiki dark mode exceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623
[21:44:42] <Amir1>	 jouncebot: nowandnext
[21:44:42] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 15 minute(s)
[21:44:42] <jouncebot>	 In 8 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240913T0600)
[21:44:55] <wikibugs>	 (03PS2) 10Ebrahim: Fix night mode excepted Wikidata namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072488
[21:44:57] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Fix night mode excepted Wikidata namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072488 (owner: 10Ebrahim)
[21:45:17] <wikibugs>	 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142602 (10MBinder_WMF) Ah, I think I might know the problem: my public key file specifies the name of a computer that preceded my current one. I was pr...
[21:45:35] <wikibugs>	 (03Merged) 10jenkins-bot: Fix night mode excepted Wikidata namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072488 (owner: 10Ebrahim)
[21:45:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072488 (owner: 10Ebrahim)
[21:45:45] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1072488|Fix night mode excepted Wikidata namespaces]]
[21:46:09] <wikibugs>	 (03CR) 10Ebrahim: "Probably this is a rude way to ask for this so pardon me beforehand... but is it possible to reconsider these meta namespaces dark mode ex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623 (owner: 10Ebrahim)
[21:46:40] <wikibugs>	 (03PS2) 10Ebrahim: Remove metawiki dark mode exceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623
[21:47:23] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Define MW_ENTRY_POINT in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072624 (https://phabricator.wikimedia.org/T374286)
[21:47:42] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup, ebrahim: Backport for [[gerrit:1072488|Fix night mode excepted Wikidata namespaces]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:48:22] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup, ebrahim: Continuing with sync
[21:51:33] <wikibugs>	 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142618 (10Dzahn) Do you have a file /Users/maxbinder/.ssh/id_ed25519.pub  ?  (not /Users/maxbinder/.ssh/id_ed25519 the private part, just the  public p...
[21:51:53] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway)
[21:52:40] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/toolhub: apply
[21:52:54] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072488|Fix night mode excepted Wikidata namespaces]] (duration: 07m 09s)
[21:53:38] <wikibugs>	 (03PS2) 10JHathaway: puppet8: ensure kerberos keytab type is binary [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667)
[21:53:41] <wikibugs>	 (03CR) 10Scott French: "If you think using an explicit value here is clearer than "borrowing" the timeout already configured on the stub resolver, I'm happy to re" [cookbooks] - 10https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French)
[21:53:50] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway)
[21:53:55] <wikibugs>	 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142620 (10Dzahn) Also, try this:  Move the config file out of the way temporarily.  Like `mv  /Users/maxbinder/.ssh/config /Users/maxbinder/` so it doe...
[21:54:00] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[21:54:42] <jinxer-wm>	 FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[21:56:37] <tzatziki>	 !log removing 6 files for legal compliance
[21:56:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:02] <wikibugs>	 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142642 (10Dzahn) You shouldn't have to create a keypair just because your computer name changed. The part at the end is mostly just a comment field.
[21:58:03] <wikibugs>	 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142629 (10MBinder_WMF) >>! In T374582#10142618, @Dzahn wrote: > Do you have a file /Users/maxbinder/.ssh/id_ed25519.pub  ?  (not /Users/maxbinder/.ssh/...
[22:00:12] <tzatziki>	 !log removing 1 file for legal compliance
[22:00:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:31] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142645 (MBinder_WMF) >>! In T374582#10142620, @Dzahn wrote: > Also, try this: >  > Move the config file out of the way temporarily. >  > Like `mv  /U...
[22:01:25] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142648 (Ladsgroup) Can you run the ssh command with -vvvvvvvv (the more "v"s, the better)?
[22:01:42] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142649 (Ladsgroup) but share the result privately, just in case.
[22:02:23] <wikibugs>	 (PS2) EoghanGaffney: lists: Switch from ferm to nftables on standby host [puppet] - https://gerrit.wikimedia.org/r/1072551
[22:02:39] <wikibugs>	 (PS3) EoghanGaffney: lists: Switch from ferm to nftables on standby host [puppet] - https://gerrit.wikimedia.org/r/1072551
[22:02:44] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142650 (Dzahn) Ok, do this:  `ssh-add /Users/maxbinder/.ssh/id_ed25519`  It should just ask for a passphrase. If you know it, enter it.  Now that key...
[22:02:54] <wikibugs>	 (PS3) Ebrahim: Remove ProofreadPage exceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/1072600
[22:03:21] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142651 (MBinder_WMF) >>! In T374582#10142649, @Ladsgroup wrote: > but share the result privately, just in case.  doc updated with many v's :)
[22:04:49] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142652 (MBinder_WMF) >>! In T374582#10142650, @Dzahn wrote: > Ok, do this: >  > `ssh-add /Users/maxbinder/.ssh/id_ed25519` >  > It should just ask fo...
[22:05:10] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10142655 (phaultfinder)
[22:05:55] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142656 (MBinder_WMF) I can successfully log on to phab1004.eqiad.wmnet as well. What was the issue?
[22:06:13] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142658 (Dzahn) >>! In T374582#10142645, @MBinder_WMF wrote: > Output still asked for passphrase:  So that is the thing, the passphrase is needed to d...
[22:08:02] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142661 (Ladsgroup) FWIW, `ssh -vvvvvvvvvvvvvvvvvvvv ~/.ssh/id_ed25519 mbinder@bast1003.wikimedia.org` broke because: ` ssh: Could not resolve hostnam...
[22:08:18] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142662 (Dzahn) >>! In T374582#10142656, @MBinder_WMF wrote: > I can successfully log on to phab1004.eqiad.wmnet as well. What was the issue?  The key...
[22:11:10] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[22:11:58] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[22:12:40] <wikibugs>	 (CR) EoghanGaffney: [C:+2] lists: Switch from ferm to nftables on standby host [puppet] - https://gerrit.wikimedia.org/r/1072551 (owner: EoghanGaffney)
[22:13:37] <wikibugs>	 (CR) Scott French: "Thanks for the review, Hugh!" [puppet] - https://gerrit.wikimedia.org/r/1072282 (https://phabricator.wikimedia.org/T374502) (owner: Scott French)
[22:14:02] <wikibugs>	 (PS3) JHathaway: puppet8: ensure kerberos keytab type is binary [puppet] - https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667)
[22:14:14] <wikibugs>	 (Abandoned) Scott French: aptrepo: ffmpeg bullseye component [puppet] - https://gerrit.wikimedia.org/r/1072282 (https://phabricator.wikimedia.org/T374502) (owner: Scott French)
[22:15:10] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142682 (MBinder_WMF) Hmm, I'm pretty sure I never had to enter a passphrase for each login in the past, but I might be mistaken. Also, when I was pro...
[22:16:14] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142711 (Dzahn) This probably has to do with getting your new computer. Likely you had this key added to some kind of key chain or app provided by the...
[22:18:09] <wikibugs>	 (CR) Jdlrobson: "I would suggest following https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests#Lifecycle_of_a_request and asking the community. If " [mediawiki-config] - https://gerrit.wikimedia.org/r/1072623 (owner: Ebrahim)
[22:18:26] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142727 (MBinder_WMF) >>! In T374582#10142711, @Dzahn wrote: > This probably has to do with getting your new computer. Likely you had this key added t...
[22:19:57] <icinga-wm>	 PROBLEM - Host lists2001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:21:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[22:21:19] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[22:21:29] <icinga-wm>	 RECOVERY - Host lists2001 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[22:23:09] <wikibugs>	 (CR) Cwhite: [C:+1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[22:23:56] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[22:24:35] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142747 (MBinder_WMF) Ah, you know what? I think it did, in fact, work. I just didn't realize that I needed to enter it twice, and assumed that the re...
[22:24:55] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[22:25:11] <wikibugs>	 (CR) Cwhite: [C:+2] ci: define statsd prometheus exporter mappings for zuul [puppet] - https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) (owner: Cwhite)
[22:26:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[22:28:34] <logmsgbot>	 !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@6e810dc] (releasing): (no justification provided)
[22:29:01] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142750 (Ladsgroup) Open→Resolved a:MBinder_WMF Don't know Mac but in Linux you can set it to "Remember the key passphrase" and it wou...
[22:30:17] <logmsgbot>	 !log dduvall@deploy1003 deploy aborted: (no justification provided) (duration: 01m 43s)
[22:31:50] <jinxer-wm>	 FIRING: ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:32:41] <icinga-wm>	 PROBLEM - jenkins_service_running on releases1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[22:32:42] <dduvall>	 ^ sorry, that's me. fixing
[22:33:53] <logmsgbot>	 !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@6e810dc] (releasing): (no justification provided)
[22:34:15] <wikibugs>	 (PS5) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor)
[22:34:21] <wikibugs>	 (CR) BCornwall: "Thanks for all the review!" [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor)
[22:34:21] <wikibugs>	 (CR) JHathaway: [V:+1] "PCC SUCCESS (CORE_DIFF 24 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" [puppet] - https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[22:34:28] <logmsgbot>	 !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@6e810dc] (releasing): (no justification provided) (duration: 00m 34s)
[22:34:41] <icinga-wm>	 RECOVERY - jenkins_service_running on releases1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[22:36:46] <wikibugs>	 (PS1) JHathaway: fix rich data keys [software/puppet-compiler] - https://gerrit.wikimedia.org/r/1072628 (https://phabricator.wikimedia.org/T372667)
[22:36:47] <wikibugs>	 (PS1) JHathaway: bump version [software/puppet-compiler] - https://gerrit.wikimedia.org/r/1072629
[22:36:50] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:41:03] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests: Vacation coverage for Katie Francis - https://phabricator.wikimedia.org/T374673#10142805 (Dzahn)
[22:42:05] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests: Vacation coverage for Katie Francis - https://phabricator.wikimedia.org/T374673#10142806 (Dzahn) Thanks for this @KFrancis    Tagged it and keeping it open right now to make people aware handling requests in this time.  Enjoy vacation!
[22:46:30] <wikibugs>	 (CR) JHathaway: [C:+2] fix rich data keys [software/puppet-compiler] - https://gerrit.wikimedia.org/r/1072628 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[22:46:38] <wikibugs>	 (CR) JHathaway: [C:+2] bump version [software/puppet-compiler] - https://gerrit.wikimedia.org/r/1072629 (owner: JHathaway)
[22:46:53] <wikibugs>	 (CR) JHathaway: [V:+2 C:+2] bump version [software/puppet-compiler] - https://gerrit.wikimedia.org/r/1072629 (owner: JHathaway)
[22:49:07] <wikibugs>	 (PS1) JHathaway: pcc: bump version on workers, again :( [puppet] - https://gerrit.wikimedia.org/r/1072631 (https://phabricator.wikimedia.org/T372667)
[22:50:41] <wikibugs>	 (CR) JHathaway: [C:+2] pcc: bump version on workers, again :( [puppet] - https://gerrit.wikimedia.org/r/1072631 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[22:59:55] <logmsgbot>	 !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@35befba] (releasing): (no justification provided)
[23:00:34] <logmsgbot>	 !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@35befba] (releasing): (no justification provided) (duration: 00m 38s)
[23:03:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:19:30] <wikibugs>	 (PS2) Bartosz Dziewoński: Improve $wgFooterIcons override, remove $wmgWikimediaIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/1071712
[23:20:52] <wikibugs>	 (PS1) Cwhite: zuul: set statsd-exporter to relay to local statsite instance [puppet] - https://gerrit.wikimedia.org/r/1072632 (https://phabricator.wikimedia.org/T233089)
[23:20:54] <wikibugs>	 (PS1) Cwhite: zuul: send stats to prometheus-statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/1072633 (https://phabricator.wikimedia.org/T233089)
[23:21:16] <wikibugs>	 (CR) CI reject: [V:-1] zuul: set statsd-exporter to relay to local statsite instance [puppet] - https://gerrit.wikimedia.org/r/1072632 (https://phabricator.wikimedia.org/T233089) (owner: Cwhite)
[23:23:35] <wikibugs>	 (PS2) Cwhite: zuul: set statsd-exporter to relay to local statsite instance [puppet] - https://gerrit.wikimedia.org/r/1072632 (https://phabricator.wikimedia.org/T233089)
[23:23:35] <wikibugs>	 (PS2) Cwhite: zuul: send stats to prometheus-statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/1072633 (https://phabricator.wikimedia.org/T233089)
[23:27:20] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#10142869 (Papaul) Open→Resolved a:Papaul Since we know now what the issue is and we have a fix I am closing this task but feel free to...
[23:27:41] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10142876 (Papaul) a:Papaul
[23:28:05] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10142877 (Papaul) a:Papaul
[23:31:34] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:36:34] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:37:31] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:37:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:38:32] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072635
[23:38:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072635 (owner: TrainBranchBot)
[23:46:34] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:47:30] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:48:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed