[00:01:59] (CR) Andrea Denisse: alert: Failover from alert1001 to alert2002 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse) [00:03:56] PROBLEM - SSH on puppetserver1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:07:25] FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on parse2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:09:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [00:11:02] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072316 (owner: TrainBranchBot) [00:12:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:12:27] (PS1) Dzahn: gerrit: add gerrit::proxy profile to insetup::gerrit role [puppet] - https://gerrit.wikimedia.org/r/1072323 (https://phabricator.wikimedia.org/T372804) [00:12:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T371742)', diff saved to https://phabricator.wikimedia.org/P68992 and previous config saved to /var/cache/conftool/dbconfig/20240912-001246-ladsgroup.json [00:12:51] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - 
https://phabricator.wikimedia.org/T371742 [00:14:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [00:14:12] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:16:39] (PS1) Ladsgroup: tables-catalog: Add more extension tables [puppet] - https://gerrit.wikimedia.org/r/1072324 (https://phabricator.wikimedia.org/T363581) [00:16:58] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:17:04] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:17:23] (CR) Scott French: "Thanks, Luca!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: Elukey) [00:17:28] (PS4) Andrea Denisse: alert: Failover from alert1001 to alert2002 [puppet] - https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) [00:20:29] (CR) Ladsgroup: [C:+2] tables-catalog: Add more extension tables [puppet] - https://gerrit.wikimedia.org/r/1072324 (https://phabricator.wikimedia.org/T363581) (owner: Ladsgroup) [00:20:41] (CR) Pppery: "Some of these sites would ideally point to more specific domains like mediawiki.wiki -> mediawiki.org rather than wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor) [00:22:07] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1072323/3960/gerrit2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1072323 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn) [00:22:25] FIRING: [3x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:24:12] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:27:04] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:27:04] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:27:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on 
puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:27:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P68993 and previous config saved to /var/cache/conftool/dbconfig/20240912-002753-ladsgroup.json [00:28:36] (CR) Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (9 comments) [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor) [00:29:19] (CR) Andrea Denisse: alert: Failover from alert1001 to alert2002 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse) [00:29:59] (CR) Dzahn: "agree with Pppery, see examples as inline comments. What's a good way to crowdsource the redirect mappings here? Can we upload manual patc" [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor) [00:30:07] puppetservers are not happy [00:30:14] and my puppet merge is stuck [00:31:10] PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:32:00] RECOVERY - SSH on puppetserver1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:32:12] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:32:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:33:52] (PS6) Andrea Denisse: alert: Failover from alert2002 to alert1002 [puppet] - https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) [00:35:12] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:35:46] PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:36:03] Amir1: normal puppet-merge is still on puppetmaster, not puppetserver [00:36:12] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:36:29] I did that [00:36:41] I had no problems there and it seems merged now [00:36:42] but it got stuck half way through sync [00:36:58] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:37:04] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:37:04] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[00:37:04] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:37:07] (CR) Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor) [00:37:10] well, this looks like it's fixing itself [00:37:12] PROBLEM - Hadoop NodeManager on an-worker1170 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:37:12] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:37:18] https://www.irccloud.com/pastebin/UUbLu1uz/ [00:37:20] puppetserver1001 was busy but not down [00:37:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:37:25] FIRING: [2x] SystemdUnitFailed: rsyslog-imfile-remedy.service on parse2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:37:29] the other alerts were all just about syncing to 1001 [00:37:45] (CR) Krinkle: logging: Fix local variables leaking into global scope (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716 (owner: Bartosz Dziewoński) [00:37:47] I pasted what happened [00:38:02] PROBLEM - Hadoop NodeManager on 
an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:38:11] if it's fixing itself, I have no complaints :D [00:38:46] RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:39:12] PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:39:16] (CR) Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor) [00:39:28] (PS1) Andrea Denisse: alert: Resolve alerts DNS queries to alert2002 [dns] - https://gerrit.wikimedia.org/r/1072326 (https://phabricator.wikimedia.org/T372418) [00:39:37] (PS2) Andrea Denisse: alert: Resolve alerts DNS queries to alert2002 [dns] - https://gerrit.wikimedia.org/r/1072326 (https://phabricator.wikimedia.org/T372418) [00:40:12] Amir1: puppet works on puppetserver1001, wfm [00:40:13] https://wikitech.wikimedia.org/wiki/Puppet#puppet-merge_fails_to_sync_on_secondary [00:40:14] now [00:40:26] funnily I wanted to try this, it didn't let me [00:41:02] PROBLEM - Hadoop NodeManager on an-worker1173 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:41:06] anywayyy [00:41:12] I call it a "day" [00:41:38] yea, sounds like night :P [00:42:25] FIRING: [6x] 
SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:43:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P68994 and previous config saved to /var/cache/conftool/dbconfig/20240912-004301-ladsgroup.json [00:43:12] RECOVERY - Hadoop NodeManager on an-worker1170 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:43:28] PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:44:02] RECOVERY - Hadoop NodeManager on an-worker1173 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:45:12] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:45:46] PROBLEM - Hadoop NodeManager on an-worker1143 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:47:25] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:51:10] RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:51:12] RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:51:58] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:53:46] RECOVERY - Hadoop NodeManager on an-worker1143 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:54:16] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 7447 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [00:57:05] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10139467 (Papaul) Some notes here: I checked console redirect, it was working for me and the issue i found was th... 
[00:58:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T371742)', diff saved to https://phabricator.wikimedia.org/P68995 and previous config saved to /var/cache/conftool/dbconfig/20240912-005808-ladsgroup.json [00:58:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:58:12] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [00:58:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:58:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T371742)', diff saved to https://phabricator.wikimedia.org/P68996 and previous config saved to /var/cache/conftool/dbconfig/20240912-005830-ladsgroup.json [00:59:55] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10139485 (Papaul) @elukey for console redirect to work on sretest2001 below are the settings. 
Thanks let me know if you have any questions {F57501123} {F57501126} [01:00:05] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139490 (phaultfinder) [01:01:28] RECOVERY - Hadoop NodeManager on an-worker1146 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:02:12] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:03:02] RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:18:43] (PS1) Bartosz Dziewoński: logging: Fix WikimediaDebug "Verbose logging" option [mediawiki-config] - https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583) [01:19:04] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:19:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583) (owner: Bartosz Dziewoński) [01:27:04] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:34:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139515 (phaultfinder) [01:39:04] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:39:46] PROBLEM - dump of s8 in eqiad on backupmon1001 is CRITICAL: dump for s8 at eqiad (db1171) taken more than a week ago: Most recent backup 2024-09-03 01:27:35 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:52:06] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:54:36] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:00:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T371742)', diff saved to https://phabricator.wikimedia.org/P68998 and previous config saved to /var/cache/conftool/dbconfig/20240912-020050-ladsgroup.json [02:00:58] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [02:15:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P68999 and previous config saved to /var/cache/conftool/dbconfig/20240912-021557-ladsgroup.json [02:19:36] RECOVERY - Hadoop NodeManager on an-worker1083 is OK: 
PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:20:58] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:23:30] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:23:30] PROBLEM - Hadoop NodeManager on an-worker1122 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:24:30] PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:29:08] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:29:12] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:29:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T374573#10139540 (phaultfinder) [02:30:10] PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:30:28] PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:30:32] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:31:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P69000 and previous config saved to /var/cache/conftool/dbconfig/20240912-023105-ladsgroup.json [02:31:08] RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:32:32] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:35:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [02:36:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:36:30] RECOVERY - Hadoop NodeManager on an-worker1122 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:36:30] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:36:46] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:37:14] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:38:30] RECOVERY - Hadoop NodeManager on an-worker1132 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:38:30] RECOVERY - Hadoop NodeManager on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:39:28] PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:40:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [02:42:12] FIRING: [3x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [02:42:14] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:43:10] RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:44:46] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:46:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T371742)', diff saved to https://phabricator.wikimedia.org/P69001 and previous config saved to 
/var/cache/conftool/dbconfig/20240912-024612-ladsgroup.json [02:46:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [02:46:16] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [02:46:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [02:46:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T371742)', diff saved to https://phabricator.wikimedia.org/P69002 and previous config saved to /var/cache/conftool/dbconfig/20240912-024635-ladsgroup.json [02:50:10] PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:53:10] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:54:10] PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:54:14] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:00:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:10] RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:02:25] FIRING: [2x] SystemdUnitFailed: rsyslog-imfile-remedy.service on parse2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:05:28] RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:14:12] RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:16:10] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:31:46] PROBLEM - Hadoop NodeManager on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:40:46] RECOVERY - Hadoop NodeManager on an-worker1129 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:42:31] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10139563 (10Papaul) >>! In T371434#10120335, @cmooney wrote: >>>! In T371434#10119784, @Papaul wrote: >> The diagram below will outline the cabling of... [03:51:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T371742)', diff saved to https://phabricator.wikimedia.org/P69003 and previous config saved to /var/cache/conftool/dbconfig/20240912-035105-ladsgroup.json [03:51:09] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [03:55:47] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587 (10Papaul) 03NEW [03:55:55] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10139580 (10Papaul) p:05Triage→03Medium [03:56:49] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10139581 (10Papaul) [04:06:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P69004 and previous config saved to /var/cache/conftool/dbconfig/20240912-040613-ladsgroup.json [04:08:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [04:13:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [04:13:37] (03CR) 10Pppery: "The log action for marking a page for translation is "pagetranslation", not "translationreview"." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072315 (owner: 10Jforrester) [04:15:17] (03CR) 10Pppery: "Like the idea, though - translation administration is a tedious, oftentimes underappreciated task that can easily get very backlogged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072315 (owner: 10Jforrester) [04:21:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P69005 and previous config saved to /var/cache/conftool/dbconfig/20240912-042121-ladsgroup.json [04:24:12] (03PS17) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) [04:24:45] (03CR) 10Ebrahim: "Added MediaWiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [04:27:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:27:34] (03CR) 10Ebrahim: "Added MediaWiki wiki back, please review the change if possible, thank you very much" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [04:36:29] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T371742)', diff saved to https://phabricator.wikimedia.org/P69006 and previous config saved to /var/cache/conftool/dbconfig/20240912-043628-ladsgroup.json [04:36:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance [04:36:34] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [04:36:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2188.codfw.wmnet with reason: Maintenance [04:37:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T371742)', diff saved to https://phabricator.wikimedia.org/P69007 and previous config saved to /var/cache/conftool/dbconfig/20240912-043701-ladsgroup.json [04:40:04] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:40:52] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 217, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:43:14] PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:51:58] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [04:55:14] RECOVERY - Hadoop NodeManager on an-worker1174 
is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:59:04] (03CR) 10Ebrahim: Enable the dark mode in Portal namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [05:23:49] (03CR) 10Arnaudb: [C:03+2] mariadb: productionize db2229 [puppet] - 10https://gerrit.wikimedia.org/r/1072216 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [05:28:42] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10139647 (10ABran-WMF) Thanks for the dig! Indeed hardware error was misleading, will reimage the server and will let you know soon. [05:31:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T371742)', diff saved to https://phabricator.wikimedia.org/P69008 and previous config saved to /var/cache/conftool/dbconfig/20240912-053116-ladsgroup.json [05:31:24] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [05:44:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm [05:46:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P69009 and previous config saved to /var/cache/conftool/dbconfig/20240912-054624-ladsgroup.json [05:49:35] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10139703 (10ABran-WMF) a:05VRiley-WMF→03ABran-WMF [05:50:46] RECOVERY - SSH on db1246 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:52:12] FIRING: [4x] SwiftObjectCountSiteDisparity: 
MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [05:54:52] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2038 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1072337 (https://phabricator.wikimedia.org/T374592) [05:58:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es7 T374592 [05:58:25] T374592: Switchover es7 master (es2039 -> es2038) - https://phabricator.wikimedia.org/T374592 [05:58:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T374592 [05:59:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set es2038 with weight 0 T374592', diff saved to https://phabricator.wikimedia.org/P69010 and previous config saved to /var/cache/conftool/dbconfig/20240912-055903-arnaudb.json [06:00:19] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote es2038 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1072337 (https://phabricator.wikimedia.org/T374592) (owner: 10Gerrit maintenance bot) [06:01:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P69011 and previous config saved to /var/cache/conftool/dbconfig/20240912-060131-ladsgroup.json [06:02:25] !log Starting es7 codfw failover from es2039 to es2038 - T374592 [06:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote es2038 to es7 primary and set section read-write T374592', diff saved to https://phabricator.wikimedia.org/P69012 and previous config saved to /var/cache/conftool/dbconfig/20240912-060308-arnaudb.json [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - 
https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:05:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T374592', diff saved to https://phabricator.wikimedia.org/P69013 and previous config saved to /var/cache/conftool/dbconfig/20240912-060550-arnaudb.json [06:05:54] T374592: Switchover es7 master (es2039 -> es2038) - https://phabricator.wikimedia.org/T374592 [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:11:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10139747 (10ABran-WMF) ES replication source in the path has been moved (T374592), all remaining hosts are depoolable [06:16:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T371742)', diff saved to https://phabricator.wikimedia.org/P69014 and previous config saved to /var/cache/conftool/dbconfig/20240912-061639-ladsgroup.json [06:16:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2202.codfw.wmnet with reason: Maintenance [06:16:43] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [06:16:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2202.codfw.wmnet with reason: Maintenance [06:19:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 
1:00:00 on 25 hosts with reason: Primary switchover s3 T374421 [06:19:16] T374421: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T374421 [06:19:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T374421 [06:20:58] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:27:58] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1072311 (https://phabricator.wikimedia.org/T374386) (owner: 10Ladsgroup) [06:33:19] !log evacuating leadership for all partitions assigned to broker id 2004 on kafka-main-codfw - T363210 [06:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:23] T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210 [06:34:18] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[2004,2009].codfw.wmnet with reason: Hardware refresh [06:34:33] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[2004,2009].codfw.wmnet with reason: Hardware refresh [06:37:07] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:37:15] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:37:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1072313 
(https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [06:44:00] (03CR) 10JMeybohm: [C:03+2] Decom kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/1072219 (https://phabricator.wikimedia.org/T374542) (owner: 10JMeybohm) [06:45:28] !log installing glibc bugfix updates from bookworm 12.7 point release [06:48:43] (03PS1) 10JMeybohm: kafka-main: Replace kafka-main2004 with kafka-main2009 [puppet] - 10https://gerrit.wikimedia.org/r/1072441 (https://phabricator.wikimedia.org/T363210) [06:55:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: provisionning db2229.codfw.wmnet - T373579 [06:55:26] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [06:55:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: provisionning db2229.codfw.wmnet - T373579 [06:55:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2229.codfw.wmnet with reason: provisionning db2229.codfw.wmnet - T373579 [06:55:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2229.codfw.wmnet with reason: provisionning db2229.codfw.wmnet - T373579 [06:56:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2129 in db2229 for T373579', diff saved to https://phabricator.wikimedia.org/P69015 and previous config saved to /var/cache/conftool/dbconfig/20240912-065641-arnaudb.json [06:58:48] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2129.codfw.wmnet onto db2229.codfw.wmnet [07:00:04] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. 
[07:02:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:03:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [07:04:14] (03CR) 10JMeybohm: [C:03+2] kafka-main: Replace kafka-main2004 with kafka-main2009 [puppet] - 10https://gerrit.wikimedia.org/r/1072441 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [07:07:57] (03PS1) 10Slyngshede: data.yaml: Offboarding sandeeps [puppet] - 10https://gerrit.wikimedia.org/r/1072443 [07:08:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [07:09:46] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw [07:10:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2212.codfw.wmnet with reason: Maintenance [07:10:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2212.codfw.wmnet with reason: Maintenance [07:10:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T371742)', diff saved to https://phabricator.wikimedia.org/P69016 and previous config saved to 
/var/cache/conftool/dbconfig/20240912-071034-ladsgroup.json [07:10:39] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [07:11:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1072443 (owner: 10Slyngshede) [07:13:34] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding sandeeps [puppet] - 10https://gerrit.wikimedia.org/r/1072443 (owner: 10Slyngshede) [07:18:08] !log slyngshede@cumin1002 START - Cookbook sre.idm.logout Logging Sandeeps out of all services on: 2298 hosts [07:18:43] (03PS1) 10JMeybohm: Replace kafka-main2004 with kafka-main2009 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072472 (https://phabricator.wikimedia.org/T363210) [07:18:53] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Sandeeps out of all services on: 2298 hosts [07:19:14] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [07:19:21] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [07:19:23] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [07:19:46] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [07:19:47] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [07:19:59] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [07:20:00] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [07:20:31] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [07:20:33] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [07:20:46] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [07:20:48] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [07:21:22] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[07:21:23] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [07:21:57] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [07:21:58] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:22:07] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main2004.codfw.wmnet - https://phabricator.wikimedia.org/T374594#10139799 (10JMeybohm) [07:22:11] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:22:12] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [07:22:22] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:22:37] (03PS1) 10Slyngshede: data.yaml: Offboarding MNadrofsky [puppet] - 10https://gerrit.wikimedia.org/r/1072478 [07:24:51] (03CR) 10Slyngshede: "User only appears as a name and is nowhere to be found in LDAP." [puppet] - 10https://gerrit.wikimedia.org/r/1072478 (owner: 10Slyngshede) [07:26:21] (03PS1) 10JMeybohm: Decom kafka-main2004 [puppet] - 10https://gerrit.wikimedia.org/r/1072479 (https://phabricator.wikimedia.org/T374594) [07:28:48] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw [07:31:04] (03CR) 10Muehlenhoff: [C:03+1] "Ack. He had access to some procurement ACL in Phab, I had removed that earlier the morning." 
[puppet] - 10https://gerrit.wikimedia.org/r/1072478 (owner: 10Slyngshede) [07:34:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2129.codfw.wmnet onto db2229.codfw.wmnet [07:34:49] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding MNadrofsky [puppet] - 10https://gerrit.wikimedia.org/r/1072478 (owner: 10Slyngshede) [07:36:51] (03PS1) 10Muehlenhoff: lists: Enable profile::auto_restarts::service for spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1072482 (https://phabricator.wikimedia.org/T135991) [07:37:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 1%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69017 and previous config saved to /var/cache/conftool/dbconfig/20240912-073744-arnaudb.json [07:38:56] !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [07:39:00] !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [07:46:39] !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [07:46:42] !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [07:52:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 2%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69018 and previous config saved to /var/cache/conftool/dbconfig/20240912-075250-arnaudb.json [07:58:30] !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [07:58:35] !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:02:45] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1060812 (owner: 10Slyngshede) [08:04:39] (03CR) 10Ebrahim: Enable the dark mode in Portal namespace (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [08:06:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T371742)', diff saved to https://phabricator.wikimedia.org/P69019 and previous config saved to /var/cache/conftool/dbconfig/20240912-080647-ladsgroup.json [08:06:48] (03PS18) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) [08:06:51] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [08:07:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 3%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69020 and previous config saved to /var/cache/conftool/dbconfig/20240912-080756-arnaudb.json [08:09:50] (03PS19) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) [08:15:47] (03PS1) 10Fabfur: hiera: continue haproxykafka tests on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1072484 (https://phabricator.wikimedia.org/T370668) [08:16:13] (03PS1) 10Gmodena: mw-page-content-change-enrich: fix kafka values. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210) [08:21:43] (03PS1) 10Ebrahim: Make LiquidThreads related dark mode namespace exceptions explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072487 [08:21:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P69021 and previous config saved to /var/cache/conftool/dbconfig/20240912-082154-ladsgroup.json [08:23:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 4%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69022 and previous config saved to /var/cache/conftool/dbconfig/20240912-082301-arnaudb.json [08:23:33] (03PS2) 10Ebrahim: Make LQT dark mode exceptions explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072487 [08:25:25] (03PS6) 10Slyngshede: PermissionRequest validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1060812 [08:27:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:27:50] (03CR) 10Slyngshede: PermissionRequest validation. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1060812 (owner: 10Slyngshede) [08:27:58] (03CR) 10Slyngshede: [C:03+2] PermissionRequest validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1060812 (owner: 10Slyngshede) [08:29:47] (03PS1) 10Ebrahim: Fix night mode excepted Wikidata namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072488 [08:30:15] (03Merged) 10jenkins-bot: PermissionRequest validation. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1060812 (owner: 10Slyngshede) [08:31:40] (03PS20) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) [08:33:22] (03PS3) 10Ebrahim: Make LQT dark mode exceptions explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072487 [08:33:57] (03CR) 10Muehlenhoff: [C:03+2] Puppet frontends: Remove obsolete manage_puppet_ca_file code [puppet] - 10https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [08:34:14] (03PS4) 10Ebrahim: Make LQT night mode exceptions explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072487 [08:34:46] Amir1, mutante - o/ re:puppetserver1001, we had a little outage while Amir merged (see https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=puppetserver1001&var-datasource=thanos&var-cluster=misc&from=1726091490618&to=1726107388975) - related to https://phabricator.wikimedia.org/T373527, I'll update the task [08:35:41] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:35:53] (03PS2) 10Slyngshede: Redesign menu. [software/bitu] - 10https://gerrit.wikimedia.org/r/1038608 [08:36:37] (03PS2) 10JMeybohm: mw-page-content-change-enrich: fix kafka values. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210) (owner: 10Gmodena) [08:37:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P69023 and previous config saved to /var/cache/conftool/dbconfig/20240912-083701-ladsgroup.json [08:37:06] (03CR) 10JMeybohm: [C:03+1] mw-page-content-change-enrich: fix kafka values. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210) (owner: 10Gmodena) [08:38:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 5%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69024 and previous config saved to /var/cache/conftool/dbconfig/20240912-083807-arnaudb.json [08:40:43] (03PS3) 10Slyngshede: P:idp_test: Enable permission requests on testing. [puppet] - 10https://gerrit.wikimedia.org/r/1072107 [08:42:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:43:53] (03CR) 10Slyngshede: [C:03+2] P:idp_test: Enable permission requests on testing. [puppet] - 10https://gerrit.wikimedia.org/r/1072107 (owner: 10Slyngshede) [08:44:04] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:45:14] (03CR) 10Filippo Giunchedi: [C:03+2] logging: add script to query for orphan traces [puppet] - 10https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) (owner: 10Filippo Giunchedi) [08:45:17] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139873 (10phaultfinder) [08:46:15] (03PS3) 10JMeybohm: mw-page-content-change-enrich: fix kafka values. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210) (owner: 10Gmodena) [08:46:15] (03PS2) 10JMeybohm: Replace kafka-main2004 with kafka-main2009 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072472 (https://phabricator.wikimedia.org/T363210) [08:47:02] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main2009.codfw.wmnet [08:47:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2009.codfw.wmnet [08:48:10] (03PS3) 10Filippo Giunchedi: logging: add script to query for orphan traces [puppet] - 10https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) [08:48:29] (03CR) 10Filippo Giunchedi: logging: add script to query for orphan traces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) (owner: 10Filippo Giunchedi) [08:48:37] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] logging: add script to query for orphan traces [puppet] - 10https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) (owner: 10Filippo Giunchedi) [08:49:25] !log restoring leadership for all partitions assigned to broker id 2004 on kafka-main-codfw - T363210 [08:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:28] T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210 [08:50:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:51:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10139881 (10elukey) Thanks! I created a diff from the settings dumped before your fix(es) and after, from the Redfish point of view. ` Diff for BootModeSelect: before L... 
[08:51:36] (03PS1) 10Filippo Giunchedi: jaeger: fix typo ensure vs require [puppet] - 10https://gerrit.wikimedia.org/r/1072492 [08:51:58] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:52:07] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] jaeger: fix typo ensure vs require [puppet] - 10https://gerrit.wikimedia.org/r/1072492 (owner: 10Filippo Giunchedi) [08:52:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T371742)', diff saved to https://phabricator.wikimedia.org/P69025 and previous config saved to /var/cache/conftool/dbconfig/20240912-085209-ladsgroup.json [08:52:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2216.codfw.wmnet with reason: Maintenance [08:52:13] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [08:52:20] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1071811 (https://phabricator.wikimedia.org/T374421) (owner: 10Gerrit maintenance bot) [08:52:24] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [08:52:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2216.codfw.wmnet with reason: Maintenance [08:52:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T371742)', diff saved to https://phabricator.wikimedia.org/P69026 and previous config saved to /var/cache/conftool/dbconfig/20240912-085232-ladsgroup.json [08:53:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: post db2229 bootstrap', diff saved to 
https://phabricator.wikimedia.org/P69027 and previous config saved to /var/cache/conftool/dbconfig/20240912-085312-arnaudb.json [08:54:18] !log Starting s3 codfw failover from db2209 to db2205 - T374421 [08:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:21] T374421: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T374421 [08:56:01] (03PS1) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) [08:57:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:58:09] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072472 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [08:58:22] (03CR) 10CI reject: [V:04-1] puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [08:58:48] (03CR) 10JMeybohm: [C:03+2] Replace kafka-main2004 with kafka-main2009 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072472 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [08:58:52] (03CR) 10JMeybohm: [C:03+2] mw-page-content-change-enrich: fix kafka values. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210) (owner: 10Gmodena) [08:59:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2205 to s3 primary T374421', diff saved to https://phabricator.wikimedia.org/P69028 and previous config saved to /var/cache/conftool/dbconfig/20240912-085859-arnaudb.json [08:59:41] (03PS3) 10Elukey: sre.hosts.provision: improve Supermicro's bios settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) [08:59:47] (03Merged) 10jenkins-bot: mw-page-content-change-enrich: fix kafka values. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072485 (https://phabricator.wikimedia.org/T363210) (owner: 10Gmodena) [09:00:25] (03Merged) 10jenkins-bot: Replace kafka-main2004 with kafka-main2009 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072472 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [09:00:56] FIRING: RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:00:57] (03CR) 10Elukey: sre.hosts.provision: improve Supermicro's bios settings (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:01:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T374421', diff saved to https://phabricator.wikimedia.org/P69029 and previous config saved to /var/cache/conftool/dbconfig/20240912-090157-arnaudb.json [09:02:00] T374421: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T374421 [09:03:59] (03PS2) 10Muehlenhoff: puppetserver: Pass the value of 
puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) [09:04:59] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10139909 (10ABran-WMF) >>! In T374523#10136865, @cmooney wrote: >>>! In T374523#10136856, @ABran-WMF wrote: >> I'll get to T374425 to get to T374421 and unblo... [09:06:27] (03CR) 10CI reject: [V:04-1] puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [09:06:52] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [09:07:28] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:08:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 15%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69030 and previous config saved to /var/cache/conftool/dbconfig/20240912-090818-arnaudb.json [09:12:12] (03PS4) 10Elukey: sre.hosts.provision: improve Supermicro's bios settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) [09:12:46] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:13:41] (03CR) 10Filippo Giunchedi: [C:03+1] alert: Failover from alert1001 to alert2002 [puppet] - 10https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [09:14:01] (03PS3) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) [09:14:02] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - 
https://phabricator.wikimedia.org/T374523#10139921 (10cmooney) >>! In T374523#10139909, @ABran-WMF wrote: > We can add it to today's maintenance if you're up to it. Let me know so I can add it to the... [09:14:04] (03CR) 10Filippo Giunchedi: [C:03+1] alert: Resolve alerts DNS queries to alert2002 [dns] - 10https://gerrit.wikimedia.org/r/1072326 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [09:14:08] !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kafka-main2004.codfw.wmnet [09:14:33] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10139922 (10ABran-WMF) ack, adding it to the pile [09:15:43] (03CR) 10Effie Mouzeli: "wait for it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149 (owner: 10Effie Mouzeli) [09:16:07] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:16:24] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1072479 (https://phabricator.wikimedia.org/T374594) (owner: 10JMeybohm) [09:16:42] 06SRE, 06Infrastructure-Foundations: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10139923 (10elukey) It happened again, this time to puppetserver1001. Amir was in the middle of a puppet-merge and it got stuck. OOM killer acting on the puppetser... 
[09:16:47] (03CR) 10CI reject: [V:04-1] puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [09:17:21] (03CR) 10Elukey: Swap poolcounter2003 with poolcounter2005 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [09:19:05] (03PS4) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) [09:19:44] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [09:21:22] (03PS1) 10Effie Mouzeli: app.job: update to job 3.0.0 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072500 [09:22:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [09:22:56] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [09:23:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [09:23:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:23:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-main2004.codfw.wmnet [09:23:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69031 and previous config saved to /var/cache/conftool/dbconfig/20240912-092324-arnaudb.json [09:23:27] (03CR) 10JMeybohm: [C:03+2] 
Decom kafka-main2004 [puppet] - 10https://gerrit.wikimedia.org/r/1072479 (https://phabricator.wikimedia.org/T374594) (owner: 10JMeybohm) [09:24:57] (03PS1) 10Elukey: services: switch thumbor in codfw to poolcounter2005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015) [09:25:27] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission kafka-main2004.codfw.wmnet - https://phabricator.wikimedia.org/T374594#10139958 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kafka-main2004.codfw.wmnet` - kafka-main2004.codf... [09:26:05] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1072482 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:26:30] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission kafka-main2004.codfw.wmnet - https://phabricator.wikimedia.org/T374594#10139963 (10JMeybohm) a:05JMeybohm→03None [09:27:29] (03CR) 10Elukey: [C:04-1] "Needs to be tested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [09:28:25] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [09:28:59] (03CR) 10EoghanGaffney: [C:03+1] lists: Enable profile::auto_restarts::service for spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1072482 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:31:16] (03PS2) 10Elukey: services: switch thumbor in codfw to poolcounter2005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015) [09:31:27] (03CR) 10Filippo Giunchedi: [C:04-1] "Change LGTM, though there are more users of check_ntp_peer" [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [09:31:34] (03CR) 10Elukey: "Hugh: Lemme know what you think about it :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [09:32:29] (03CR) 10Filippo Giunchedi: [C:03+1] mysql: replication lag monitoring threshold and severity change [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [09:32:37] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [09:32:59] (03CR) 10Arnaudb: [C:03+2] mysql: replication lag monitoring threshold and severity change [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [09:33:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:34:37] (03Merged) 10jenkins-bot: mysql: replication lag monitoring threshold and severity change [alerts] - 
10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [09:35:18] (03PS1) 10Effie Mouzeli: app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 [09:35:27] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [09:36:38] (03PS2) 10Effie Mouzeli: app.job: update to job 3.0.0 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072500 [09:36:46] (03PS2) 10Effie Mouzeli: app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 [09:37:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10140011 (10elukey) Updated https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1071553 and tested, it seems working. I kicked off a reimage of sretest2001, and I en... [09:38:14] (03CR) 10Muehlenhoff: [C:03+2] lists: Enable profile::auto_restarts::service for spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1072482 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:38:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:38:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69032 and previous config saved to /var/cache/conftool/dbconfig/20240912-093829-arnaudb.json [09:38:57] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:39:07] (03CR) 10Hnowlan: [C:03+1] "Sounds good to me!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [09:39:10] !incidents [09:39:10] 5160 (UNACKED) ProbeDown sre (10.2.2.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 eqiad) [09:39:11] 5158 (RESOLVED) NELHigh sre (thanos-rule tcp.address_unreachable) [09:39:13] !ack 5160 [09:39:14] 5160 (ACKED) ProbeDown sre (10.2.2.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 eqiad) [09:39:27] * vgutierrez looking [09:39:31] that one again [09:39:48] worker crunch in eqiad [09:40:11] p99 at 5 minutes, awesome [09:40:14] another crawler? [09:40:18] keeps on giving [09:40:22] high memcache errors, or is that just a consequence? [09:40:37] (03CR) 10Muehlenhoff: "PCC error seems like a temporal glitch in the matrix" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [09:40:42] vgutierrez: no, I bet it's just taking ages to respond, there's like 4rps [09:41:07] https://grafana.wikimedia.org/goto/qmh61Y6SR?orgId=1 [09:41:10] yeah.. 
no traffic at all [09:41:31] weird, the executor isn't loaded though [09:41:45] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [09:41:53] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [09:43:16] pybal is keeping not healthy realservers pooled [09:43:34] vgutierrez: there's only 2 replicas [09:43:39] so yeah not surprising [09:43:57] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:44:13] claime: well.. pybal has 210 realservers configured for wikifunctions [09:44:26] k8s magic :) [09:44:33] vgutierrez: yeah, k8s x) [09:45:56] FIRING: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:46:56] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [09:46:57] (PS1) Brouberol: cloudnative-pg-cluster: enable wal upload / backups to s3 by default [deployment-charts] - https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) [09:47:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook -
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:47:38] (03CR) 10CI reject: [V:04-1] cloudnative-pg-cluster: enable wal upload / backups to s3 by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) (owner: 10Brouberol) [09:47:49] I'm guessing that's not related at all to wikifunctions [09:48:00] (03PS1) 10Muehlenhoff: Only run puppetserver spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1072505 [09:48:34] (03CR) 10CI reject: [V:04-1] Only run puppetserver spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1072505 (owner: 10Muehlenhoff) [09:49:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T371742)', diff saved to https://phabricator.wikimedia.org/P69033 and previous config saved to /var/cache/conftool/dbconfig/20240912-094912-ladsgroup.json [09:49:16] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [09:50:56] RESOLVED: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:51:56] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [09:52:12] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - 
https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [09:52:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:52:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:53:00] (03PS2) 10Muehlenhoff: Only run puppetserver spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1072505 [09:53:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:53:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69034 and previous config saved to /var/cache/conftool/dbconfig/20240912-095335-arnaudb.json [09:53:40] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1246.eqiad.wmnet with OS bookworm [09:55:56] (03PS1) 10Clément Goubert: mw-wikifunctions: Raise replicas to 6 per DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072508 [09:57:21] Folks myself and Ben are doing a test on cephosd1001 to test failover for the Anycast BGP service on it [09:57:32] we are not downtiming the host so we can observe what alerts trigger - please ignore [09:57:45] RESOLVED: 
MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:57:59] (03CR) 10Clément Goubert: [C:03+1] "LGTM, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French) [09:58:28] (03CR) 10JMeybohm: [C:03+1] mw-wikifunctions: Raise replicas to 6 per DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072508 (owner: 10Clément Goubert) [09:58:48] (03CR) 10Hnowlan: [C:03+1] mw-wikifunctions: Raise replicas to 6 per DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072508 (owner: 10Clément Goubert) [09:58:54] (03CR) 10Clément Goubert: [C:03+2] mw-wikifunctions: Raise replicas to 6 per DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072508 (owner: 10Clément Goubert) [09:59:04] !log stopping envoyproxy on cephosd1001 [09:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:46] (03Merged) 10jenkins-bot: mw-wikifunctions: Raise replicas to 6 per DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072508 (owner: 10Clément Goubert) [09:59:53] (03CR) 10Hnowlan: [C:03+1] aptrepo: ffmpeg bullseye component [puppet] - 10https://gerrit.wikimedia.org/r/1072282 (https://phabricator.wikimedia.org/T374502) (owner: 10Scott French) [10:00:12] !log Increasing mw-wikifunctions replicas to 6 [10:00:23] !log restarted envoyproxy on cephosd1001 [10:00:31] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [10:00:46] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [10:01:15] (03PS5) 10Brouberol: cloudnative-pg-cluster: enable wal upload / backups to s3 by default 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1072504 (https://phabricator.wikimedia.org/T372281) [10:01:37] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [10:04:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P69035 and previous config saved to /var/cache/conftool/dbconfig/20240912-100419-ladsgroup.json [10:04:38] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [10:04:47] (03PS2) 10EoghanGaffney: lists: Add ATS map for lists.wikimedia.org -> lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1072247 [10:07:18] (03CR) 10Hashar: [C:03+1] logging: Fix WikimediaDebug "Verbose logging" option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583) (owner: 10Bartosz Dziewoński) [10:07:37] (03CR) 10Elukey: [C:03+2] services: switch thumbor in codfw to poolcounter2005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072501 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [10:07:38] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [10:07:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [10:08:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [10:08:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T367856)', diff saved to https://phabricator.wikimedia.org/P69036 and previous config saved to /var/cache/conftool/dbconfig/20240912-100811-ladsgroup.json [10:08:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: post db2229 bootstrap', diff saved to https://phabricator.wikimedia.org/P69037 and previous config saved to /var/cache/conftool/dbconfig/20240912-100841-arnaudb.json [10:11:00] !log 
elukey@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync [10:11:05] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync [10:19:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P69038 and previous config saved to /var/cache/conftool/dbconfig/20240912-101927-ladsgroup.json [10:20:58] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:22:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:25:28] (03PS3) 10Effie Mouzeli: app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 [10:25:35] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for et-0-0-31-100.ssw1-f1-eqiad.eqiad.wmnet - cmooney@cumin1002" [10:25:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for et-0-0-31-100.ssw1-f1-eqiad.eqiad.wmnet - cmooney@cumin1002" [10:25:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:25:43] !log stopping envoyproxy on cephosd1001 [10:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:26] (03CR) 10CI reject: [V:04-1] app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 (owner: 10Effie Mouzeli) [10:31:17] (03PS4) 10Effie Mouzeli: app.job: update to 
job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 [10:32:13] (03CR) 10CI reject: [V:04-1] app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 (owner: 10Effie Mouzeli) [10:32:58] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache 2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.0.e.f.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa. on all recursors [10:33:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.0.e.f.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa. on all recursors [10:34:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T371742)', diff saved to https://phabricator.wikimedia.org/P69039 and previous config saved to /var/cache/conftool/dbconfig/20240912-103434-ladsgroup.json [10:34:38] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [10:42:09] (03PS5) 10Effie Mouzeli: app.job: update to job 3.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 [10:45:06] (03CR) 10Effie Mouzeli: [C:03+1] shellbox: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072246 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [10:46:00] (03CR) 10Clément Goubert: [C:03+1] shellbox: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072246 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [10:49:04] (03CR) 10Clément Goubert: [C:03+1] service: add basic configuration for mwdebug-next [puppet] - 10https://gerrit.wikimedia.org/r/1071933 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [10:50:26] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [10:54:31] (03CR) 10Hnowlan: [C:03+2] shellbox: bump 
image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072246 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [10:55:56] (03Merged) 10jenkins-bot: shellbox: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072246 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [10:57:51] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: eqiad1: fix instances_ip_ranges parameter [puppet] - 10https://gerrit.wikimedia.org/r/1072513 (https://phabricator.wikimedia.org/T374020) [10:57:58] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072513 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [10:59:15] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [10:59:41] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [11:00:44] (03Abandoned) 10Effie Mouzeli: app.job: update to job 2.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149 (owner: 10Effie Mouzeli) [11:02:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:03:05] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [11:03:53] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [11:04:13] (03PS2) 10Arturo Borrero Gonzalez: openstack: keystone: eqiad1: fix instances_ip_ranges parameter [puppet] - 10https://gerrit.wikimedia.org/r/1072513 (https://phabricator.wikimedia.org/T374020) [11:04:19] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072513 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [11:05:09] 10ops-esams, 
10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10140285 (10Vgutierrez) @RobH / @wiki_willy could we get this task prioritized on your side? [11:05:31] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2093.codfw.wmnet [11:05:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2093.codfw.wmnet [11:06:05] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2029.codfw.wmnet [11:06:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2029.codfw.wmnet [11:07:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [11:07:40] jouncebot: nowandnext [11:07:41] No deployments scheduled for the next 0 hour(s) and 52 minute(s) [11:07:41] In 0 hour(s) and 52 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1200) [11:12:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [11:12:10] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: keystone: eqiad1: fix instances_ip_ranges parameter [puppet] - 10https://gerrit.wikimedia.org/r/1072513 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [11:13:24] !log hnowlan@deploy1003 helmfile [eqiad] START 
helmfile.d/services/shellbox-video: apply [11:14:10] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [11:17:56] 06SRE: Arelion transport to eqsin from codfw maxing out - Sept 12 2024 - https://phabricator.wikimedia.org/T374608 (10cmooney) 03NEW p:05Triage→03High [11:29:18] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:29:47] (03PS1) 10Jcrespo: mariadb: Increase buffer pool for db1171:s8, which is lagging [puppet] - 10https://gerrit.wikimedia.org/r/1072515 (https://phabricator.wikimedia.org/T374610) [11:30:09] 06SRE: Arelion transport to eqsin from codfw maxing out - Sept 12 2024 - https://phabricator.wikimedia.org/T374608#10140350 (10cmooney) Nothing in Superset is jumping out at me. From the netflows I suspect it may be AS138341 / SHOPEE SINGAPORE PRIVATE LIMITED. The spike in traffic starting yesterday afternoon... [11:31:25] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1038608 (owner: 10Slyngshede) [11:32:16] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. 
[11:33:48] (03PS2) 10Urbanecm: Babel: Set BabelUseCommunityConfiguration to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071916 (https://phabricator.wikimedia.org/T374611) [11:33:51] (03PS2) 10Urbanecm: [beta] Babel: Use CommunityConfiguration in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071917 (https://phabricator.wikimedia.org/T374611) [11:33:57] jouncebot: nowandnext [11:33:58] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [11:33:58] In 0 hour(s) and 26 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1200) [11:34:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071916 (https://phabricator.wikimedia.org/T374611) (owner: 10Urbanecm) [11:35:01] (03Merged) 10jenkins-bot: Babel: Set BabelUseCommunityConfiguration to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071916 (https://phabricator.wikimedia.org/T374611) (owner: 10Urbanecm) [11:35:28] (03PS3) 10Urbanecm: [beta] Babel: Use CommunityConfiguration in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071917 (https://phabricator.wikimedia.org/T374611) [11:35:34] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1071916|Babel: Set BabelUseCommunityConfiguration to false (T374611)]] [11:35:38] T374611: Switch BabelUseCommunityConfiguration to true on Beta cluster - https://phabricator.wikimedia.org/T374611 [11:36:31] (03PS1) 10Urbanecm: [beta] Babel: Use CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072517 (https://phabricator.wikimedia.org/T374611) [11:38:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [11:38:55] (03CR) 10Jcrespo: [C:03+2] mariadb: Increase buffer pool for db1171:s8, which is lagging [puppet] 
- 10https://gerrit.wikimedia.org/r/1072515 (https://phabricator.wikimedia.org/T374610) (owner: 10Jcrespo) [11:42:25] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:59] !log restarting db1171:s7 mysql process T374610 [11:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:03] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071916|Babel: Set BabelUseCommunityConfiguration to false (T374611)]] (duration: 11m 28s) [11:47:03] T374610: db1171:s8 is having performance issues and lagging - https://phabricator.wikimedia.org/T374610 [11:47:07] T374611: Switch BabelUseCommunityConfiguration to true on Beta cluster - https://phabricator.wikimedia.org/T374611 [11:47:22] (03CR) 10Urbanecm: [C:03+2] [beta] Babel: Use CommunityConfiguration in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071917 (https://phabricator.wikimedia.org/T374611) (owner: 10Urbanecm) [11:47:25] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:48:03] (03Merged) 10jenkins-bot: [beta] Babel: Use CommunityConfiguration in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071917 (https://phabricator.wikimedia.org/T374611) (owner: 10Urbanecm) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1200) [12:01:14] (03CR) 10Ladsgroup: "I'd say let's finish prod dbs (and decommission old ones) and then start working on dbproxies. 
So many in progress stuff is hard for me to" [puppet] - 10https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) (owner: 10Arnaudb) [12:04:12] (03CR) 10Ladsgroup: "Now we have two candidate masters for s6 in codfw which would break switchmaster tool if you try to use it for s6. We should do something " [puppet] - 10https://gerrit.wikimedia.org/r/1072216 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [12:07:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [12:11:18] 07sre-alert-triage, 10Data-Platform-SRE (2024.09.06 - 2024.09.27): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10140474 (10BTullis) Listed the logical devices: ` btullis@an-worker1085:~$ sudo megacli -LDInfo -Lall -a0|grep Drive: Virtual Drive: 0 (Target Id: 0) Virtual Drive: 1... 
[12:11:59] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1085.eqiad.wmnet [12:12:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [12:13:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:15:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [12:18:10] (03CR) 10Slyngshede: [C:03+2] Redesign menu. [software/bitu] - 10https://gerrit.wikimedia.org/r/1038608 (owner: 10Slyngshede) [12:18:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:19:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1085.eqiad.wmnet [12:20:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [12:20:29] (03CR) 10Elukey: "One nit and we are good to go!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [12:20:30] (03Merged) 10jenkins-bot: Redesign menu. [software/bitu] - 10https://gerrit.wikimedia.org/r/1038608 (owner: 10Slyngshede) [12:21:48] (03PS1) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 [12:22:18] (03PS2) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 [12:22:20] (03PS1) 10Btullis: Enable the performace CPU governor on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072529 (https://phabricator.wikimedia.org/T365878) [12:23:03] 07sre-alert-triage, 10Data-Platform-SRE (2024.09.06 - 2024.09.27): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10140492 (10BTullis) 05Open→03Resolved [12:23:35] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3964/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072529 (https://phabricator.wikimedia.org/T365878) (owner: 10Btullis) [12:24:40] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2390.codfw.wmnet [12:25:18] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2390.codfw.wmnet [12:25:23] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2394.codfw.wmnet [12:25:38] 06SRE: Arelion transport to eqsin from codfw maxing out - Sept 12 2024 - https://phabricator.wikimedia.org/T374608#10140496 (10cmooney) We added a requestctl rule for IP range 147.136.175.0/24 which has brought usage back within acceptable levels and we no longer see dropped packets on the link. {F57502814 widt... 
[12:26:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2394.codfw.wmnet [12:26:05] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2395.codfw.wmnet [12:26:38] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2395.codfw.wmnet [12:26:43] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2396.codfw.wmnet [12:27:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2396.codfw.wmnet [12:27:25] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2397.codfw.wmnet [12:27:59] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2397.codfw.wmnet [12:28:04] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2398.codfw.wmnet [12:28:38] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2398.codfw.wmnet [12:28:43] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2399.codfw.wmnet [12:29:04] jouncebot: nowandnext [12:29:04] For the next 0 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1200) [12:29:04] In 0 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1300) [12:29:11] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host deploy2002.codfw.wmnet [12:29:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2399.codfw.wmnet [12:29:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10140498 (10phaultfinder) [12:29:46] (03PS2) 10Jforrester: On wikis with the Translate 
extension, allow thanking of translationreview log actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072315 [12:29:51] (03CR) 10Jforrester: "Aha, right, the log group is translationreview but the action is pagetranslation. Meh." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072315 (owner: 10Jforrester) [12:30:48] (03CR) 10Urbanecm: [C:03+2] "beta only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072517 (https://phabricator.wikimedia.org/T374611) (owner: 10Urbanecm) [12:30:52] (03PS1) 10Alexandros Kosiaris: Rename mw239[0456789] to wikikube-worker21[07-13] [puppet] - 10https://gerrit.wikimedia.org/r/1072532 (https://phabricator.wikimedia.org/T372878) [12:31:14] !log depool mw239[0456789] for re-numbering, renaming and reimaging. [12:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:30] (03Merged) 10jenkins-bot: [beta] Babel: Use CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072517 (https://phabricator.wikimedia.org/T374611) (owner: 10Urbanecm) [12:31:33] (03PS1) 10Muehlenhoff: Switch deploy2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1072533 (https://phabricator.wikimedia.org/T349619) [12:34:08] (03CR) 10Muehlenhoff: [C:03+2] Switch deploy2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1072533 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:34:45] (03CR) 10Alexandros Kosiaris: [C:03+2] Rename mw239[0456789] to wikikube-worker21[07-13] [puppet] - 10https://gerrit.wikimedia.org/r/1072532 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris) [12:35:13] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: sync [12:35:18] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - 
kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv [12:35:18] e - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:35:30] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv [12:35:30] e - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:36:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2123.codfw.wmnet with reason: Maintenance [12:36:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2123.codfw.wmnet with reason: Maintenance [12:36:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T367781)', diff saved to https://phabricator.wikimedia.org/P69040 and previous config saved to /var/cache/conftool/dbconfig/20240912-123631-arnaudb.json [12:36:32] (03PS1) 10Aklapper: Weekly Phabricator data for Tech News: Make output MediaWiki pastable [puppet] - 10https://gerrit.wikimedia.org/r/1072535 (https://phabricator.wikimedia.org/T373952) [12:36:37] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 
[12:37:39] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] logging: Fix WikimediaDebug "Verbose logging" option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583) (owner: 10Bartosz Dziewoński) [12:38:23] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [12:40:25] 06SRE, 06Infrastructure-Foundations, 10netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614 (10cmooney) 03NEW p:05Triage→03Medium [12:41:26] !log thumbor codfw on wikikube moved to poolcounter2005 - T332015 [12:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:29] T332015: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015 [12:42:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on mw2394:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:44:17] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2390 to wikikube-worker2107 [12:44:18] (03PS1) 10Aklapper: Weekly Phabricator data for Tech News: Add Auto-Submitted [puppet] - 10https://gerrit.wikimedia.org/r/1072536 [12:44:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T367781)', diff saved to https://phabricator.wikimedia.org/P69041 and previous config saved to /var/cache/conftool/dbconfig/20240912-124421-arnaudb.json [12:44:26] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [12:44:37] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:45:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2209.codfw.wmnet with reason: Maintenance [12:45:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian) [12:46:01] 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 2 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10140598 (10elukey) 05Resolved→03Open Using this task to create another VM, poolcounter2006. [12:46:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2209.codfw.wmnet with reason: Maintenance [12:46:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2139.codfw.wmnet with reason: Maintenance [12:46:09] 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 2 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10140601 (10elukey) [12:46:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2139.codfw.wmnet with reason: Maintenance [12:46:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T370903)', diff saved to https://phabricator.wikimedia.org/P69042 and previous config saved to /var/cache/conftool/dbconfig/20240912-124626-ladsgroup.json [12:46:30] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:47:40] FIRING: [6x] KubernetesRsyslogDown: rsyslog on mw2394:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:47:48] (03PS1) 10Elukey: Add configuration for poolcounter2006 [puppet] - 10https://gerrit.wikimedia.org/r/1072537 (https://phabricator.wikimedia.org/T374520) [12:48:03] MatmaRex, Lucas_WMDE i'd like to try the final bit of the MOS namespace, enwiki, today [12:48:13] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: enable new vxlan-based subnet 
CIDR in cloudgw and keystone [puppet] - 10https://gerrit.wikimedia.org/r/1072538 (https://phabricator.wikimedia.org/T374020) [12:48:29] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2390 to wikikube-worker2107 - akosiaris@cumin1002" [12:48:37] (03PS2) 10Elukey: Add configuration for poolcounter2006 [puppet] - 10https://gerrit.wikimedia.org/r/1072537 (https://phabricator.wikimedia.org/T374520) [12:49:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2390 to wikikube-worker2107 - akosiaris@cumin1002" [12:49:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:49:03] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2107 [12:49:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2107 [12:49:54] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2390 to wikikube-worker2107 [12:50:09] (03PS2) 10Arturo Borrero Gonzalez: codfw1dev: enable new vxlan-based subnet CIDR in cloudgw and keystone [puppet] - 10https://gerrit.wikimedia.org/r/1072538 (https://phabricator.wikimedia.org/T374020) [12:50:09] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2390 to wikikube-worker2107 completed: - mw2390 (**... [12:50:15] (03PS4) 10C. 
Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) [12:50:18] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2394 to wikikube-worker2108 [12:50:39] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:50:55] (03CR) 10Muehlenhoff: [C:03+1] "LGTM. Let's also create it in row B, like poolcounter2004." [puppet] - 10https://gerrit.wikimedia.org/r/1072537 (https://phabricator.wikimedia.org/T374520) (owner: 10Elukey) [12:50:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host deploy2002.codfw.wmnet [12:51:28] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072538 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [12:51:37] (03CR) 10Elukey: [C:03+2] Add configuration for poolcounter2006 [puppet] - 10https://gerrit.wikimedia.org/r/1072537 (https://phabricator.wikimedia.org/T374520) (owner: 10Elukey) [12:51:58] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:52:39] (03PS3) 10Arturo Borrero Gonzalez: codfw1dev: enable new vxlan-based subnet CIDR in cloudgw and keystone [puppet] - 10https://gerrit.wikimedia.org/r/1072538 (https://phabricator.wikimedia.org/T374020) [12:52:40] 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 13Patch-For-Review, 10Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#10140638 (10fnegri) [12:53:03] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072538 
(https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [12:53:20] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host poolcounter2006.codfw.wmnet [12:53:54] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2394 to wikikube-worker2108 - akosiaris@cumin1002" [12:54:11] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2394 to wikikube-worker2108 - akosiaris@cumin1002" [12:54:11] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:54:12] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [12:54:12] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2108 [12:54:24] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2108 [12:55:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2394 to wikikube-worker2108 [12:55:13] (03CR) 10C. Scott Ananian: "Post-deploy maintenance script commands listed at T363538#10140642" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian) [12:55:14] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140645 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2394 to wikikube-worker2108 completed: - mw2394 (**... 
[12:56:55] (03CR) 10Volans: [C:03+1] "LGTM, I trust your testing :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [12:57:09] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2395 to wikikube-worker2109 [12:57:34] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter2006.codfw.wmnet - elukey@cumin1002" [12:58:15] (03PS1) 10Arnaudb: bashrc: add 2 helper function [puppet] - 10https://gerrit.wikimedia.org/r/1072539 [12:58:17] (03CR) 10Arnaudb: [C:03+2] bashrc: add 2 helper function [puppet] - 10https://gerrit.wikimedia.org/r/1072539 (owner: 10Arnaudb) [12:58:44] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter2006.codfw.wmnet - elukey@cumin1002" [12:58:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:58:45] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache poolcounter2006.codfw.wmnet on all recursors [12:58:45] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:58:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) poolcounter2006.codfw.wmnet on all recursors [12:59:11] (03CR) 10Ssingh: [V:03+1] "Nice catch, thank you! I am going to leave those alone since they are used for the anycast checks and add a new custom command." 
[puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [12:59:15] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter2006.codfw.wmnet - elukey@cumin1002" [12:59:20] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter2006.codfw.wmnet - elukey@cumin1002" [12:59:23] (03PS2) 10Ssingh: P:ntp and nagios_core: update check_ntp_peer to include stratum checks [puppet] - 10https://gerrit.wikimedia.org/r/1072276 [12:59:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P69043 and previous config saved to /var/cache/conftool/dbconfig/20240912-125928-arnaudb.json [12:59:36] (03CR) 10CI reject: [V:04-1] P:ntp and nagios_core: update check_ntp_peer to include stratum checks [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1300). [13:00:05] MatmaRex and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] hi. 
and hi cscott [13:00:27] * cscott waves [13:00:36] i'm here ready to compete for the t-shirt [13:00:57] (03PS3) 10Ssingh: P:ntp and nagios_core: update check_ntp_peer to include stratum checks [puppet] - 10https://gerrit.wikimedia.org/r/1072276 [13:01:52] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3966/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [13:02:03] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host poolcounter2006.codfw.wmnet with OS bookworm [13:03:12] 06SRE, 06Infrastructure-Foundations, 10netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10140660 (10cmooney) [13:03:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1192.eqiad.wmnet with reason: Maintenance [13:03:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1192.eqiad.wmnet with reason: Maintenance [13:03:58] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2395 to wikikube-worker2109 - akosiaris@cumin1002" [13:04:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T367781)', diff saved to https://phabricator.wikimedia.org/P69044 and previous config saved to /var/cache/conftool/dbconfig/20240912-130400-arnaudb.json [13:04:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2395 to wikikube-worker2109 - akosiaris@cumin1002" [13:04:03] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:04:03] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2109 [13:04:04] T367781: Drop deprecated abuse filter 
fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:04:09] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2107.codfw.wmnet [13:04:19] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140667 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by akosiaris@cumin1002 Renumbering for host wikikube-worker2107.codfw.wm... [13:04:28] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2107.codfw.wmnet with OS bullseye [13:04:38] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140668 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2107.codfw.wmnet with OS bull... [13:04:38] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2107 [13:04:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T370903)', diff saved to https://phabricator.wikimedia.org/P69045 and previous config saved to /var/cache/conftool/dbconfig/20240912-130441-ladsgroup.json [13:04:43] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:04:45] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:05:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2109 [13:06:03] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2108.codfw.wmnet [13:06:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T367781)', diff saved to https://phabricator.wikimedia.org/P69046 and previous config saved 
to /var/cache/conftool/dbconfig/20240912-130608-arnaudb.json [13:06:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2395 to wikikube-worker2109 [13:06:15] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140673 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by akosiaris@cumin1002 Renumbering for host wikikube-worker2108.codfw.wm... [13:06:19] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2108.codfw.wmnet with OS bullseye [13:06:27] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140678 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2395 to wikikube-worker2109 completed: - mw2395 (**... [13:06:30] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140680 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2108.codfw.wmnet with OS bull... [13:06:34] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2396 to wikikube-worker2110 [13:07:21] anyone deploying? 
[13:07:43] i was kinda hoping Lucas_WMDE was the deployer today, since he did the MOS namespace stuff on Tuesday [13:08:02] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2107 - akosiaris@cumin1002" [13:08:18] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2107 - akosiaris@cumin1002" [13:08:18] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:08:18] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2107.codfw.wmnet 53.0.192.10.in-addr.arpa 3.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:08:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2107.codfw.wmnet 53.0.192.10.in-addr.arpa 3.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:08:22] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2107 [13:08:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2107 [13:08:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2107 [13:08:40] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:08:49] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2108 [13:08:50] PROBLEM - Hadoop NodeManager on an-worker1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:08:58] PROBLEM - Hadoop NodeManager on 
an-worker1122 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:09:01] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2015.codfw.wmnet [13:09:36] PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:09:38] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2109.codfw.wmnet [13:09:46] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140703 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by akosiaris@cumin1002 Renumbering for host wikikube-worker2109.codfw.wm... [13:09:51] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2109.codfw.wmnet with OS bullseye [13:10:02] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140704 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2109.codfw.wmnet with OS bull... 
[13:10:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2015.codfw.wmnet [13:11:24] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:11:53] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2396 to wikikube-worker2110 - akosiaris@cumin1002" [13:11:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2396 to wikikube-worker2110 - akosiaris@cumin1002" [13:11:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:11:59] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2110 [13:12:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2110 [13:12:41] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:12:59] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2396 to wikikube-worker2110 [13:13:11] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140707 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2396 to wikikube-worker2110 completed: - mw2396 (**... 
[13:13:12] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:13:32] PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:14:35] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2397 to wikikube-worker2111 [13:14:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P69047 and previous config saved to /var/cache/conftool/dbconfig/20240912-131436-arnaudb.json [13:14:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:14:58] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2108.codfw.wmnet 58.0.192.10.in-addr.arpa 8.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:15:00] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:15:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2108.codfw.wmnet 58.0.192.10.in-addr.arpa 8.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:15:02] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2108 [13:15:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2108 [13:15:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2108 [13:15:31] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host 
wikikube-worker2109 [13:16:10] PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:17:05] i guess there's no backport window then [13:18:09] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on poolcounter2006.codfw.wmnet with reason: host reimage [13:18:21] (03PS1) 10Muehlenhoff: puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) [13:18:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10140713 (10elukey) Something not really great: on sretest2001 one of the 10G interfaces has a link up, that I can confirm via BIOS, but not via Redfish. {F57502926} `... [13:19:01] MatmaRex: I am here [13:19:02] sorry [13:19:03] :) [13:19:25] hurray! [13:19:32] oh! 
thanks [13:19:45] we should elevate you both to the rank of deployer [13:19:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P69048 and previous config saved to /var/cache/conftool/dbconfig/20240912-131948-ladsgroup.json [13:19:50] PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:19:59] so much responsibility :( [13:20:14] (03PS2) 10Ladsgroup: admin: Add echukwukere to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1072311 (https://phabricator.wikimedia.org/T374386) [13:20:15] it sounds scarier than it really is [13:20:19] (03CR) 10Ladsgroup: [V:03+2 C:03+2] admin: Add echukwukere to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1072311 (https://phabricator.wikimedia.org/T374386) (owner: 10Ladsgroup) [13:20:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583) (owner: 10Bartosz Dziewoński) [13:20:37] lets fix WikimediaDebug [13:20:41] <3 [13:20:42] i think i might actually still technically have the permission bits, from when parsoid was a service, but i haven't deployed in like 5 years. 
[13:20:45] (03CR) 10Vgutierrez: [C:03+1] hiera: continue haproxykafka tests on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1072484 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [13:20:46] sorry I did not spot the missing `$` [13:20:47] (03CR) 10CI reject: [V:04-1] puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [13:20:47] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:20:54] I ran into that last night and worried I was doing it wrongly. [13:21:11] (03Merged) 10jenkins-bot: logging: Fix WikimediaDebug "Verbose logging" option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072330 (https://phabricator.wikimedia.org/T374583) (owner: 10Bartosz Dziewoński) [13:21:15] Thanks for fixing so promptly! [13:21:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P69049 and previous config saved to /var/cache/conftool/dbconfig/20240912-132116-arnaudb.json [13:21:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on poolcounter2006.codfw.wmnet with reason: host reimage [13:21:31] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072330|logging: Fix WikimediaDebug "Verbose logging" option (T374583)]] [13:21:35] T374583: Uncaught UnexpectedValueException: Udp transport "udp:///XWikimediaDebug" must specify a host - https://phabricator.wikimedia.org/T374583 [13:21:41] cscott: sounds like you are all set so! 
:] Next thing is using `scap backport 1071067`, knowing how to use the WikimediaDebug extension, and that is pretty much it [13:21:51] RECOVERY - Hadoop NodeManager on an-worker1079 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:21:53] (03CR) 10Fabfur: [C:03+2] hiera: continue haproxykafka tests on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1072484 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [13:22:00] (03PS3) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 [13:22:07] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10140725 (10elukey) Nasty issue found for sretest2001: T365167#10140713 In the provision cookbook we loop through t... 
[13:22:25] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [13:22:33] RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:22:50] 06SRE, 06Infrastructure-Foundations, 10netops: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619 (10cmooney) 03NEW p:05Triage→03Low [13:22:51] PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:22:59] yeah, except for the "knowing what to do if things go wrong" part [13:23:09] that is where releng is useful :D [13:23:14] we need a panic button really [13:23:31] that scream loudly: RELENG COME ASSIST PLEASE [13:23:31] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2397 to wikikube-worker2111 - akosiaris@cumin1002" [13:23:39] !log hashar@deploy1003 matmarex, hashar: Backport for [[gerrit:1072330|logging: Fix WikimediaDebug "Verbose logging" option (T374583)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:23:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2397 to wikikube-worker2111 - akosiaris@cumin1002" [13:23:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:23:58] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2111 [13:24:03] tested, I can't see errors anymore [13:24:05] !log 
hashar@deploy1003 matmarex, hashar: Continuing with sync [13:24:09] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2111 [13:24:21] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2109 - akosiaris@cumin1002" [13:24:25] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2109 - akosiaris@cumin1002" [13:24:25] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:24:25] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2109.codfw.wmnet 59.0.192.10.in-addr.arpa 9.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:24:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2109.codfw.wmnet 59.0.192.10.in-addr.arpa 9.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:24:29] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2109 [13:24:31] damn [13:24:39] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2109 [13:24:39] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2109 [13:24:45] that `!log` entry spam on irc is really verbose [13:24:47] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2397 to wikikube-worker2111 [13:24:56] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140757 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2397 to wikikube-worker2111 completed: - mw2397 (**... [13:25:02] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2107.codfw.wmnet with reason: host reimage [13:25:14] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2110.codfw.wmnet [13:25:25] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140759 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by akosiaris@cumin1002 Renumbering for host wikikube-worker2110.codfw.wm... [13:25:29] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2110.codfw.wmnet with OS bullseye [13:25:39] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2110 [13:25:41] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140760 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2110.codfw.wmnet with OS bull... [13:25:53] (03CR) 10Ladsgroup: "I confirmed it oob." 
[puppet] - 10https://gerrit.wikimedia.org/r/1072308 (https://phabricator.wikimedia.org/T374008) (owner: 10Ladsgroup) [13:25:54] (03PS2) 10Muehlenhoff: puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) [13:25:57] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2398 to wikikube-worker2112 [13:26:12] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:26:24] /ignore logmsgbot [13:26:26] /ignore wikibugs [13:26:29] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2111.codfw.wmnet [13:26:41] yeah that is quieter [13:26:42] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2111.codfw.wmnet with OS bullseye [13:26:43] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140764 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by akosiaris@cumin1002 Renumbering for host wikikube-worker2111.codfw.wm... [13:26:51] RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:26:52] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10140765 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host wikikube-worker2111.codfw.wmnet with OS bull... 
[13:28:10] and my guess is the deployment is going to fail as the kubernetes workers are being reimaged [13:28:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2107.codfw.wmnet with reason: host reimage [13:28:15] hashar: nope [13:28:19] (03CR) 10CI reject: [V:04-1] puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [13:28:26] workers are depooled [13:28:32] awesome! :-] [13:28:37] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072330|logging: Fix WikimediaDebug "Verbose logging" option (T374583)]] (duration: 07m 06s) [13:28:40] T374583: Uncaught UnexpectedValueException: Udp transport "udp:///XWikimediaDebug" must specify a host - https://phabricator.wikimedia.org/T374583 [13:28:47] claime: I am quite happy that got fixed! [13:28:49] I've removed the SRE tag from the task for IP renumbering that was spamming -operations as well [13:28:50] (03PS1) 10Brouberol: cloudnative-pg-cluster: setup good defaults allowing a cluster to be restored [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072546 (https://phabricator.wikimedia.org/T372281) [13:28:59] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [13:29:06] so it should be a little quieter [13:29:09] thanks [13:29:28] for the `!log` spam, I am merely relaying a complain I have seen yesterday or earlier about the same [13:29:33] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2110 - akosiaris@cumin1002" [13:29:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T367781)', diff saved to https://phabricator.wikimedia.org/P69050 and previous config saved to 
/var/cache/conftool/dbconfig/20240912-132943-arnaudb.json [13:29:47] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:29:47] though that was for helm which is in some case very verbose [13:29:48] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2110 - akosiaris@cumin1002" [13:29:48] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:29:49] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2110.codfw.wmnet 60.0.192.10.in-addr.arpa 0.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:29:52] anyway lets do cscott patch [13:29:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2110.codfw.wmnet 60.0.192.10.in-addr.arpa 0.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:29:53] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2110 [13:30:02] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:30:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2110 [13:30:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2110 [13:30:09] MatmaRex might be quicker? [13:30:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: 10C. 
Scott Ananian) [13:30:15] the cleanup titles on enwiki will take ~30min [13:30:25] hashar: that is difficult to address as we do want the cookbooks to log [13:30:27] ah [13:30:32] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet [13:30:34] cscott: MatmaRex well I can deploy both patches [13:30:36] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2111 [13:30:57] my other thing is just some maintenance script runs, no related patch [13:30:59] hmm no MatmaRex one is just about running the commands as I get it [13:31:03] i don't know how long DeleteTag takes on commons [13:31:10] (03PS4) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 [13:31:14] RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:31:14] (03Merged) 10jenkins-bot: Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian) [13:31:22] well I imagine you can do them in parallel from the mwmaint hosts? 
[13:31:26] a few seconds, there's only a couple of these tags [13:31:34] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1071067|Elevate pseudo-namespace MOS to a real namespace on enwiki (T363538)]] [13:31:39] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [13:31:44] yeah, i figured that would be quick(er than mine) [13:32:08] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2108.codfw.wmnet with reason: host reimage [13:32:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:32:24] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2112 [13:32:36] RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:32:45] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:32:50] and I can't babysit it for half an hour because I have an appointment :/ [13:32:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2112 [13:33:34] !log hashar@deploy1003 cscott, hashar: Backport for [[gerrit:1071067|Elevate pseudo-namespace MOS to a real namespace on enwiki (T363538)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:33:36] MatmaRex: should I run the first deleteTag? [13:33:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2398 to wikikube-worker2112 [13:33:45] !log hashar@deploy1003 cscott, hashar: Continuing with sync [13:33:55] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2399 to wikikube-worker2113 [13:33:56] i can babysit, maybe I can sit in the screen? 
a chance to see if i actually have the right permission bits i guess. [13:34:25] cscott: if you are in the deployer group, that should work [13:34:26] (03PS1) 10Slyngshede: Menu: Add menu entry for managers to view pending permission requests. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072547 [13:34:34] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2112.codfw.wmnet [13:34:36] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:34:43] 06SRE, 06Infrastructure-Foundations, 10netops: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#10140791 (10cmooney) [13:34:48] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2112.codfw.wmnet with OS bullseye [13:34:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P69051 and previous config saved to /var/cache/conftool/dbconfig/20240912-133456-ladsgroup.json [13:35:11] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [13:35:15] hashar: you could if you want. this can wait for another day though if you're already doing the other thing [13:35:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2108.codfw.wmnet with reason: host reimage [13:35:43] Script '/srv/mediawiki-staging/php-1.43.0-wmf.22/maintenance/DeleteTag' not found [13:35:44] hehe [13:35:48] I guess it is from an extension? 
[13:35:53] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2111 - akosiaris@cumin1002" [13:35:59] RECOVERY - Hadoop NodeManager on an-worker1122 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:36:06] (03PS5) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) [13:36:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2111 - akosiaris@cumin1002" [13:36:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:13] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2111.codfw.wmnet 61.0.192.10.in-addr.arpa 1.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:36:13] hmm, maybe mwscript is dumber than i thought [13:36:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2111.codfw.wmnet 61.0.192.10.in-addr.arpa 1.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:36:17] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2111 [13:36:20] deleteTag [13:36:21] (03CR) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [13:36:22] initial lowercase [13:36:24] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet 
[13:36:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P69052 and previous config saved to /var/cache/conftool/dbconfig/20240912-133623-arnaudb.json [13:36:25] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:36:26] 14SRE-Sprint-Week-Sustainability-March2023, 06Data-Persistence-SRE, 06DBA, 13Patch-For-Review, 10Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#10140794 (10ABran-WMF) [13:36:27] tried path '/srv/mediawiki-staging/php-1.43.0-wmf.22/maintenance/DeleteTag.php' and class '/srv/mediawiki-staging/php-1\43\0-wmf\22/maintenance/DeleteTag' [13:36:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2111 [13:36:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2111 [13:36:35] MatmaRex: deleteTag.php [13:36:35] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:36:44] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2112 [13:36:50] hashar: deleteTag.php (sorry) [13:36:51] ah it is run.php DeleteTag [13:36:56] cscott: hashar: uppercase worked with run.php for me 🤷‍♂️ [13:37:11] yeah, try deleteTag.php instead [13:37:24] probably mwscript and run.php diverge [13:37:41] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host poolcounter2006.codfw.wmnet with OS bookworm [13:37:41] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host poolcounter2006.codfw.wmnet [13:38:14] !log hashar@deploy1003 Finished scap 
sync-world: Backport for [[gerrit:1071067|Elevate pseudo-namespace MOS to a real namespace on enwiki (T363538)]] (duration: 06m 39s) [13:38:18] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [13:38:19] MatmaRex: https://phabricator.wikimedia.org/T373700#10140801 [13:38:39] cscott: your namespace patch is deployed, so I guess you can run the namespace dupe script from the mwmaint server [13:38:43] hashar: nice. that looks good [13:39:02] thank you! [13:39:06] MatmaRex: \o/ [13:39:14] (03PS2) 10Ladsgroup: admin: Add Philippe Saade to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1072308 (https://phabricator.wikimedia.org/T374008) [13:39:18] (03CR) 10Ladsgroup: [V:03+2 C:03+2] admin: Add Philippe Saade to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1072308 (https://phabricator.wikimedia.org/T374008) (owner: 10Ladsgroup) [13:39:34] for the error log to group0 , that would wait next week I think. I am too busy today :/ [13:39:39] yeah, i can ssh to mwmaint1002, is that equivalent of saying I have the required permission bits? [13:39:46] maybe! [13:40:01] (03PS5) 10Slyngshede: Permission validation: Handle validation for manager approvals better. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 [13:40:17] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2109.codfw.wmnet with reason: host reimage [13:40:31] $ ./modules/admin/data/matrix.py cscott [13:40:31] groups/users cscott [13:40:31] deployment OK [13:40:38] (03PS2) 10Ilias Sarantopoulos: httpbb: add article-models namespace tests for articlequality [puppet] - 10https://gerrit.wikimedia.org/r/1063213 (https://phabricator.wikimedia.org/T360455) [13:40:48] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for EChukwukere-WMF - https://phabricator.wikimedia.org/T374386#10140797 (10Ladsgroup) 05Open→03Resolved a:05eoghan→03Ladsgroup https://ldap.toolforge.org/user/echukwukere [13:41:11] RECOVERY - Hadoop NodeManager on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:41:12] ok, i started a tmux, wish me luck :) [13:41:22] so you can start a screen, use `script T363538.log` to keep a log file of the output, and run the namespace dupes command in it [13:41:29] or tmux :D [13:42:25] also `!log` here the command you are running :] [13:42:25] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:42:32] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2399 to wikikube-worker2113 - akosiaris@cumin1002" [13:42:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2399 to wikikube-worker2113 - akosiaris@cumin1002" [13:42:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:42:38] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host
wikikube-worker2113 [13:42:55] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2109.codfw.wmnet with reason: host reimage [13:42:58] !log Afternoon backport deployments are completed . NamespaceDupe is being run on enwiki for T363538#10140642 [13:43:00] !log mwscript namespaceDupes enwiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-prefix=T363538/ --fix | tee ~/T363538-enwiki-namespaceDupes [13:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:15] (03PS6) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 [13:43:54] 2>&1 ! [13:43:56] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2113 [13:44:12] (03PS5) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [13:44:26] good call [13:44:34] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2399 to wikikube-worker2113 [13:44:39] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:39] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2112.codfw.wmnet 62.0.192.10.in-addr.arpa 2.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:44:41] (03PS7) 10Slyngshede: Permission validation: Handle validation for manager approvals better. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 [13:44:42] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2112.codfw.wmnet 62.0.192.10.in-addr.arpa 2.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:44:43] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2112 [13:44:54] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2112 [13:44:55] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2112 [13:45:48] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2110.codfw.wmnet with reason: host reimage [13:46:03] (03PS6) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [13:46:17] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2113.codfw.wmnet [13:46:34] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2113.codfw.wmnet [13:47:11] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2113.codfw.wmnet [13:47:18] `script` is quite nice since it records the raw terminal data with timing [13:47:25] so you can literally replay the session :) [13:47:27] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2113.codfw.wmnet [13:47:30] but yeah that is hardcore [13:47:35] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2107.codfw.wmnet with OS bullseye [13:47:51] (03PS3) 10Abijeet Patro: Enable message group subscription feature for Test Wikipedia 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) [13:48:42] !log homer lsw1-a3-codfw* commit 'T372878' [13:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:45] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [13:49:18] !log namespaceDupes crashed on MOS:_OVERLINKING, re-running with --add-suffix [13:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:24] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2110.codfw.wmnet with reason: host reimage [13:49:26] (we saw this on Tuesday on aswiki as well) [13:49:32] (03CR) 10Brouberol: rdf-streaming-updater: switch to calico-based network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [13:49:48] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] codfw1dev: enable new vxlan-based subnet CIDR in cloudgw and keystone [puppet] - 10https://gerrit.wikimedia.org/r/1072538 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [13:50:01] !log mwscript namespaceDupes enwiki --source-pseudo-namespace MOS --dest-namespace 126 --move-talk --add-suffix=/T363538 --fix 2>&1 | tee ~/T363538-enwiki-namespaceDupes.take2 [13:50:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T370903)', diff saved to https://phabricator.wikimedia.org/P69054 and previous config saved to /var/cache/conftool/dbconfig/20240912-135003-ladsgroup.json [13:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:08] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:50:39] !log homer cr*codfw* commit 'T372878' [13:50:42] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:07] (03CR) 10Hnowlan: php8.1: add php8.1-uuid to php8.1-cli and cascade (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [13:51:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T367781)', diff saved to https://phabricator.wikimedia.org/P69055 and previous config saved to /var/cache/conftool/dbconfig/20240912-135131-arnaudb.json [13:51:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1209.eqiad.wmnet with reason: Maintenance [13:51:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1209.eqiad.wmnet with reason: Maintenance [13:51:36] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:51:38] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2113.codfw.wmnet on all recursors [13:51:42] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2113.codfw.wmnet on all recursors [13:51:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T367781)', diff saved to https://phabricator.wikimedia.org/P69056 and previous config saved to /var/cache/conftool/dbconfig/20240912-135142-arnaudb.json [13:51:50] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2113.codfw.wmnet [13:52:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [13:52:05] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2113.codfw.wmnet with OS bullseye [13:52:09] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2111.codfw.wmnet with 
reason: host reimage [13:52:12] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [13:52:15] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2113 [13:52:30] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:52:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T367781)', diff saved to https://phabricator.wikimedia.org/P69057 and previous config saved to /var/cache/conftool/dbconfig/20240912-135251-arnaudb.json [13:52:53] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, [13:52:53] /IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:13] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde, ldap/nda for Philippe Saade - https://phabricator.wikimedia.org/T374008#10140864 (10Ladsgroup) 05Stalled→03Resolved https://ldap.toolforge.org/user/philippesaade [13:55:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2111.codfw.wmnet with reason: host reimage [13:55:49] the lsw1-a3 alert is because the hosts are still in reimaging [13:56:14] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2107.codfw.wmnet [13:56:14] I've pushed it as much as I could in doing things in parallel, and well, there are race conditions
alright [13:56:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2107.codfw.wmnet [13:56:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2107.codfw.wmnet [13:56:21] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#10140887 (10ssingh) [13:56:22] !log mwscript cleanupTitles enwiki 2>&1 | tee ~/T363538-enwiki-cleanupTitles [13:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:30] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2108.codfw.wmnet with OS bullseye [13:56:32] (03PS7) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [13:57:44] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 315, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:45] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes mw2390 and mw2394-mw2399 - https://phabricator.wikimedia.org/T374622 (10akosiaris) 03NEW [13:57:46] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#10140885 (10ssingh) Thanks for filing this task! This is indeed something we have discussed in the past but not formally so let's use this task to do th... [13:59:01] (03PS4) 10Ssingh: wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 [14:00:05] denisse and godog: I, the Bot under the Fountain, call upon thee, The Deployer, to do Alert hosts failover to alert2002 deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1400). [14:00:34] oh fun. gl denisse and godog! [14:00:45] hehe thank you sukhe [14:00:48] sukhe: Thanks! 🤞 [14:00:54] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2112.codfw.wmnet with reason: host reimage [14:01:09] the cleanup titles maintenance script is still running on mwmaint1002, i assume that doesn't conflict with the alert hosts stuff? [14:01:23] cscott: that's right yeah, thank you for the heads up tho [14:01:33] !log Enable the alert[12]002 hosts as alertmanagers [14:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:45] !log Enable the alert[12]002 hosts as alertmanagers - T372418 [14:01:47] (03CR) 10Andrea Denisse: [C:03+2] alert: Enable the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1072318 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [14:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:48] T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418 [14:02:35] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2109.codfw.wmnet with OS bullseye [14:02:44] !log Disable meta-monitoring for the alert hosts - T372418 [14:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:07] (03PS3) 10Scott French: php8.1: add php8.1-uuid to php8.1-cli and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) [14:04:19] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2112.codfw.wmnet with reason: host reimage [14:04:25] !log Make alert2002 the active host - T372418 [14:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:04:37] (03CR) 10Andrea Denisse: [C:03+2] alert: Failover from alert1001 to alert2002 [puppet] - 10https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [14:04:44] (03CR) 10Scott French: php8.1: add php8.1-uuid to php8.1-cli and cascade (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [14:05:15] RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:07:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439) (owner: 10Hamish) [14:07:25] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 397, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:07:47] (03CR) 10Andrea Denisse: [C:03+2] alert: Resolve alerts DNS queries to alert2002 [dns] - 10https://gerrit.wikimedia.org/r/1072326 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [14:07:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P69058 and previous config saved to /var/cache/conftool/dbconfig/20240912-140758-arnaudb.json [14:09:12] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2113 - akosiaris@cumin1002" [14:09:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered 
by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2113 - akosiaris@cumin1002" [14:09:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:09:16] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2113.codfw.wmnet 63.0.192.10.in-addr.arpa 3.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:09:17] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:09:19] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2113.codfw.wmnet 63.0.192.10.in-addr.arpa 3.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:09:20] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2113 [14:09:22] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2113 [14:09:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2113 [14:10:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2110.codfw.wmnet with OS bullseye [14:13:36] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:13:54] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:14:34] I can access wikitech static, so potentially an ongoing maintenance fallout? 
[14:15:06] wfm too [14:15:19] yeah likely that [14:15:29] (03CR) 10CI reject: [V:04-1] wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh) [14:15:34] the other alert looks more interesting [14:15:54] the other is almost certainly related to the reimaging of wikikube-worker2110.codfw.wmnet [14:16:01] or wikikube-worker2113 more accurately [14:16:17] (03PS5) 10Ssingh: wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 [14:16:31] ah, ok then, as that looked more "real" [14:17:58] !log sudo cumin "A:dnsbox" "rm /etc/ntp.conf": cleaning up ntpd configuration file to avoid confusion with ntpsec.conf [14:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:38] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2109.codfw.wmnet [14:18:41] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2109.codfw.wmnet [14:18:41] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2109.codfw.wmnet [14:20:03] (03PS1) 10Hnowlan: shellbox-video: add process-based readiness check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) [14:21:50] (03CR) 10JHathaway: [C:03+1] Only run puppetserver spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1072505 (owner: 10Muehlenhoff) [14:23:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P69059 and previous config saved to /var/cache/conftool/dbconfig/20240912-142306-arnaudb.json [14:23:33] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2112.codfw.wmnet with OS bullseye [14:23:56] FIRING: [2x] RoutinatorRsyncErrors: 
Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:25:34] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2113.codfw.wmnet with reason: host reimage [14:27:56] (03PS1) 10EoghanGaffney: lists: Switch from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1072551 [14:28:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2113.codfw.wmnet with reason: host reimage [14:28:24] (03CR) 10JHathaway: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [14:28:25] (03PS1) 10Slyngshede: Allow users to see rejected requests for permissions. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072552 [14:29:10] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1072551 (owner: 10EoghanGaffney) [14:29:11] (03CR) 10CI reject: [V:04-1] wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh) [14:29:37] (03CR) 10Hnowlan: [C:03+1] php8.1: add php8.1-uuid to php8.1-cli and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [14:30:07] !log cleanupTitles on enwiki complete (T363538) [14:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:13] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [14:30:42] 10ops-codfw, 06SRE, 06DC-Ops, 
10decommission-hardware, 06serviceops: decommission kafka-main2004.codfw.wmnet - https://phabricator.wikimedia.org/T374594#10141048 (10Jhancock.wm) a:03Jhancock.wm [14:30:43] 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10141051 (10MBinder_WMF) Thanks, both. I tried to ssh into phab1004.eqiad.wmnet with bast1003.wikimedia.org in the config file, and got the same issue.... [14:30:59] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main2004.codfw.wmnet - https://phabricator.wikimedia.org/T374594#10141054 (10Jhancock.wm) 05Open→03Resolved [14:31:16] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542#10141040 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:32:02] win 26 [14:32:10] oops [14:33:25] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10141070 (10Jhancock.wm) part arriving today. will update when swapped. [14:33:42] cleanup titles took just over 30min to complete on enwiki, as Lucas_WMDE predicted [14:33:58] (03PS6) 10Ssingh: wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 [14:34:38] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374422#10141072 (10Jhancock.wm) part should be arriving some time today. we can schedule down time for the server to get it swapped when ready. [14:37:57] cscott: sorry, I was at a department offsite all day today. glad to hear it worked out! 
[14:38:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T367781)', diff saved to https://phabricator.wikimedia.org/P69060 and previous config saved to /var/cache/conftool/dbconfig/20240912-143813-arnaudb.json [14:38:18] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:38:29] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:38:45] (03PS6) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) [14:38:49] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:39:09] (03CR) 10Bking: rdf-streaming-updater: switch to calico-based network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [14:39:12] FIRING: [2x] JobUnavailable: Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:37] (03PS1) 10Elukey: sre.hosts.provision: refactor _config_dell_pxe() [cookbooks] - 10https://gerrit.wikimedia.org/r/1072553 (https://phabricator.wikimedia.org/T365372) [14:42:41] Lucas_WMDE: no worries, thanks for writing such a clean postmortem for me to follow when I had to do it myself! 
[14:42:49] * cscott was very nervous [14:43:49] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes mw2390 and mw2394-mw2399 - https://phabricator.wikimedia.org/T374622#10141130 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:45:01] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:45:05] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10141140 (10phaultfinder) [14:45:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [14:47:30] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10141141 (10elukey) >>! In T365372#10140725, @elukey wrote: > Nasty issue found for sretest2001: T365167#10140713 >... 
[14:47:34] (03CR) 10Nikerabbit: [C:03+1] Enable message group subscription feature for Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:47:46] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2113.codfw.wmnet with OS bullseye [14:48:11] (03CR) 10Arnaudb: [C:04-1] "temporary -1 to reduce in progress" [puppet] - 10https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) (owner: 10Arnaudb) [14:48:19] (03PS3) 10Muehlenhoff: puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) [14:50:59] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 24.11 ms [14:52:33] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:52:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [14:52:43] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:52:50] (03CR) 10JHathaway: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [14:55:29] (03PS1) 10Hamish: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072556 (https://phabricator.wikimedia.org/T374621) [14:55:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072227 (https://phabricator.wikimedia.org/T374484) (owner: 10Superzerocool) 
[14:57:22] (03PS7) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) [14:57:42] (03CR) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [14:59:15] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:59:26] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:59:45] (03CR) 10CI reject: [V:04-1] puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [15:00:04] dduvall and dancy: Your horoscope predicts another Train log triage deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1500). 
[15:00:29] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:00:32] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:03:30] 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 2 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10141209 (10elukey) 05Open→03Resolved [15:03:55] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:23] (03PS8) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) [15:06:04] (03CR) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [15:09:04] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [15:11:03] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:11:07] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:11:34] (03PS8) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [15:12:38] (03PS4) 10Muehlenhoff: 
puppetmaster::frontend: Read the server used for puppet-merge from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) [15:14:39] (03PS5) 10Muehlenhoff: puppetmaster::frontend|backend: Read the puppet-merge server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) [15:15:18] (03CR) 10Scott French: "Thank you both for the reviews!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French) [15:15:21] (03CR) 10Scott French: [C:03+2] sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French) [15:16:23] (03CR) 10Bking: [C:03+1] Enable the performace CPU governor on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072529 (https://phabricator.wikimedia.org/T365878) (owner: 10Btullis) [15:17:06] (03CR) 10Clément Goubert: [C:04-1] shellbox-video: add process-based readiness check (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [15:18:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [15:20:23] !log zabe@mwmaint1002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php {fawikiquote,fawikisource,fawiktionary} --skip /home/zabe/text_table_cleanup/{fawikiquote,fawikisource,fawiktionary} --dump /home/zabe/text_table_dump/{fawikiquote,fawikisource,fawiktionary} --sleep 1 # T183490 [15:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:27] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [15:22:58] (03PS3) 10Hashar: tox: 
only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) [15:23:39] (03CR) 10Hashar: "Rebased to clear a trivial conflict with I75c226a7ed1b0dc91b488ed92242ba5c7da84cac" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [15:24:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. Alternatively do it first only for 2001 via a host Hiera entry." [puppet] - 10https://gerrit.wikimedia.org/r/1072551 (owner: 10EoghanGaffney) [15:26:21] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns2006.wikimedia.org [reason: T373102 codfw maintenance] [15:26:25] T373102: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102 [15:27:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10141318 (10jcrespo) I've stopped codfw media backups. @cmooney Would it be possible to get preferential time on maintenance... 
[15:27:41] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French) [15:29:05] !log Depooling kubernetes2044.codfw.wmnet kubernetes2045.codfw.wmnet - T373102 [15:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:47] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2044.codfw.wmnet [15:30:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2044.codfw.wmnet [15:30:25] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2045.codfw.wmnet [15:30:49] (03CR) 10JHathaway: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [15:33:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2045.codfw.wmnet [15:33:44] (03PS9) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) [15:34:18] (03PS2) 10Hnowlan: shellbox-video: add process-based readiness check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) [15:37:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [15:37:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2128 db2151 db2170 db2171 db2211 db2212 es2033 es2034 es2039 pc2014 db2209 - T370852', diff saved to https://phabricator.wikimedia.org/P69062 and previous config saved to /var/cache/conftool/dbconfig/20240912-153720-arnaudb.json [15:37:24] 
T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [15:37:28] (03PS3) 10Hnowlan: shellbox-video: add process-based readiness check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) [15:37:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:45:00 on 11 hosts with reason: network maintenance T373101 [15:37:45] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [15:38:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on 11 hosts with reason: network maintenance T373101 [15:38:41] (03CR) 10Muehlenhoff: puppetserver: Pass the value of puppet_merge_server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [15:38:55] (03CR) 10Hnowlan: shellbox-video: add process-based readiness check (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [15:39:41] (03CR) 10Jdlrobson: [C:03+1] Fix night mode excepted Wikidata namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072488 (owner: 10Ebrahim) [15:40:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool es2034 which was perceived master for es3 - T370852', diff saved to https://phabricator.wikimedia.org/P69063 and previous config saved to /var/cache/conftool/dbconfig/20240912-154008-arnaudb.json [15:40:50] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10141367 (10ABran-WMF) @cmooney all nodes have been depooled [15:42:39] !log depooling ms-fe2012 moss-fe2002 & thanos-fe2003 — T373102 [15:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:43] 
T373102: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102 [15:45:07] (03PS1) 10Jdlrobson: Dark mode: Make LiquidThreads namespace explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562 [15:45:48] (03CR) 10Jdlrobson: [C:04-1] "I think https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1072562 makes this a lot cleaner (am mostly worried that if Liquid" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072487 (owner: 10Ebrahim) [15:46:02] (03CR) 10Jdlrobson: [C:03+1] "Thanks. I'd overlooked $wmgLiquidThreadsFrozen :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [15:47:57] !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None [15:48:00] !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [15:48:32] (03CR) 10JHathaway: [C:03+1] puppetserver: Pass the value of puppet_merge_server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [15:48:49] !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None [15:48:52] !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [15:48:59] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:49:34] (03CR) 10JHathaway: [C:03+1] puppetmaster::frontend|backend: Read the puppet-merge server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [15:49:53] !log 
cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on 21 hosts with reason: Move server uplinks codfw racks D1 [15:49:54] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 0:20:00 on 21 hosts with reason: Move server uplinks codfw racks D1 [15:50:01] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 21 hosts with reason: Move server uplinks codfw racks D1 [15:50:13] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:50:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:50:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 21 hosts with reason: Move server uplinks codfw racks D1 [15:50:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10141418 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=bb570977-8737-4373-95ac-3765685f6e5e) set by cmoon... [15:50:53] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 21 hosts with reason: Move server uplinks codfw racks D2 [15:51:30] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 21 hosts with reason: Move server uplinks codfw racks D2 [15:51:38] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10141420 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5073d83c-c18b-41a0-aa78-a6da63b209f9) set by cmoon... 
[15:51:49] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:53:45] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:56:12] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2111.codfw.wmnet [15:56:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2111.codfw.wmnet [15:56:15] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2111.codfw.wmnet [15:57:11] (03CR) 10Clément Goubert: [C:03+1] "LGTM, let's see what happens" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [15:57:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:58:05] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:58:25] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:58:33] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2112.codfw.wmnet [15:58:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2112.codfw.wmnet [15:58:37] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node 
(exit_code=1) Renumbering for host wikikube-worker2112.codfw.wmnet [15:59:48] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2113.codfw.wmnet [15:59:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2113.codfw.wmnet [15:59:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2113.codfw.wmnet [16:00:05] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:08] !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None [16:00:12] !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [16:00:50] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2110.codfw.wmnet [16:00:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2110.codfw.wmnet [16:00:54] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2110.codfw.wmnet [16:01:26] !log move server uplinks in codfw rack D1 from asw-d1-codfw to lsw1-d1-codfw T373102 [16:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:30] (03CR) 10Hnowlan: [C:03+2] shellbox-video: add process-based readiness check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [16:01:34] T373102: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102 [16:02:58] (03CR) 10Volans: [C:03+1] "LGTM" 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh) [16:03:06] (03Merged) 10jenkins-bot: shellbox-video: add process-based readiness check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072549 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [16:04:27] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1072553 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:07:09] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns2006.wikimedia.org [16:07:10] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2006.wikimedia.org [16:07:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [16:08:21] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [16:08:34] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [16:09:22] !log restart ms-backup200[12] after maintenance and upgrade [16:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:23] PROBLEM - Host ms-backup2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:12:01] RECOVERY - Host ms-backup2001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [16:12:12] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [16:12:15] that's me, apparently my downtime didn't go through [16:12:38] nothing to see [16:12:42] it was a normal reboot [16:12:56] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [16:13:34] jynus: it's a side effect of the way icinga works, I have the same issue :D [16:13:52] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply 
[16:14:20] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [16:14:35] (03CR) 10Ssingh: [C:03+2] wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh) [16:17:46] (03PS1) 10Fabfur: Revert "hiera: continue haproxykafka tests on cp4037" [puppet] - 10https://gerrit.wikimedia.org/r/1072565 [16:18:11] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2006.wikimedia.org [reason: [end] T373102 codfw maintenance] [16:18:14] T373102: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102 [16:18:40] (03CR) 10Fabfur: [C:03+2] Revert "hiera: continue haproxykafka tests on cp4037" [puppet] - 10https://gerrit.wikimedia.org/r/1072565 (owner: 10Fabfur) [16:19:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10141576 (10cmooney) Everything moved successfully, all ports up on the new switch and everything responding to ping again. 
[16:19:16] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [16:19:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69064 and previous config saved to /var/cache/conftool/dbconfig/20240912-161916-arnaudb.json [16:19:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69065 and previous config saved to /var/cache/conftool/dbconfig/20240912-161922-arnaudb.json [16:19:24] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [16:19:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69066 and previous config saved to /var/cache/conftool/dbconfig/20240912-161927-arnaudb.json [16:19:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69067 and previous config saved to /var/cache/conftool/dbconfig/20240912-161932-arnaudb.json [16:19:34] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [16:19:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69068 and previous config saved to /var/cache/conftool/dbconfig/20240912-161937-arnaudb.json [16:19:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69069 and previous config saved to /var/cache/conftool/dbconfig/20240912-161942-arnaudb.json [16:19:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69070 and previous config saved to /var/cache/conftool/dbconfig/20240912-161947-arnaudb.json [16:19:53] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69071 and previous config saved to /var/cache/conftool/dbconfig/20240912-161952-arnaudb.json [16:19:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69072 and previous config saved to /var/cache/conftool/dbconfig/20240912-161957-arnaudb.json [16:20:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P69073 and previous config saved to /var/cache/conftool/dbconfig/20240912-162007-arnaudb.json [16:21:18] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [16:23:00] (03PS1) 10JHathaway: haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) [16:23:21] (03CR) 10CI reject: [V:04-1] haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [16:23:51] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [16:24:07] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10141606 (10VRiley-WMF) 05Open→03Resolved @andrea.denisse This drive has been replaced Please let us know if there are any other issues with this unit. 
[16:24:07] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [16:24:22] (03PS2) 10JHathaway: haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) [16:25:09] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10141608 (10phaultfinder) [16:26:52] (03CR) 10CI reject: [V:04-1] haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [16:27:24] (03CR) 10Tacsipacsi: "Could this depend on I10e1b24eba946452ba2e18bef67d8a8205fd2e24? At the moment, it doesn’t look like there will be any backward-incompatibl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [16:28:29] (03PS1) 10Ssingh: sre.dns.admin: use set_and_verify for confctl update [cookbooks] - 10https://gerrit.wikimedia.org/r/1072569 [16:28:59] (03PS3) 10JHathaway: haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) [16:30:38] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10141626 (10cmooney) Will re-schedule for Tuesday Sep 17th [16:30:46] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: add cookbook for GeoDNS pool/depool (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [16:30:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10141621 (10cmooney) 05Open→03Resolved a:03cmooney [16:31:03] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2044.codfw.wmnet [16:31:05] !log Repooling 
kubernetes2044.codfw.wmnet kubernetes2045.codfw.wmnet - T373102 [16:31:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2044.codfw.wmnet [16:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:09] T373102: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102 [16:31:10] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2045.codfw.wmnet [16:31:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2045.codfw.wmnet [16:32:15] !log pooling ms-fe2012 moss-fe2002 & thanos-fe2003 — T373102 [16:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:09] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [16:33:59] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1072569 (owner: 10Ssingh) [16:34:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69075 and previous config saved to /var/cache/conftool/dbconfig/20240912-163422-arnaudb.json [16:34:26] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [16:34:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69076 and previous config saved to /var/cache/conftool/dbconfig/20240912-163427-arnaudb.json [16:34:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69077 and previous config saved to /var/cache/conftool/dbconfig/20240912-163433-arnaudb.json [16:34:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 
(re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69078 and previous config saved to /var/cache/conftool/dbconfig/20240912-163438-arnaudb.json [16:34:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69079 and previous config saved to /var/cache/conftool/dbconfig/20240912-163443-arnaudb.json [16:34:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69080 and previous config saved to /var/cache/conftool/dbconfig/20240912-163448-arnaudb.json [16:34:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69081 and previous config saved to /var/cache/conftool/dbconfig/20240912-163453-arnaudb.json [16:34:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69082 and previous config saved to /var/cache/conftool/dbconfig/20240912-163458-arnaudb.json [16:35:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69083 and previous config saved to /var/cache/conftool/dbconfig/20240912-163503-arnaudb.json [16:35:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P69084 and previous config saved to /var/cache/conftool/dbconfig/20240912-163513-arnaudb.json [16:36:10] (03PS4) 10Kgraessle: Enable AutoModerator on ukwik [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) [16:36:22] !log disable ports for now unused ports on asw-d1-codfw and asw-d2-codfw T373102 [16:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:25] T373102: Migrate servers in codfw racks D1 & D2 from 
asw to lsw - https://phabricator.wikimedia.org/T373102 [16:36:40] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374642 (10ops-monitoring-bot) 03NEW [16:37:52] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563#10141677 (10akosiaris) [16:37:55] (03CR) 10Volans: [C:03+1] "conftool has been updated in production, no more blockers" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055882 (https://phabricator.wikimedia.org/T362893) (owner: 10Giuseppe Lavagetto) [16:39:12] (03PS1) 10Hnowlan: shellbox-video: use correct command in process check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072571 (https://phabricator.wikimedia.org/T373517) [16:39:48] RECOVERY - dump of s8 in eqiad on backupmon1001 is OK: Last dump for s8 at eqiad (db1171) taken on 2024-09-12 09:09:43 (267 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:42:24] (03PS5) 10Kgraessle: Enable AutoModerator on ukwik [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) [16:43:22] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 07Kubernetes: "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563#10141701 (10akosiaris) We are seeing this as well on WikiKube nodes. ` 2024-09-12T15:20:55.176734+0... [16:43:52] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10141703 (10wiki_willy) It looks like it'll be 3 drives minimum from the latest email today, and @Jclark-ctr - you c... 
[16:44:20] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:44:38] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:45:12] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52630 bytes in 0.369 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:45:30] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:45:31] (03CR) 10Clément Goubert: [C:03+1] shellbox-video: use correct command in process check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072571 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [16:46:12] (03CR) 10Hnowlan: [C:03+2] shellbox-video: use correct command in process check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072571 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [16:47:10] (03Merged) 10jenkins-bot: shellbox-video: use correct command in process check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072571 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [16:48:42] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:49:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69085 and previous config saved to /var/cache/conftool/dbconfig/20240912-164927-arnaudb.json [16:49:32] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [16:49:32] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.305 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:49:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69086 and previous config saved to /var/cache/conftool/dbconfig/20240912-164933-arnaudb.json [16:49:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69087 and previous config saved to /var/cache/conftool/dbconfig/20240912-164938-arnaudb.json [16:49:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69088 and previous config saved to /var/cache/conftool/dbconfig/20240912-164943-arnaudb.json [16:49:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69089 and previous config saved to /var/cache/conftool/dbconfig/20240912-164948-arnaudb.json [16:49:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69090 and previous config saved to /var/cache/conftool/dbconfig/20240912-164953-arnaudb.json [16:49:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69091 and previous config saved to /var/cache/conftool/dbconfig/20240912-164959-arnaudb.json [16:50:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69092 and previous config saved to /var/cache/conftool/dbconfig/20240912-165003-arnaudb.json [16:50:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69093 and previous config saved to /var/cache/conftool/dbconfig/20240912-165009-arnaudb.json [16:50:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 
'db2209 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P69094 and previous config saved to /var/cache/conftool/dbconfig/20240912-165018-arnaudb.json [16:51:44] !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@d045bb2]: (no justification provided) [16:52:14] !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@d045bb2]: (no justification provided) (duration: 00m 30s) [16:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:57:21] (03PS1) 10Gerrit maintenance bot: Add mos to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1072573 (https://phabricator.wikimedia.org/T374641) [16:58:03] (03PS3) 10Bking: flink-app: create a new label for selecting Calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [16:58:46] (03CR) 10Jsn.sherman: "Some style comments inline, but otherwise this looks good to go. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle) [16:59:58] (03PS6) 10Kgraessle: Enable AutoModerator on ukwik [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) [17:00:05] bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1700). [17:00:05] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1700). 
[17:00:41] (03CR) 10Kgraessle: Enable AutoModerator on ukwik (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle) [17:02:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [17:02:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle) [17:02:34] here o/ [17:03:05] will start work shortly - just getting some other items into a pause-able state [17:04:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69096 and previous config saved to /var/cache/conftool/dbconfig/20240912-170433-arnaudb.json [17:04:39] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [17:04:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69097 and previous config saved to /var/cache/conftool/dbconfig/20240912-170439-arnaudb.json [17:04:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69098 and previous config saved to /var/cache/conftool/dbconfig/20240912-170444-arnaudb.json [17:04:50] (03PS4) 10Bking: flink-app: create a new label for selecting Calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [17:04:51] 
!log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69099 and previous config saved to /var/cache/conftool/dbconfig/20240912-170449-arnaudb.json [17:04:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69100 and previous config saved to /var/cache/conftool/dbconfig/20240912-170453-arnaudb.json [17:05:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69101 and previous config saved to /var/cache/conftool/dbconfig/20240912-170459-arnaudb.json [17:05:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69102 and previous config saved to /var/cache/conftool/dbconfig/20240912-170504-arnaudb.json [17:05:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69103 and previous config saved to /var/cache/conftool/dbconfig/20240912-170509-arnaudb.json [17:05:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69104 and previous config saved to /var/cache/conftool/dbconfig/20240912-170514-arnaudb.json [17:05:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'pc2014 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69105 and previous config saved to /var/cache/conftool/dbconfig/20240912-170524-arnaudb.json [17:05:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P69106 and previous config saved to /var/cache/conftool/dbconfig/20240912-170524-arnaudb.json [17:07:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) 
- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [17:08:51] (03CR) 10Scott French: [C:03+2] mw-debug: add initial "next" release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071945 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:09:27] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: use set_and_verify for confctl update [cookbooks] - 10https://gerrit.wikimedia.org/r/1072569 (owner: 10Ssingh) [17:09:50] (03Merged) 10jenkins-bot: mw-debug: add initial "next" release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071945 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:10:03] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#10141906 (10Eevans) We've assumed that 1013 & 1014 are both impacted by the same issue (or I have, at least), but that might not be a safe assumption; I'd like to try reimaging this one as well.... [17:10:25] nothing for me to deploy in my window today [17:11:27] (03PS1) 10Fabfur: cache:haproxy: hardcode $schema field [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668) [17:11:32] !log isaranto@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[17:14:23] !log restarting db1171:s8 mysql process T374610 [17:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:27] T374610: db1171:s8 is having performance issues and lagging - https://phabricator.wikimedia.org/T374610 [17:15:31] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:16:43] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:16:46] (03PS3) 10BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) [17:16:49] (03PS1) 10BCornwall: varnish: Consolidate analytics subroutines [puppet] - 10https://gerrit.wikimedia.org/r/1070688 (https://phabricator.wikimedia.org/T370200) [17:20:08] (03PS1) 10Scott French: Revert "mw-debug: add initial "next" release" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072578 (https://phabricator.wikimedia.org/T372604) [17:21:40] (03CR) 10Scott French: [C:03+2] Revert "mw-debug: add initial "next" release" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072578 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:22:15] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#10141985 (10Eevans) ` eevans@aqs1014:~$ sudo lshw -class disk *-disk:0 description: ATA Disk product: HFS1T9G32FEH-BA1 physical id: 0 bus info: scs... 
[17:22:45] (03Merged) 10jenkins-bot: Revert "mw-debug: add initial "next" release" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072578 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:25:59] all done on my end [17:26:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10142013 (10phaultfinder) [17:27:53] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1014.eqiad.wmnet with reason: SSD device troubleshooting [17:28:09] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1014.eqiad.wmnet with reason: SSD device troubleshooting [17:28:20] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#10142025 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b21f43cd-a8ba-456c-8d9c-3c6cd91457e5) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with... [17:28:49] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10142033 (10wiki_willy) a:03RobH [17:33:20] FIRING: [4x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10142054 (10cmooney) So once we have completed the move for D4 next Tuesday I have a (hopefully) small request. Could the sretest2002 uplinks... 
[17:33:58] (03PS2) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [17:35:32] (03PS3) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [17:35:39] (03CR) 10BCornwall: "Manual patches are still fine, so long as the domain exists in markmonitor. I would also like for this functionality and a report exists a" [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [17:37:58] (03Abandoned) 10Ebrahim: Make LQT night mode exceptions explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072487 (owner: 10Ebrahim) [17:38:22] (03CR) 10Ebrahim: "That looks fantastic indeed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072487 (owner: 10Ebrahim) [17:40:30] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [17:42:49] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#10142066 (10VRiley-WMF) @Eevans the drives that were not listed in the group have been replaced. Please let us know if anything else is needed. 
[17:46:59] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1072586 [17:49:43] FIRING: [4x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:09] jouncebot: nowandnext [17:50:10] For the next 0 hour(s) and 9 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1700) [17:50:10] For the next 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1700) [17:50:10] In 0 hour(s) and 9 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1800) [17:50:36] RESOLVED: [4x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:52:15] (03CR) 10Ladsgroup: [C:03+2] Add mos to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1072573 (https://phabricator.wikimedia.org/T374641) (owner: 10Gerrit maintenance bot) [17:53:55] ACKNOWLEDGEMENT - MD RAID on aqs1014 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 12, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T374652 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:54:00] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T374652 (10ops-monitoring-bot) 03NEW [17:54:42] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - 
https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [17:55:39] (03CR) 10Jsn.sherman: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle) [18:00:05] dduvall and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1800). [18:06:51] (03PS5) 10Bking: flink-app: create a new label for selecting Calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [18:06:59] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072587 (https://phabricator.wikimedia.org/T373641) [18:07:01] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072587 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot) [18:07:36] PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:07:43] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072587 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot) [18:13:36] RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:18:04] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.22 refs T373641 [18:18:08] T373641: 1.43.0-wmf.22 deployment 
blockers - https://phabricator.wikimedia.org/T373641 [18:18:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10142127 (10Jhancock.wm) yeah can do [18:20:23] !log ran systemctl reset-failed mediawiki_job_MachineVision_prioritize_uncategorized.service on mwmaint1002 to clear failed state for turned down job - T352884 [18:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:27] T352884: Undeploy and archive the MachineVision extension - https://phabricator.wikimedia.org/T352884 [18:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:29:33] (03CR) 10AOkoth: [C:03+2] vrts: swap replica to new host [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [18:33:45] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Migration [18:34:01] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Migration [18:35:48] (03PS1) 10BCornwall: wip: Remove rsa support [puppet] - 10https://gerrit.wikimedia.org/r/1072590 [18:46:52] (03PS2) 10BCornwall: Remove RSA certificate support [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [18:50:10] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10142189 (10phaultfinder) [18:54:22] (03PS7) 10Bartosz Dziewoński: Enable AutoModerator on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle) [18:55:08] (03CR) 10Bartosz 
Dziewoński: "(Fixed typo)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle) [18:57:13] (03CR) 10Scott French: [V:03+2 C:03+2] "Thanks, Hugh!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [19:03:49] !log rebuilt php8.1 production images to pick up php-uuid - T372602 [19:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:53] T372602: Prepare PHP 8.1 production images - https://phabricator.wikimedia.org/T372602 [19:03:55] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:09:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) (owner: 10Bartosz Dziewoński) [19:09:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711 (owner: 10Bartosz Dziewoński) [19:09:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728 (owner: 10Bartosz Dziewoński) [19:10:51] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Drop PSON support - https://phabricator.wikimedia.org/T372667#10142209 (10jhathaway) [19:11:29] jouncebot: nowandnext 
[19:11:29] For the next 0 hour(s) and 48 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T1800) [19:11:29] In 0 hour(s) and 48 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T2000) [19:12:36] (03CR) 10Vgutierrez: "don't forget to remove wikiworkshop's RSA certificate as well" [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:19:07] RECOVERY - Host gerrit1004 is UP: PING WARNING - Packet loss = 33%, RTA = 1.32 ms [19:23:53] (03PS9) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [19:24:30] (03CR) 10Dzahn: [C:03+1] "lgtm, nitpick: update topic to say that it's not the active host yet" [puppet] - 10https://gerrit.wikimedia.org/r/1072551 (owner: 10EoghanGaffney) [19:24:51] (03CR) 10CI reject: [V:04-1] rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [19:25:31] PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:31:56] (03PS10) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [19:33:57] (03PS11) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [19:41:16] (03PS1) 10JHathaway: puppet8: ensure kerberos keytab type is binary [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) [19:41:36] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072593 
(https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [19:47:34] (03PS2) 10Fabfur: cache:haproxy: hardcode $schema field [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668) [19:48:55] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:52:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [19:55:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [19:55:28] (03PS3) 10BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [19:55:48] (03CR) 10CI reject: [V:04-1] varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:56:11] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [19:57:47] (03PS3) 10Fabfur: cache:haproxy: hardcode $schema field [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668) [19:57:54] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on vrts2002.codfw.wmnet with reason: Migration [19:57:59] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on vrts2002.codfw.wmnet with reason: Migration [20:00:05] 
RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240912T2000). [20:00:05] Hamishcz, Superzerocool, katherine_g, MatmaRex, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:52] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [20:00:54] here [20:00:59] yes [20:01:11] hi - i can deploy [20:01:26] i'll do Hamishcz's patch first [20:02:02] (03PS3) 10Hamish: u4cwiki: create case and case_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439) [20:02:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439) (owner: 10Hamish) [20:03:24] (03Merged) 10jenkins-bot: u4cwiki: create case and case_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439) (owner: 10Hamish) [20:03:30] (hi) [20:03:35] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072204|u4cwiki: create case and case_talk namespaces (T374439)]] [20:03:37] cjming, For test, as I cannot access u4cwiki, I cannot do a real test but the code is wonderful IMO [20:03:40] T374439: Create case and case_talk namespaces in u4cwiki - https://phabricator.wikimedia.org/T374439 [20:03:59] Hamishcz: np - i'll sync and run the namespace dupes script on it [20:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 15.59% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:04:25] (03PS4) 10BCornwall: varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [20:04:30] sure thanks a lot [20:04:38] np! [20:04:47] (03CR) 10CI reject: [V:04-1] varnish: Occasional RSA cert connection warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:05:30] o/ [20:06:23] !log cjming@deploy1003 hamishz, cjming: Backport for [[gerrit:1072204|u4cwiki: create case and case_talk namespaces (T374439)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:06:38] !log cjming@deploy1003 hamishz, cjming: Continuing with sync [20:07:09] Superzerocool: are you around? [20:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 15.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:10:37] katherine_g: i'll do yours next [20:10:45] (03PS1) 10Bking: rdf-streaming-updater: trigger a savepoint before firewall changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072597 [20:10:54] sounds good [20:10:59] (03PS8) 10Bartosz Dziewoński: Enable AutoModerator on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle) [20:11:11] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072204|u4cwiki: create case and case_talk namespaces (T374439)]] (duration: 07m 36s) [20:11:15] T374439: Create case 
and case_talk namespaces in u4cwiki - https://phabricator.wikimedia.org/T374439 [20:11:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle) [20:12:16] Hamishcz: your change should be live [20:12:30] (03Merged) 10jenkins-bot: Enable AutoModerator on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) (owner: 10Kgraessle) [20:12:44] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1071223|Enable AutoModerator on ukwiki (T373823)]] [20:12:47] T373823: Enable AutoModerator on ukwiki - https://phabricator.wikimedia.org/T373823 [20:12:53] if we get to my patches today, you can do them all at once and without testing on mwdebug – they are all only removing completely unused config variables, checked in codesearch [20:13:22] MatmaRex: sounds good and will do [20:14:23] i'm good to sync [20:14:28] cjming, yeh I confirmed its live status in repo, but I cannot really see it, will contact someone to confirm, However it's a easy code so basically no problem [20:14:38] !log cjming@deploy1003 kgraessle, cjming: Backport for [[gerrit:1071223|Enable AutoModerator on ukwiki (T373823)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:14:40] appreciate [20:14:43] np! [20:14:47] !log cjming@deploy1003 kgraessle, cjming: Continuing with sync [20:14:54] thanks! [20:14:58] yw! 
[20:15:17] (03PS4) 10Bartosz Dziewoński: Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) [20:15:22] (03PS2) 10Bartosz Dziewoński: Remove unused $wmgPoweredByMediaWikiIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711 [20:15:30] (03PS2) 10Bartosz Dziewoński: Remove unused settings removed in T339959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728 [20:19:45] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071223|Enable AutoModerator on ukwiki (T373823)]] (duration: 07m 01s) [20:19:49] T373823: Enable AutoModerator on ukwiki - https://phabricator.wikimedia.org/T373823 [20:19:53] (03CR) 10Clare Ming: [C:03+2] Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) (owner: 10Bartosz Dziewoński) [20:20:39] (03Merged) 10jenkins-bot: Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) (owner: 10Bartosz Dziewoński) [20:20:51] (03PS3) 10Bartosz Dziewoński: Remove unused $wmgPoweredByMediaWikiIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711 [20:22:21] (03CR) 10Clare Ming: [C:03+2] Remove unused $wmgPoweredByMediaWikiIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711 (owner: 10Bartosz Dziewoński) [20:23:04] (03Merged) 10jenkins-bot: Remove unused $wmgPoweredByMediaWikiIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071711 (owner: 10Bartosz Dziewoński) [20:23:04] katherine_g: your patch should be live! [20:23:20] (03PS3) 10Bartosz Dziewoński: Remove unused settings removed in T339959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728 [20:23:34] k looks good! 
[20:24:24] (03CR) 10Clare Ming: [C:03+2] Remove unused settings removed in T339959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728 (owner: 10Bartosz Dziewoński) [20:25:16] (03Merged) 10jenkins-bot: Remove unused settings removed in T339959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071728 (owner: 10Bartosz Dziewoński) [20:25:39] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1065299|Remove unused $wgAllowRequiringEmailForResets (T242406)]], [[gerrit:1071711|Remove unused $wmgPoweredByMediaWikiIcon]], [[gerrit:1071728|Remove unused settings removed in T339959]] [20:25:44] T242406: Remove $wgAllowRequiringEmailForResets feature flag [small] - https://phabricator.wikimedia.org/T242406 [20:25:44] T339959: Reduce CentralAuth complexity by removing unused settings - https://phabricator.wikimedia.org/T339959 [20:28:22] !log cjming@deploy1003 matmarex, cjming: Backport for [[gerrit:1065299|Remove unused $wgAllowRequiringEmailForResets (T242406)]], [[gerrit:1071711|Remove unused $wmgPoweredByMediaWikiIcon]], [[gerrit:1071728|Remove unused settings removed in T339959]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:28:26] !log cjming@deploy1003 matmarex, cjming: Continuing with sync [20:29:45] (03PS2) 10Bking: rdf-streaming-updater: trigger a savepoint before firewall changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072597 (https://phabricator.wikimedia.org/T373195) [20:30:10] (03PS1) 10Ebrahim: Remove ProofreadPage exceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 [20:30:20] (03PS21) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) [20:31:09] (03CR) 10Ebrahim: "Just FYI that the extension is getting fixed also." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim) [20:32:05] oh God, I'm so late for deploy :( [20:32:49] Superzerocool: no worries! good timing actually - i can do yours here shortly [20:32:58] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1065299|Remove unused $wgAllowRequiringEmailForResets (T242406)]], [[gerrit:1071711|Remove unused $wmgPoweredByMediaWikiIcon]], [[gerrit:1071728|Remove unused settings removed in T339959]] (duration: 07m 19s) [20:33:03] T242406: Remove $wgAllowRequiringEmailForResets feature flag [small] - https://phabricator.wikimedia.org/T242406 [20:33:03] T339959: Reduce CentralAuth complexity by removing unused settings - https://phabricator.wikimedia.org/T339959 [20:33:08] MatmaRex: all your patches should be live! [20:33:17] Jdlrobson: i'll do yours next [20:33:20] cjming: thank you! [20:33:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [20:33:34] yw! 
[20:33:37] thanks @cjming :) [20:34:06] (03Merged) 10jenkins-bot: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [20:34:19] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1063763|Enable the dark mode in Portal namespace (T366380)]] [20:34:24] T366380: Enable portal pages in night theme - https://phabricator.wikimedia.org/T366380 [20:34:25] thanks cjming [20:34:49] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:35:03] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:36:20] !log cjming@deploy1003 ebrahim, cjming: Backport for [[gerrit:1063763|Enable the dark mode in Portal namespace (T366380)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:36:24] Jdlrobson: ready to test - lmk if i should sync [20:38:01] cjming: on it [20:38:09] (03PS4) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [20:38:34] LGTM cjming please sync! [20:38:40] !log cjming@deploy1003 ebrahim, cjming: Continuing with sync [20:38:44] (03PS12) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [20:39:08] (03PS2) 10Superzerocool: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072227 (https://phabricator.wikimedia.org/T374484) [20:39:44] (03CR) 10Jdlrobson: "Nice!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim) [20:40:03] (03PS6) 10Bking: flink-app: customize calico label selector Calico network policies default to matching on "app" label and chartName value, but the flink-kubernetes-operator sets the app label to chartName-release instead. Ref https://lists.apache.org/thread/dont796lp84vfqnovolryw0y0470mqsv [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [20:40:09] (03CR) 10CI reject: [V:04-1] rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [20:40:17] (03PS7) 10Bking: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [20:43:16] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063763|Enable the dark mode in Portal namespace (T366380)]] (duration: 08m 57s) [20:43:20] T366380: Enable portal pages in night theme - https://phabricator.wikimedia.org/T366380 [20:43:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072227 (https://phabricator.wikimedia.org/T374484) (owner: 10Superzerocool) [20:43:35] Jdlrobson: should be live! [20:44:09] (03Merged) 10jenkins-bot: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072227 (https://phabricator.wikimedia.org/T374484) (owner: 10Superzerocool) [20:44:19] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072227|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374484)]] [20:44:23] T374484: Lift IP cap for 190.12.102.194 and 200.5.117.98 on 2024-10-19 - https://phabricator.wikimedia.org/T374484 [20:44:33] <_Gerges> Hi cjming [20:44:41] hi ! 
[20:45:08] (03CR) 10Dzahn: "PS4 changes:" [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [20:45:10] <_Gerges> If this could be edited patch, I don't know how I got it wrong I think it's due to autocomplete [20:45:10] <_Gerges> https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1067433 [20:45:58] Gerges: can you send up a new patch and add it to the deployment cal? i should have time to do it [20:46:16] !log cjming@deploy1003 cjming, superzerocool: Backport for [[gerrit:1072227|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374484)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:46:34] Superzerocool: if your patch can be tested, it's up on mwdebug [20:47:49] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:47:59] <_Gerges> He left [20:48:03] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:48:09] yeah - that's what it looks like [20:48:26] heh. could've been a missclick [20:48:42] anyway Gerges: happy to do one more config patch if you can get it out the door in the next few minutes [20:48:51] (03PS2) 10Ebrahim: Remove ProofreadPage exceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 [20:49:01] Superzerocool: shall i sync? [20:49:05] cjming: i don't think there's any way to test an IP cap lift patch anyway, so it seems fine to go ahead, if the details look correct [20:49:14] lgtm - syncing! [20:49:16] !log cjming@deploy1003 cjming, superzerocool: Continuing with sync [20:49:18] hi cjming yep, there is no way to test my patch... 
[20:49:20] (i mean, not untl the date it happens) [20:50:28] (03PS13) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [20:51:16] Thanks cjming for the help today! [20:51:24] ur welcome! [20:51:52] thanks cjming :)) [20:52:01] yw! [20:52:16] should be live shortly [20:52:39] Gerges: should i wait for your new patch? otherwise i'll close the backport window [20:53:10] _Gerges, I could do your patch, if you want me to [20:53:14] <_Gerges> Wait five minutes [20:53:25] sure - np [20:53:40] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072227|eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon (T374484)]] (duration: 09m 21s) [20:53:40] _Gerges, thank you for the quick response and :) [20:53:44] T374484: Lift IP cap for 190.12.102.194 and 200.5.117.98 on 2024-10-19 - https://phabricator.wikimedia.org/T374484 [20:53:56] (03CR) 10Pppery: "(this was written based off of Patch Set 3, some of this may have been done in Patch Set 4)" [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [20:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:55:08] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [20:55:36] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [20:56:38] thanks for your time and service cjming, see you around :) [20:56:51] you're welcome! 
[20:57:37] (03PS1) 10GergesShamon: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072604 [20:58:07] <_Gerges> Thanks for waiting for me [20:58:18] np! [20:58:38] _Gerges, r u sure you are lifting the cap for a private IP address? [20:58:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072604 (owner: 10GergesShamon) [20:59:22] oops - i already started scap backport for it [20:59:31] (03Merged) 10jenkins-bot: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072604 (owner: 10GergesShamon) [20:59:31] <_Gerges> @Hamishcz: What do you mean? [20:59:44] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072604|Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata]] [21:00:19] _Gerges, https://meta.wikimedia.org/wiki/Mass_account_creation#Requesting_temporary_lift_of_IP_cap [21:01:09] whoops - should i revert? [21:01:46] !log cjming@deploy1003 cjming, gergesshamon: Backport for [[gerrit:1072604|Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:01:48] i guess the original patch might need to be reverted too? [21:02:11] i'm thinking i should not sync this patch - please lmk [21:03:19] <_Gerges> Yes [21:03:20] I recommend revert T373468 related codes, and fix redundant dbname in L59(currently) [21:03:20] T373468: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T373468 [21:04:28] i'm not sure Gerges what you're saying yes to but i'm going to not sync and revert - that ok? 
[21:05:23] <_Gerges> not sync and revert [21:05:39] !log cjming@deploy1003 Sync cancelled. [21:06:04] <_Gerges> We need to get a public IP, not a private IP. [21:06:06] (03PS1) 10TrainBranchBot: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072606 [21:06:06] (03CR) 10TrainBranchBot: "cjming@deploy1003 created a revert of this change as Idedb69e4ddf1fb25e1733406a209d12281b57249" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072604 (owner: 10GergesShamon) [21:06:11] (03PS1) 10BryanDavis: toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651) [21:06:44] so Gerges: should we also revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1067433 ? [21:06:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072606 (owner: 10TrainBranchBot) [21:06:56] (03CR) 10CI reject: [V:04-1] toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651) (owner: 10BryanDavis) [21:07:23] (03PS1) 10JHathaway: catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 (https://phabricator.wikimedia.org/T372667) [21:07:30] (03Merged) 10jenkins-bot: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072606 (owner: 10TrainBranchBot) [21:07:31] <_Gerges> Yes [21:07:44] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072606|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]] [21:08:03] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (034 
comments) [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [21:08:32] Gerges: ok i will revert the patch from 8/28 and then call it a day [21:09:11] <_Gerges> Sorry for the delay [21:09:13] (03PS1) 10Clare Ming: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072609 [21:09:21] (03CR) 10CI reject: [V:04-1] Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072609 (owner: 10Clare Ming) [21:09:41] !log cjming@deploy1003 cjming, trainbranchbot: Backport for [[gerrit:1072606|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:10:24] and Gerges, I recommend you to contact the author of T373468 to request a new IP, as they have an activity on 17 Sep [21:10:25] T373468: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T373468 [21:11:16] Gerges: seems like i can't do a quick revert via gerrit UI (merge conflicts) so you'll have to send up a revert patch manually -- or if you think you'll get a non-private IP soon, add a new patch to override the IP [21:11:43] <_Gerges> ok [21:11:47] (03CR) 10CI reject: [V:04-1] catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [21:11:51] my bad - i didn't think to check documentation when we merged your first patch on 8/28 [21:11:57] quite a busy windows lol cjming [21:12:04] lol - it's true [21:12:12] (03PS2) 10JHathaway: catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 
(https://phabricator.wikimedia.org/T372667) [21:12:12] !log cjming@deploy1003 cjming, trainbranchbot: Continuing with sync [21:12:37] thanks Hamishcz for catching that - gtk [21:13:17] w/ pleasure :) [21:13:20] (03Abandoned) 10Clare Ming: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072609 (owner: 10Clare Ming) [21:13:35] !log removing 1 file for legal compliance [21:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:56] alrighty since we're over, i'm going to close the window for now [21:14:06] <_Gerges> @cjming: Should I do a patch now to revert? [21:14:18] Gerges: sure! i'll wait if you want to send it up [21:14:29] shouldn't take long [21:14:44] and it looks like there's nothing scheduled after this window so we have time [21:16:41] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072606|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]] (duration: 08m 57s) [21:16:52] (03CR) 10CI reject: [V:04-1] catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [21:17:31] (03PS3) 10JHathaway: catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 (https://phabricator.wikimedia.org/T372667) [21:18:14] (03PS1) 10Scott French: sre.discovery: set timeout in raw dns.query.udp [cookbooks] - 10https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) [21:19:03] (03PS2) 10BryanDavis: toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651) [21:20:11] (03PS1) 10GergesShamon: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, 
commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072614 [21:20:26] !log removing 1 file for legal compliance [21:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:52] (03CR) 10CI reject: [V:04-1] Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072614 (owner: 10GergesShamon) [21:20:52] Gerges: lgtm - i'm going to deploy your revert [21:20:59] <_Gerges> Ok [21:21:07] oops - what's up with CI? [21:21:57] i guess it's space or empty line related lol [21:22:58] ya [21:23:08] Gerges: can you fix? [21:23:12] exactly.. [21:23:23] L37 - https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-test/2841/console [21:23:32] (03PS3) 10BryanDavis: toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651) [21:23:52] (03CR) 10JHathaway: [C:03+2] catalog: use rich_data_json, rather than pson [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072608 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [21:24:42] <_Gerges> What? [21:24:58] Gerges: just remove empty lines from your patch [21:25:18] (03PS1) 10JHathaway: bump version [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072616 [21:25:41] specifically, remove L37, then everything would be ok [21:25:43] er: i think you can leave one empty line - CI doesn't like 2 empty lines [21:25:49] (03PS2) 10GergesShamon: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072614 [21:26:06] what Hamishcz said [21:27:03] whatever..just leave it [21:27:14] not the code is ok [21:27:17] now* [21:27:49] does that make sense Gerges? 
otherwise i can do it real quick [21:28:48] (03PS1) 10JHathaway: pcc: bump version on workers [puppet] - 10https://gerrit.wikimedia.org/r/1072617 (https://phabricator.wikimedia.org/T372667) [21:29:31] (03CR) 10BryanDavis: [C:03+2] toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651) (owner: 10BryanDavis) [21:29:33] ah you did it - ok deploying [21:30:00] (03CR) 10JHathaway: [C:03+2] bump version [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1072616 (owner: 10JHathaway) [21:30:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072614 (owner: 10GergesShamon) [21:30:08] (03CR) 10JHathaway: [C:03+2] pcc: bump version on workers [puppet] - 10https://gerrit.wikimedia.org/r/1072617 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [21:30:11] (03PS8) 10Bking: flink-app: customize calico label selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [21:30:34] (03Merged) 10jenkins-bot: toolhub: Add crawler.resources config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072607 (https://phabricator.wikimedia.org/T374651) (owner: 10BryanDavis) [21:30:42] (03Merged) 10jenkins-bot: Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072614 (owner: 10GergesShamon) [21:30:52] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072614|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]] [21:32:52] !log cjming@deploy1003 cjming, gergesshamon: Backport for [[gerrit:1072614|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [21:33:00] !log cjming@deploy1003 cjming, gergesshamon: Continuing with sync [21:33:27] <_Gerges> Sorry I lost the connection (I'm using IRC Cload, so it didn't appear that I was the only one connected) [21:34:12] (03PS14) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [21:37:29] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072614|Revert "Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata"]] (duration: 06m 37s) [21:37:43] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply [21:39:12] Gerges: no worries - revert is live! [21:39:31] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [21:39:34] now i will close the window (hopefully there's nothing else) [21:40:46] !log end of UTC late backport window [21:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:40] <_Gerges> Thanks [21:41:51] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10142588 (10thcipriani) >>! In T373969#10130953, @Ladsgroup wrote: > This we... [21:41:52] ur welcome [21:42:20] 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142586 (10MBinder_WMF) > The correct combination is phab1004.eqiad.wmnet with bast1003.wikimedia.org. Attached is my verbose output for that combinati... 
[21:43:59] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply [21:44:04] (03PS1) 10Ebrahim: Remove metawiki dark mode exceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623 [21:44:42] jouncebot: nowandnext [21:44:42] No deployments scheduled for the next 8 hour(s) and 15 minute(s) [21:44:42] In 8 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240913T0600) [21:44:55] (03PS2) 10Ebrahim: Fix night mode excepted Wikidata namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072488 [21:44:57] (03CR) 10Ladsgroup: [C:03+2] Fix night mode excepted Wikidata namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072488 (owner: 10Ebrahim) [21:45:17] 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142602 (10MBinder_WMF) Ah, I think I might know the problem: my public key file specifies the name of a computer that preceded my current one. I was pr... [21:45:35] (03Merged) 10jenkins-bot: Fix night mode excepted Wikidata namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072488 (owner: 10Ebrahim) [21:45:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072488 (owner: 10Ebrahim) [21:45:45] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1072488|Fix night mode excepted Wikidata namespaces]] [21:46:09] (03CR) 10Ebrahim: "Probably this is a rude way to ask for this so pardon me beforehand... 
but is it possible to reconsider these meta namespaces dark mode ex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623 (owner: 10Ebrahim) [21:46:40] (03PS2) 10Ebrahim: Remove metawiki dark mode exceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623 [21:47:23] (03PS1) 10Bartosz Dziewoński: Define MW_ENTRY_POINT in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072624 (https://phabricator.wikimedia.org/T374286) [21:47:42] !log ladsgroup@deploy1003 ladsgroup, ebrahim: Backport for [[gerrit:1072488|Fix night mode excepted Wikidata namespaces]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:48:22] !log ladsgroup@deploy1003 ladsgroup, ebrahim: Continuing with sync [21:51:33] 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142618 (10Dzahn) Do you have a file /Users/maxbinder/.ssh/id_ed25519.pub ? (not /Users/maxbinder/.ssh/id_ed25519 the private part, just the public p... 
[21:51:53] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [21:52:40] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/toolhub: apply [21:52:54] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072488|Fix night mode excepted Wikidata namespaces]] (duration: 07m 09s) [21:53:38] (03PS2) 10JHathaway: puppet8: ensure kerberos keytab type is binary [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) [21:53:41] (03CR) 10Scott French: "If you think using an explicit value here is clearer than "borrowing" the timeout already configured on the stub resolver, I'm happy to re" [cookbooks] - 10https://gerrit.wikimedia.org/r/1072612 (https://phabricator.wikimedia.org/T374047) (owner: 10Scott French) [21:53:50] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [21:53:55] 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142620 (10Dzahn) Also, try this: Move the config file out of the way temporarily. Like `mv /Users/maxbinder/.ssh/config /Users/maxbinder/` so it doe... 
[21:54:00] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [21:54:42] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [21:56:37] !log removing 6 files for legal compliance [21:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:02] 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142642 (10Dzahn) You shouldn't have to create a keypair just because your computer name changed. The part at the end is mostly just a comment field. [21:58:03] 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142629 (10MBinder_WMF) >>! In T374582#10142618, @Dzahn wrote: > Do you have a file /Users/maxbinder/.ssh/id_ed25519.pub ? (not /Users/maxbinder/.ssh/... [22:00:12] !log removing 1 file for legal compliance [22:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:31] 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142645 (10MBinder_WMF) >>! In T374582#10142620, @Dzahn wrote: > Also, try this: > > Move the config file out of the way temporarily. > > Like `mv /U... [22:01:25] 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142648 (10Ladsgroup) Can you run the ssh command with -vvvvvvvv (the more "v"s, the better)? 
[22:01:42] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142649 (Ladsgroup) but share the result privately, just in case.
[22:02:23] (PS2) EoghanGaffney: lists: Switch from ferm to nftables on standby host [puppet] - https://gerrit.wikimedia.org/r/1072551
[22:02:39] (PS3) EoghanGaffney: lists: Switch from ferm to nftables on standby host [puppet] - https://gerrit.wikimedia.org/r/1072551
[22:02:44] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142650 (Dzahn) Ok, do this: `ssh-add /Users/maxbinder/.ssh/id_ed25519` It should just ask for a passphrase. If you know it, enter it. Now that key...
[22:02:54] (PS3) Ebrahim: Remove ProofreadPage exceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/1072600
[22:03:21] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142651 (MBinder_WMF) >>! In T374582#10142649, @Ladsgroup wrote: > but share the result privately, just in case. doc updated with many v's :)
[22:04:49] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142652 (MBinder_WMF) >>! In T374582#10142650, @Dzahn wrote: > Ok, do this: > > `ssh-add /Users/maxbinder/.ssh/id_ed25519` > > It should just ask fo...
[22:05:10] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10142655 (phaultfinder)
[22:05:55] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142656 (MBinder_WMF) I can successfully log on to phab1004.eqiad.wmnet as well.
What was the issue?
[22:06:13] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142658 (Dzahn) >>! In T374582#10142645, @MBinder_WMF wrote: > Output still asked for passphrase: So that is the thing, the passphrase is needed to d...
[22:08:02] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142661 (Ladsgroup) FWIW, `ssh -vvvvvvvvvvvvvvvvvvvv ~/.ssh/id_ed25519 mbinder@bast1003.wikimedia.org` broke because: ` ssh: Could not resolve hostnam...
[22:08:18] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142662 (Dzahn) >>! In T374582#10142656, @MBinder_WMF wrote: > I can successfully log on to phab1004.eqiad.wmnet as well. What was the issue? The key...
[22:11:10] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[22:11:58] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[22:12:40] (CR) EoghanGaffney: [C:+2] lists: Switch from ferm to nftables on standby host [puppet] - https://gerrit.wikimedia.org/r/1072551 (owner: EoghanGaffney)
[22:13:37] (CR) Scott French: "Thanks for the review, Hugh!"
[puppet] - https://gerrit.wikimedia.org/r/1072282 (https://phabricator.wikimedia.org/T374502) (owner: Scott French)
[22:14:02] (PS3) JHathaway: puppet8: ensure kerberos keytab type is binary [puppet] - https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667)
[22:14:14] (Abandoned) Scott French: aptrepo: ffmpeg bullseye component [puppet] - https://gerrit.wikimedia.org/r/1072282 (https://phabricator.wikimedia.org/T374502) (owner: Scott French)
[22:15:10] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142682 (MBinder_WMF) Hmm, I'm pretty sure I never had to enter a passphrase for each login in the past, but I might be mistaken. Also, when I was pro...
[22:16:14] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142711 (Dzahn) This probably has to do with getting your new computer. Likely you had this key added to some kind of key chain or app provided by the...
[22:18:09] (CR) Jdlrobson: "I would suggest following https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests#Lifecycle_of_a_request and asking the community. If " [mediawiki-config] - https://gerrit.wikimedia.org/r/1072623 (owner: Ebrahim)
[22:18:26] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142727 (MBinder_WMF) >>! In T374582#10142711, @Dzahn wrote: > This probably has to do with getting your new computer. Likely you had this key added t...
[22:19:57] PROBLEM - Host lists2001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:21:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[22:21:19] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[22:21:29] RECOVERY - Host lists2001 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[22:23:09] (CR) Cwhite: [C:+1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[22:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[22:24:35] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142747 (MBinder_WMF) Ah, you know what? I think it did, in fact, work. I just didn't realize that I needed to enter it twice, and assumed that the re...
[22:24:55] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[22:25:11] (CR) Cwhite: [C:+2] ci: define statsd prometheus exporter mappings for zuul [puppet] - https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) (owner: Cwhite)
[22:26:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[22:28:34] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@6e810dc] (releasing): (no justification provided)
[22:29:01] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10142750 (Ladsgroup) Open→Resolved a:MBinder_WMF Don't know Mac but in Linux you can set it to "Remember the key passphrase" and it wou...
[22:30:17] !log dduvall@deploy1003 deploy aborted: (no justification provided) (duration: 01m 43s)
[22:31:50] FIRING: ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:32:41] PROBLEM - jenkins_service_running on releases1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[22:32:42] ^ sorry, that's me.
fixing
[22:33:53] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@6e810dc] (releasing): (no justification provided)
[22:34:15] (PS5) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor)
[22:34:21] (CR) BCornwall: "Thanks for all the review!" [puppet] - https://gerrit.wikimedia.org/r/1069643 (owner: Ncmonitor)
[22:34:21] (CR) JHathaway: [V:+1] "PCC SUCCESS (CORE_DIFF 24 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" [puppet] - https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[22:34:28] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@6e810dc] (releasing): (no justification provided) (duration: 00m 34s)
[22:34:41] RECOVERY - jenkins_service_running on releases1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[22:36:46] (PS1) JHathaway: fix rich data keys [software/puppet-compiler] - https://gerrit.wikimedia.org/r/1072628 (https://phabricator.wikimedia.org/T372667)
[22:36:47] (PS1) JHathaway: bump version [software/puppet-compiler] - https://gerrit.wikimedia.org/r/1072629
[22:36:50] RESOLVED: [3x] ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:41:03] SRE-Access-Requests, LDAP-Access-Requests: Vacation coverage for Katie Francis - https://phabricator.wikimedia.org/T374673#10142805 (Dzahn)
[22:42:05] SRE-Access-Requests, LDAP-Access-Requests: Vacation coverage for Katie Francis - https://phabricator.wikimedia.org/T374673#10142806 (Dzahn) Thanks for this @KFrancis Tagged it and keeping it open
right now to make people aware handling requests in this time. Enjoy vacation!
[22:46:30] (CR) JHathaway: [C:+2] fix rich data keys [software/puppet-compiler] - https://gerrit.wikimedia.org/r/1072628 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[22:46:38] (CR) JHathaway: [C:+2] bump version [software/puppet-compiler] - https://gerrit.wikimedia.org/r/1072629 (owner: JHathaway)
[22:46:53] (CR) JHathaway: [V:+2 C:+2] bump version [software/puppet-compiler] - https://gerrit.wikimedia.org/r/1072629 (owner: JHathaway)
[22:49:07] (PS1) JHathaway: pcc: bump version on workers, again :( [puppet] - https://gerrit.wikimedia.org/r/1072631 (https://phabricator.wikimedia.org/T372667)
[22:50:41] (CR) JHathaway: [C:+2] pcc: bump version on workers, again :( [puppet] - https://gerrit.wikimedia.org/r/1072631 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[22:59:55] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@35befba] (releasing): (no justification provided)
[23:00:34] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@35befba] (releasing): (no justification provided) (duration: 00m 38s)
[23:03:55] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:19:30] (PS2) Bartosz Dziewoński: Improve $wgFooterIcons override, remove $wmgWikimediaIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/1071712
[23:20:52] (PS1) Cwhite: zuul: set statsd-exporter to relay to local statsite instance [puppet] - https://gerrit.wikimedia.org/r/1072632 (https://phabricator.wikimedia.org/T233089)
[23:20:54] (PS1) Cwhite: zuul: send stats to prometheus-statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/1072633
(https://phabricator.wikimedia.org/T233089)
[23:21:16] (CR) CI reject: [V:-1] zuul: set statsd-exporter to relay to local statsite instance [puppet] - https://gerrit.wikimedia.org/r/1072632 (https://phabricator.wikimedia.org/T233089) (owner: Cwhite)
[23:23:35] (PS2) Cwhite: zuul: set statsd-exporter to relay to local statsite instance [puppet] - https://gerrit.wikimedia.org/r/1072632 (https://phabricator.wikimedia.org/T233089)
[23:23:35] (PS2) Cwhite: zuul: send stats to prometheus-statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/1072633 (https://phabricator.wikimedia.org/T233089)
[23:27:20] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#10142869 (Papaul) Open→Resolved a:Papaul Since we know now what the issue is and we have a fix I am closing this task but feel free to...
[23:27:41] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10142876 (Papaul) a:Papaul
[23:28:05] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10142877 (Papaul) a:Papaul
[23:31:34] PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:36:34] PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:37:31] PROBLEM - Hadoop NodeManager
on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:37:32] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:38:32] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072635
[23:38:33] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072635 (owner: TrainBranchBot)
[23:46:34] RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:47:30] RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:48:55] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed