[00:05:28] FIRING: [10x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:42] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1109518 [00:38:42] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1109518 (owner: 10TrainBranchBot) [00:45:19] 06SRE, 10Observability-Alerting: Two close pages for idle workers api + appserver didn't auto-resolve on recovery - https://phabricator.wikimedia.org/T266570#10446596 (10andrea.denisse) Related to T264016, and T263423. [00:50:19] 06SRE, 10Observability-Alerting: smart-data-dump should fail loudly when it can't gather metrics - https://phabricator.wikimedia.org/T267135#10446605 (10andrea.denisse) Hi team, I was wondering if this is still relevant as from the HW side we're no longer planning on using HP hosts. What do you think? 
[00:53:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:53:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:54:19] 06SRE, 06Infrastructure-Foundations, 10Observability-Alerting: Improve alerting for hosts with Puppet disabled for longer periods - https://phabricator.wikimedia.org/T277083#10446643 (10andrea.denisse) [00:56:51] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1109518 (owner: 10TrainBranchBot) [01:04:32] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:08:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1109522 [01:08:39] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1109522 (owner: 10TrainBranchBot) [01:09:15] 10SRE-tools, 10Observability-Logging: Create a cookbook for managing Logstash cluster restarts - https://phabricator.wikimedia.org/T293929#10446682 (10colewhite) [01:11:32] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:16:25] 10SRE-tools, 10Observability-Logging: Create a cookbook for managing Logstash cluster restarts - https://phabricator.wikimedia.org/T293929#10446694 (10colewhite) 
[01:19:30] 14SRE-Sprint-Week-Sustainability-March2023, 10MediaWiki-General, 10observability, 10Observability-Logging, and 2 others: MediaWiki log spam during row D blip / rack D2 unavailable - https://phabricator.wikimedia.org/T233739#10446707 (10colewhite) 05Open→03Declined Boldly declining in favor of focus... [01:20:21] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10446713 (10Platonides) Out of 4637 test emails sent in December (from 18th to 31st), 100% of them were held correctly. The "Message content... [01:28:09] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1109522 (owner: 10TrainBranchBot) [01:38:32] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:48:28] 06SRE, 06Infrastructure-Foundations, 10Observability-Logging, 10Puppet (Puppet 7.0): Switch rsyslog to use the new PKI infrastructure - https://phabricator.wikimedia.org/T347565#10446744 (10andrea.denisse) Hi team, I was looking at the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/... 
[01:55:52] (03PS1) 10Ladsgroup: mediawiki: Add Uncategorizedpages cron for commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1109526 (https://phabricator.wikimedia.org/T369024) [01:57:38] (03PS1) 10Ladsgroup: mediawiki: Remove special-case wikitech update query page runs [puppet] - 10https://gerrit.wikimedia.org/r/1109527 [01:57:53] (03CR) 10CI reject: [V:04-1] mediawiki: Add Uncategorizedpages cron for commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1109526 (https://phabricator.wikimedia.org/T369024) (owner: 10Ladsgroup) [02:01:30] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:02:32] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:04:30] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:34:32] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:37:30] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:37:32] PROBLEM - Hadoop NodeManager on an-worker1117 is 
CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:38:18] 06SRE, 06Data-Engineering, 06Data-Engineering-Icebox, 06Traffic: Webrequest x_analtics `wprov` value is incorrectly formatted - https://phabricator.wikimedia.org/T339910#10446772 (10Ottomata) [02:38:32] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:38:44] 06SRE, 06Data-Engineering, 06Data-Engineering-Icebox, 06Traffic, and 3 others: Include User-Agent Client Hints in WebRequest logs - https://phabricator.wikimedia.org/T337947#10446775 (10Ottomata) [02:45:51] 07Puppet, 06Data-Engineering, 06Data-Engineering-Icebox, 10observability: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#10447041 (10Ottomata) [02:46:07] 06SRE, 06Data-Engineering, 06Data-Engineering-Icebox, 06Traffic-Icebox: Investigate and fix odd uri_host values - https://phabricator.wikimedia.org/T188804#10447045 (10Ottomata) [02:46:21] 06SRE, 06Data-Engineering, 06Data-Engineering-Icebox, 06Privacy Engineering, 06Traffic: Publishing project anomaly data for censorship researchers. 
Evaluate privacy threats - https://phabricator.wikimedia.org/T183990#10447046 (10Ottomata) [02:47:54] 06SRE, 06Data-Engineering, 06Data-Engineering-Icebox, 06Traffic-Icebox, 07Privacy: Add request_id to webrequest logs as well as other event records ingested into Hadoop - https://phabricator.wikimedia.org/T113817#10447064 (10Ottomata) [02:55:32] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:00:18] (03Restored) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1104740 (owner: 10Pppery) [03:16:32] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:25:32] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:33:24] (03PS2) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1104740 [03:46:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:04:22] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:05:28] FIRING: [10x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on 
cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:06:30] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:09:30] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:34:16] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:45:16] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:25:32] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:49:43] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383 (10phaultfinder) 03NEW [05:50:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:53:39] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, 
10Move-Files-To-Commons: Error using FileImporter and undelete file on Commons because of "local-multiwrite/local-public...is in an inconsistent state within the int... - https://phabricator.wikimedia.org/T382715#10447221 [06:23:41] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10447251 (10phaultfinder) [06:38:32] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:52:22] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10447254 (10phaultfinder) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250110T0700) [07:05:42] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:08:56] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:09:16] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:11:42] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53369 bytes in 9.943 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:13:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.187 second response 
time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:20:32] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:45:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:51:32] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250110T0800) [08:01:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:05:28] FIRING: [10x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:30] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:09:30] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:13:32] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:14:22] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:15:33] !log homer 'cr*eqiad*' commit 'T377876' [08:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:36] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [08:20:59] !log homer 'lsw1-e3-eqiad*' commit 'T377876' [08:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:02] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [08:25:33] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1057.eqiad.wmnet [08:25:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1057.eqiad.wmnet [08:30:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1057.eqiad.wmnet - https://phabricator.wikimedia.org/T381676#10447286 (10Jelto) Thanks @Jclark-ctr for the quick help and running the reimage one more time. The host looks good to me now. I exe... 
[08:31:50] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1069.eqiad.wmnet [08:31:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1069.eqiad.wmnet [08:32:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T381770#10447291 (10Jelto) Thanks @Jclark-ctr for the quick help and running the reimage one more time. The host looks goo... [08:33:29] (03CR) 10Stevemunene: Make WikibaseQualityConstraints use split-graph query service (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) (owner: 10Stevemunene) [08:34:26] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1073.eqiad.wmnet [08:34:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1073.eqiad.wmnet [08:34:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10447296 (10Jelto) Thanks @Jclark-ctr for the quick help and running the reimage one more time. The host looks goo... [08:36:52] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1081.eqiad.wmnet [08:36:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1081.eqiad.wmnet [08:38:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10447301 (10Jelto) Thanks @Jclark-ctr for the quick help and running the reimage one more time. The host looks goo... 
[08:39:41] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1243.eqiad.wmnet [08:39:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1243.eqiad.wmnet [08:39:56] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T383051#10447308 (10Jelto) Thanks @Jclark-ctr for the quick help and running the reimage one more time. The host lo... [08:43:16] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:44:16] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:00:52] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [09:05:38] (03PS1) 10Ilias Sarantopoulos: api-gateway: add reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109666 (https://phabricator.wikimedia.org/T378495) [09:06:24] (03CR) 10Filippo Giunchedi: prometheus: migrate ops instance to prometheus::instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [09:10:15] (03PS6) 10Filippo Giunchedi: prometheus: migrate ops instance to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) [09:10:15] (03PS8) 10Filippo Giunchedi: prometheus: k8s instances migration to 
prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) [09:13:13] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: migrate ops instance to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [09:19:51] !log homer 'lsw1-c6-codfw*' commit 'T377877' [09:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:54] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [09:20:50] RECOVERY - BGP status on lsw1-c6-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:21:12] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2022.codfw.wmnet [09:21:13] (03PS1) 10Brouberol: airflow-research: disable the airflow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1109667 (https://phabricator.wikimedia.org/T380615) [09:21:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2022.codfw.wmnet [09:21:22] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:22:39] (03PS2) 10Brouberol: airflow-research: disable the airflow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1109667 (https://phabricator.wikimedia.org/T380615) [09:23:23] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4777/co" [puppet] - 10https://gerrit.wikimedia.org/r/1109667 (https://phabricator.wikimedia.org/T380615) (owner: 10Brouberol) [09:24:22] PROBLEM - Hadoop NodeManager on 
an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:25:46] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2049-2052].codfw.wmnet [09:27:33] (03CR) 10Btullis: [C:03+1] airflow-research: disable the airflow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1109667 (https://phabricator.wikimedia.org/T380615) (owner: 10Brouberol) [09:28:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2049-2052].codfw.wmnet [09:29:32] PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10447363 (10phaultfinder) [09:29:54] (03CR) 10Jelto: [C:03+2] Rename kubernetes20[49-52] to wikikube-worker219[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1109453 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [09:32:59] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2049 to wikikube-worker2195 [09:33:20] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:34:18] (03CR) 10Kevin Bazira: [C:03+1] api-gateway: add reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109666 (https://phabricator.wikimedia.org/T378495) (owner: 10Ilias Sarantopoulos) [09:36:46] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2049 to wikikube-worker2195 - jelto@cumin1002" [09:37:17] 10ops-eqiad, 06SRE, 06Data-Persistence, 
06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10447374 (10Marostegui) →14Duplicate dup:03T382569 [09:37:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10447375 (10Marostegui) 05Duplicate→03Open Sorry, my mistake. This task is still open and valid. [09:37:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2049 to wikikube-worker2195 - jelto@cumin1002" [09:37:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:37:51] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2195 [09:38:02] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10447377 (10Marostegui) [09:38:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2195 [09:38:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2049 to wikikube-worker2195 [09:39:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10447380 (10Marostegui) @VRiley-WMF You can proceed with this host anytime; I will work on the other ones and leave this one for last. 
[09:39:37] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2050 to wikikube-worker2196 [09:39:57] !log `apt-get clean` on an-worker1117 to free space on the root partition [09:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:58] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:40:50] !log `apt-get clean` on an-worker1147 to free space on the root partition [09:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:22] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:42:32] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:43:29] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2050 to wikikube-worker2196 - jelto@cumin1002" [09:43:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on kubernetes2051:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:44:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2050 to wikikube-worker2196 - jelto@cumin1002" [09:44:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:44:25] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2196 [09:44:43] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2196 [09:45:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2050 to wikikube-worker2196 [09:45:46] !log elukey@cumin1002:~$ sudo cumin 'an-worker11[16,43,19,47,56,72,69]*' 'apt-get clean' [09:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:57] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2051 to wikikube-worker2197 [09:45:58] (03CR) 10Brouberol: [V:03+1 C:03+2] airflow-research: disable the airflow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1109667 (https://phabricator.wikimedia.org/T380615) (owner: 10Brouberol) [09:46:18] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:47:30] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:47:33] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Make WikibaseQualityConstraints use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) (owner: 10Stevemunene) [09:49:39] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2051 to wikikube-worker2197 - jelto@cumin1002" [09:49:44] !log elukey@cumin1002:~$ sudo cumin 'an-worker11[39,15,54,90,75,57,89,18,06,24]*' 'apt-get clean' [09:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2051 to wikikube-worker2197 - jelto@cumin1002" [09:50:05] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [09:50:05] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2197 [09:50:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2197 [09:50:32] RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:50:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2051 to wikikube-worker2197 [09:51:10] RECOVERY - Disk space on an-worker1124 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1124&var-datasource=eqiad+prometheus/ops [09:51:34] RECOVERY - Disk space on an-worker1117 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1117&var-datasource=eqiad+prometheus/ops [09:51:36] RECOVERY - Disk space on an-worker1169 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1169&var-datasource=eqiad+prometheus/ops [09:51:40] RECOVERY - Disk space on an-worker1154 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1154&var-datasource=eqiad+prometheus/ops [09:51:40] RECOVERY - Disk space on an-worker1139 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops [09:51:40] RECOVERY - Disk space on an-worker1157 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1157&var-datasource=eqiad+prometheus/ops [09:51:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P71957 and previous config saved to /var/cache/conftool/dbconfig/20250110-095141-root.json [09:51:44] RECOVERY - Disk space on an-worker1172 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1172&var-datasource=eqiad+prometheus/ops [09:51:50] RECOVERY - Disk space on an-worker1156 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1156&var-datasource=eqiad+prometheus/ops [09:51:54] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2052 to wikikube-worker2198 [09:52:16] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:52:30] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:54:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2132,2160,2232].codfw.wmnet with reason: maintenance [09:54:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2132,2160,2232].codfw.wmnet with reason: maintenance [09:54:50] RECOVERY - Disk space on an-worker1147 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1147&var-datasource=eqiad+prometheus/ops [09:54:56] RECOVERY - Disk space on an-worker1116 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1116&var-datasource=eqiad+prometheus/ops [09:55:10] RECOVERY - Disk space on an-worker1119 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1119&var-datasource=eqiad+prometheus/ops [09:55:46] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2052 to wikikube-worker2198 - jelto@cumin1002" [09:55:52] (03PS1) 10Marostegui: mariadb: Productionize db2232 [puppet] - 10https://gerrit.wikimedia.org/r/1109668 (https://phabricator.wikimedia.org/T373579) [09:56:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2052 to wikikube-worker2198 - jelto@cumin1002" [09:56:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:56:03] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2198 [09:56:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2198 [09:56:30] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2232 [puppet] - 10https://gerrit.wikimedia.org/r/1109668 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [09:56:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2052 to wikikube-worker2198 [09:57:03] !log kill hanging jupyterhub process on stat1009 to allow puppet to run and delete a user [09:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:33] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2195.codfw.wmnet wikikube-worker2196.codfw.wmnet 
wikikube-worker2197.codfw.wmnet wikikube-worker2198.codfw.wmnet on all recursors [09:57:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2195.codfw.wmnet wikikube-worker2196.codfw.wmnet wikikube-worker2197.codfw.wmnet wikikube-worker2198.codfw.wmnet on all recursors [09:58:14] RECOVERY - Disk space on an-worker1143 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1143&var-datasource=eqiad+prometheus/ops [10:00:33] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2195.codfw.wmnet with OS bookworm [10:00:36] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2196.codfw.wmnet with OS bookworm [10:00:43] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2195 [10:00:46] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2196 [10:01:17] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:01:20] RECOVERY - Disk space on an-worker1106 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1106&var-datasource=eqiad+prometheus/ops [10:01:20] RECOVERY - Disk space on an-worker1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1115&var-datasource=eqiad+prometheus/ops [10:01:30] (03CR) 10Elukey: "@aotto@wikimedia.org: hi! Could you please clean up all the systemd units on eventlog1003? 
They show up as alarms, see https://alerts.wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1109157 (owner: 10Ottomata) [10:02:44] !log elukey@cumin1002:~$ sudo cumin -b 20 'an-worker*' 'apt-get clean' (safety to free space and avoid issues on hadoop) - T383320 [10:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:47] T383320: Low disk space on the root partition for several Hadoop workers - https://phabricator.wikimedia.org/T383320 [10:02:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2128 db2228 T373579', diff saved to https://phabricator.wikimedia.org/P71959 and previous config saved to /var/cache/conftool/dbconfig/20250110-100248-marostegui.json [10:02:52] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [10:03:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2128,2186,2228].codfw.wmnet with reason: maintenance [10:03:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2128,2186,2228].codfw.wmnet with reason: maintenance [10:04:54] RECOVERY - Disk space on an-worker1118 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops [10:05:01] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2196 - jelto@cumin1002" [10:05:02] RECOVERY - Disk space on an-worker1110 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [10:05:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host 
wikikube-worker2196 - jelto@cumin1002" [10:05:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:05:05] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2196.codfw.wmnet 224.48.192.10.in-addr.arpa 4.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:05:07] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:05:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2196.codfw.wmnet 224.48.192.10.in-addr.arpa 4.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:05:09] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2196 [10:05:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2196 [10:05:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2196 [10:06:02] !log restart dump_cloud_ip_ranges on puppetserver1001 - unit failed due to errors while fetching new data from upstream, trying to see if it was a temporary issue [10:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P71960 and previous config saved to /var/cache/conftool/dbconfig/20250110-100611-root.json [10:06:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P71961 and previous config saved to /var/cache/conftool/dbconfig/20250110-100621-root.json [10:06:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P71962 and previous config saved to 
/var/cache/conftool/dbconfig/20250110-100646-root.json [10:07:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:07:28] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2195.codfw.wmnet 225.48.192.10.in-addr.arpa 5.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:07:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2195.codfw.wmnet 225.48.192.10.in-addr.arpa 5.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:07:31] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2195 [10:07:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2195 [10:07:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2195 [10:11:36] (03PS1) 10Marostegui: monitoring.yaml: Change check host. [puppet] - 10https://gerrit.wikimedia.org/r/1109669 (https://phabricator.wikimedia.org/T373579) [10:11:40] RECOVERY - Disk space on an-worker1090 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1090&var-datasource=eqiad+prometheus/ops [10:11:46] (03CR) 10Marostegui: [C:04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1109669 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:12:17] (03CR) 10Marostegui: [C:04-2] "Jaime, is this ready to go anytime from your side once the host is cloned or should I coordinate with you before merging." [puppet] - 10https://gerrit.wikimedia.org/r/1109669 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:17:18] (03CR) 10Jcrespo: "Let me quickly double check it would work without issues." 
[puppet] - 10https://gerrit.wikimedia.org/r/1109669 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:17:42] (03CR) 10Marostegui: [C:04-2] "Give me a minute, the host is finishing its cloning" [puppet] - 10https://gerrit.wikimedia.org/r/1109669 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:18:08] RECOVERY - Disk space on an-worker1089 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1089&var-datasource=eqiad+prometheus/ops [10:19:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1022.eqiad.wmnet with reason: maintenance [10:19:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1022.eqiad.wmnet with reason: maintenance [10:21:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P71963 and previous config saved to /var/cache/conftool/dbconfig/20250110-102116-root.json [10:21:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P71964 and previous config saved to /var/cache/conftool/dbconfig/20250110-102126-root.json [10:21:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P71965 and previous config saved to /var/cache/conftool/dbconfig/20250110-102152-root.json [10:23:06] (03CR) 10Jcrespo: "No prob. Let me downtime the associated checks just in case meanwhile, in case something could go wrong during the migration." 
[puppet] - 10https://gerrit.wikimedia.org/r/1109669 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:24:01] (03PS1) 10Brouberol: airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109670 (https://phabricator.wikimedia.org/T380620) [10:25:48] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2196.codfw.wmnet with reason: host reimage [10:28:00] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2195.codfw.wmnet with reason: host reimage [10:28:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2196.codfw.wmnet with reason: host reimage [10:30:32] (03CR) 10Jcrespo: [C:03+1] "So, after checking, this will be a noop, as it only configures eventual monitoring hosts on codfw, which we have none at the time (it is v" [puppet] - 10https://gerrit.wikimedia.org/r/1109669 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:30:42] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1093-1095].eqiad.wmnet [10:30:44] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1093-1095].eqiad.wmnet [10:31:41] (03PS1) 10JMeybohm: Update to calico v3.29.1 [debs/calico] (v3.29) - 10https://gerrit.wikimedia.org/r/1109671 (https://phabricator.wikimedia.org/T341984) [10:31:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2195.codfw.wmnet with reason: host reimage [10:32:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383213#10447459 (10kamila) [10:32:45] (03PS4) 10Cathal Mooney: Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) [10:33:19] (03CR) 10Marostegui: [C:04-2] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1109669 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:34:46] (03PS1) 10JMeybohm: Update to kubernetes v1.31.4 [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1109672 (https://phabricator.wikimedia.org/T341984) [10:35:41] (03PS2) 10JMeybohm: Update to kubernetes v1.31.4 [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1109672 (https://phabricator.wikimedia.org/T341984) [10:35:43] (03PS1) 10Marostegui: db2232: Get pt-heartbeat running [puppet] - 10https://gerrit.wikimedia.org/r/1109673 (https://phabricator.wikimedia.org/T373579) [10:35:43] (03CR) 10Marostegui: [C:03+2] monitoring.yaml: Change check host. [puppet] - 10https://gerrit.wikimedia.org/r/1109669 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:36:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P71966 and previous config saved to /var/cache/conftool/dbconfig/20250110-103622-root.json [10:36:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P71967 and previous config saved to /var/cache/conftool/dbconfig/20250110-103632-root.json [10:36:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P71968 and previous config saved to /var/cache/conftool/dbconfig/20250110-103657-root.json [10:38:17] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10447475 (10Gehel) [10:39:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 2 others: 
Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10447497 (10Gehel) [10:40:28] FIRING: [10x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:41:00] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10447541 (10Gehel) [10:41:06] 07sre-alert-triage, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10447553 (10Gehel) [10:41:12] 07sre-alert-triage, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): Alert in need of triage: SmartNotHealthy (instance dse-k8s-worker1009:9100) - https://phabricator.wikimedia.org/T382871#10447551 (10Gehel) [10:44:55] (03CR) 10Marostegui: [C:03+2] db2232: Get pt-heartbeat running [puppet] - 10https://gerrit.wikimedia.org/r/1109673 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:45:13] (03CR) 10CI reject: [V:04-1] Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [10:45:28] FIRING: [10x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:47:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2123 T383388', diff saved to https://phabricator.wikimedia.org/P71969 and previous config saved to 
/var/cache/conftool/dbconfig/20250110-104739-marostegui.json [10:47:43] T383388: decommission db2123.codfw.wmnet - https://phabricator.wikimedia.org/T383388 [10:49:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2196.codfw.wmnet with OS bookworm [10:51:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P71970 and previous config saved to /var/cache/conftool/dbconfig/20250110-105127-root.json [10:51:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P71971 and previous config saved to /var/cache/conftool/dbconfig/20250110-105137-root.json [10:51:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2195.codfw.wmnet with OS bookworm [10:52:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P71972 and previous config saved to /var/cache/conftool/dbconfig/20250110-105202-root.json [10:53:27] (03PS1) 10Marostegui: db2191: Make it s5 candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1109678 (https://phabricator.wikimedia.org/T374951) [10:54:11] (03CR) 10CI reject: [V:04-1] airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109670 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [10:54:35] (03CR) 10Marostegui: [C:03+2] db2191: Make it s5 candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1109678 (https://phabricator.wikimedia.org/T374951) (owner: 10Marostegui) [10:54:46] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2197.codfw.wmnet with OS bookworm [10:54:47] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker2198.codfw.wmnet with OS bookworm [10:54:57] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2197 [10:54:57] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2198 [10:55:14] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:55:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2192 to change binlog format', diff saved to https://phabricator.wikimedia.org/P71973 and previous config saved to /var/cache/conftool/dbconfig/20250110-105514-marostegui.json [10:55:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2192.codfw.wmnet with reason: maintenance [10:55:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2192.codfw.wmnet with reason: maintenance [10:56:36] (03CR) 10Marostegui: [C:03+2] "This was db2192, commit message was wrong" [puppet] - 10https://gerrit.wikimedia.org/r/1109678 (https://phabricator.wikimedia.org/T374951) (owner: 10Marostegui) [10:57:19] (03PS5) 10Cathal Mooney: Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) [10:58:42] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2198 - jelto@cumin1002" [10:58:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2198 - jelto@cumin1002" [10:58:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:58:47] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2198.codfw.wmnet 222.48.192.10.in-addr.arpa 2.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors 
[10:58:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2198.codfw.wmnet 222.48.192.10.in-addr.arpa 2.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:58:51] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2198 [10:59:03] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:59:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2198 [10:59:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2198 [11:00:54] (03PS1) 10Marostegui: db2123: No longer candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1109679 (https://phabricator.wikimedia.org/T383388) [11:01:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:01:22] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2197.codfw.wmnet 223.48.192.10.in-addr.arpa 3.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:01:23] (03CR) 10Marostegui: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1109669 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [11:01:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2197.codfw.wmnet 223.48.192.10.in-addr.arpa 3.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:01:26] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2197 [11:01:31] (03CR) 10Marostegui: [C:03+2] db2123: No longer candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1109679 (https://phabricator.wikimedia.org/T383388) (owner: 10Marostegui) [11:02:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2197 [11:02:07] 
!log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2197 [11:06:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P71974 and previous config saved to /var/cache/conftool/dbconfig/20250110-110633-root.json [11:06:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P71975 and previous config saved to /var/cache/conftool/dbconfig/20250110-110643-root.json [11:07:22] (03CR) 10CI reject: [V:04-1] Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [11:11:00] (03PS1) 10Filippo Giunchedi: prometheus: add initial lv size to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) [11:14:28] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4778/co" [puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [11:19:06] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2198.codfw.wmnet with reason: host reimage [11:21:56] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2197.codfw.wmnet with reason: host reimage [11:22:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2198.codfw.wmnet with reason: host reimage [11:26:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2197.codfw.wmnet with reason: host reimage [11:29:12] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP 
CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:32:02] (03CR) 10JMeybohm: prometheus: k8s instances migration to prometheus::instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [11:35:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:41:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2198.codfw.wmnet with OS bookworm [11:44:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P71976 and previous config saved to /var/cache/conftool/dbconfig/20250110-114417-root.json [11:46:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2197.codfw.wmnet with OS bookworm [11:46:48] 06SRE, 06Traffic, 05WMF-NDA: Define a schema for analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392 (10Fabfur) 03NEW [11:49:21] !log homer 'lsw1-d5-codfw*' commit 'T377877' [11:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:25] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [11:50:04] !log homer 'lsw1-d8-codfw*' commit 'T377877' [11:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:53] !log homer 'lsw1-d6-codfw*' commit 'T377877' [11:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:38] !log homer 'cr*eqiad*' commit 'T377876' [11:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:41] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [11:51:49] !log homer 
'cr*codw*' commit 'T377877' [11:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:13] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [11:53:32] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 154, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:54:07] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2195-2198].codfw.wmnet [11:54:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2195-2198].codfw.wmnet [11:54:58] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383341#10447829 (10Jelto) [11:59:13] (03PS1) 10Jelto: Rename kubernetes20[45-48] to wikikube-worker[2199-2202] [puppet] - 10https://gerrit.wikimedia.org/r/1109691 (https://phabricator.wikimedia.org/T377877) [11:59:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P71977 and previous config saved to /var/cache/conftool/dbconfig/20250110-115922-root.json [11:59:24] (03PS1) 10Marostegui: db2126: No longer sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/1109692 (https://phabricator.wikimedia.org/T374623) [11:59:44] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1109692 (https://phabricator.wikimedia.org/T374623) (owner: 10Marostegui) [11:59:55] (03CR) 10Marostegui: [C:03+2] db2126: No longer sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/1109692 (https://phabricator.wikimedia.org/T374623) (owner: 10Marostegui) [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250110T0800) [12:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250110T1200). [12:05:48] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:07:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [12:07:56] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:38:54] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes20[45-48] to wikikube-worker[2199-2202] [puppet] - 10https://gerrit.wikimedia.org/r/1109691 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [12:41:47] 06SRE, 06Traffic, 05WMF-NDA: Define a schema for analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10447951 (10gmodena) [12:44:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P71982 and previous config saved to /var/cache/conftool/dbconfig/20250110-124438-root.json [12:58:38] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2045-2048].codfw.wmnet [13:00:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2045-2048].codfw.wmnet [13:01:34] (03CR) 10Jelto: [C:03+2] Rename kubernetes20[45-48] to wikikube-worker[2199-2202] [puppet] - 10https://gerrit.wikimedia.org/r/1109691 
(https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [13:04:46] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2045 to wikikube-worker2199 [13:05:07] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:05:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [13:05:43] status [13:07:39] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [13:07:39] status [13:08:28] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2045 to wikikube-worker2199 - jelto@cumin1002" [13:08:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2045 to wikikube-worker2199 - jelto@cumin1002" [13:08:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:08:46] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2199 [13:09:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2199 [13:09:40] !log jelto@cumin1002 END 
(PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2045 to wikikube-worker2199 [13:10:35] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2046 to wikikube-worker2200 [13:10:57] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:11:13] (03CR) 10Filippo Giunchedi: prometheus: k8s instances migration to prometheus::instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:12:22] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108726 (owner: 10PipelineBot) [13:12:30] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108456 (owner: 10PipelineBot) [13:12:36] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099658 (owner: 10PipelineBot) [13:14:16] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2046 to wikikube-worker2200 - jelto@cumin1002" [13:15:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on kubernetes2047:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:15:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:15:49] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2200 [13:16:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2200 [13:17:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2046 to wikikube-worker2200 [13:17:51] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2047 to wikikube-worker2201 [13:18:11] !log 
jelto@cumin1002 START - Cookbook sre.dns.netbox [13:20:55] (03CR) 10JMeybohm: "Indeed a bit hard to read due to the different indentation level. LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:21:38] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2047 to wikikube-worker2201 - jelto@cumin1002" [13:22:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2047 to wikikube-worker2201 - jelto@cumin1002" [13:22:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:22:52] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2201 [13:23:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2201 [13:23:46] (03PS1) 10Marostegui: mariadb: Update pc3 situation [puppet] - 10https://gerrit.wikimedia.org/r/1109702 (https://phabricator.wikimedia.org/T383398) [13:23:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2047 to wikikube-worker2201 [13:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10448017 (10phaultfinder) [13:24:38] (03PS2) 10Marostegui: mariadb: Update pc3 situation [puppet] - 10https://gerrit.wikimedia.org/r/1109702 (https://phabricator.wikimedia.org/T383398) [13:24:59] (03CR) 10Marostegui: "This change is a NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/1109702 (https://phabricator.wikimedia.org/T383398) (owner: 10Marostegui) [13:25:06] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2048 to wikikube-worker2202 [13:25:27] !log jelto@cumin1002 START - Cookbook sre.dns.netbox 
[13:25:43] (03CR) 10Marostegui: [C:03+2] mariadb: Update pc3 situation [puppet] - 10https://gerrit.wikimedia.org/r/1109702 (https://phabricator.wikimedia.org/T383398) (owner: 10Marostegui) [13:26:53] (03CR) 10Filippo Giunchedi: "Thank you for the quick review Janis!" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:28:58] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2048 to wikikube-worker2202 - jelto@cumin1002" [13:29:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2048 to wikikube-worker2202 - jelto@cumin1002" [13:29:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:29:18] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2202 [13:29:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2202 [13:30:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2048 to wikikube-worker2202 [13:30:28] !log Move pc1013 to pc3 dbmaint eqiad - T383398 [13:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:33] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398 [13:32:00] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2199.codfw.wmnet wikikube-worker2200.codfw.wmnet wikikube-worker2201.codfw.wmnet wikikube-worker2202.codfw.wmnet on all recursors [13:32:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2199.codfw.wmnet wikikube-worker2200.codfw.wmnet wikikube-worker2201.codfw.wmnet wikikube-worker2202.codfw.wmnet on all 
recursors [13:33:32] (03PS2) 10Brouberol: airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109670 (https://phabricator.wikimedia.org/T380620) [13:34:59] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2199.codfw.wmnet with OS bookworm [13:34:59] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2200.codfw.wmnet with OS bookworm [13:35:09] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2199 [13:35:10] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2200 [13:36:17] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:36:55] (03PS1) 10Marostegui: pc2014: Move it to pc4 [puppet] - 10https://gerrit.wikimedia.org/r/1109703 (https://phabricator.wikimedia.org/T383398) [13:37:34] (03CR) 10Marostegui: [C:03+2] pc2014: Move it to pc4 [puppet] - 10https://gerrit.wikimedia.org/r/1109703 (https://phabricator.wikimedia.org/T383398) (owner: 10Marostegui) [13:38:27] !log Move pc2013 to pc4 dbmaint codfw - T383398 [13:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:30] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398 [13:40:16] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2199 - jelto@cumin1002" [13:40:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2199 - jelto@cumin1002" [13:40:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:40:21] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2199.codfw.wmnet 229.48.192.10.in-addr.arpa 
9.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:40:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2199.codfw.wmnet 229.48.192.10.in-addr.arpa 9.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:40:24] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2199 [13:40:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2199 [13:40:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2199 [13:40:55] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:41:52] (03PS3) 10Brouberol: airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109670 (https://phabricator.wikimedia.org/T380620) [13:43:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:43:12] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2200.codfw.wmnet 228.48.192.10.in-addr.arpa 8.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:43:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2200.codfw.wmnet 228.48.192.10.in-addr.arpa 8.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:43:16] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2200 [13:43:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2200 [13:43:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2200 [13:44:07] (03PS1) 10JMeybohm: k8s::package: Install version specific kubernetes-client package [puppet] - 
10https://gerrit.wikimedia.org/r/1109704 (https://phabricator.wikimedia.org/T341984) [13:44:16] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109704 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:44:33] (03CR) 10CI reject: [V:04-1] k8s::package: Install version specific kubernetes-client package [puppet] - 10https://gerrit.wikimedia.org/r/1109704 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:45:49] (03PS2) 10JMeybohm: k8s::package: Install version specific kubernetes-client package [puppet] - 10https://gerrit.wikimedia.org/r/1109704 (https://phabricator.wikimedia.org/T341984) [13:46:34] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109704 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:47:32] (03PS1) 10Brouberol: airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620) [13:47:47] PROBLEM - MariaDB Replica SQL: s7 #page on db1170 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: huwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:48:07] another corruption [13:48:10] <_joe_> yes [13:48:16] <_joe_> !incidents [13:48:16] 5585 (UNACKED) db1170 (paged)/MariaDB Replica SQL: s7 (paged) [13:48:22] <_joe_> !ack 5585 [13:48:22] 5585 (ACKED) db1170 (paged)/MariaDB Replica SQL: s7 (paged) [13:48:26] should I run the fix? 
[13:48:32] <_joe_> please do [13:48:36] doing [13:48:43] <_joe_> I was going to if none of you were around :) [13:48:50] let me depool first [13:49:12] (03Abandoned) 10Brouberol: airflow: enable the injection of custom config files in the worker pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109670 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [13:49:55] !log jynus@cumin1002 dbctl commit (dc=all): 'depool db1170', diff saved to https://phabricator.wikimedia.org/P71983 and previous config saved to /var/cache/conftool/dbconfig/20250110-134954-jynus.json [13:50:09] it's a dump vslow host [13:50:31] _joe_: if still around, can you confirm mw is happy, while I do the SQL? [13:50:36] <_joe_> yes [13:50:37] or anyone else [13:50:39] <_joe_> I'm looking [13:50:52] should have had very little impact, just the probes [13:51:04] I can take care of that [13:51:25] I am already about to run it [13:51:31] Excellent [13:51:37] Thank you so much jynus (and _joe_) [13:52:22] huwiki is larger than I expected [13:52:28] took 10 seconds [13:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:52:36] jynus: Once done, let me know so I can update and rebuild tables in all the other wikis [13:52:47] RECOVERY - MariaDB Replica SQL: s7 #page on db1170 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:52:54] \o/ [13:53:11] marostegui: do you want me to hand over the host to you and you repool it, etc? 
[13:53:18] jynus: yeah I will take it from here [13:53:21] Thank you for fixing it [13:53:26] ok, then not touching it anymore [13:53:30] it is depooled atm [13:53:32] Thanks! [13:53:48] I filed the spreadsheet [13:53:58] ah, yes, it was the other thing I was going to mention [13:54:00] thanks [13:54:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1170.eqiad.wmnet with reason: maintenance [13:54:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1170.eqiad.wmnet with reason: maintenance [13:54:28] oh, it is super late for me, going for lunch [13:54:34] jynus: :** [13:54:57] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@7a1a552]: Backfill 2024 12: cassandra_load_pageview_per_article [13:56:17] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@7a1a552]: Backfill 2024 12: cassandra_load_pageview_per_article (duration: 01m 19s) [13:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:58:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: maintenance [13:58:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: maintenance [13:58:26] Going to downtime more cause it won't finish today [13:59:53] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2199.codfw.wmnet with reason: host reimage [14:02:43] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2200.codfw.wmnet with reason: host reimage [14:03:24] !log 
jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2199.codfw.wmnet with reason: host reimage [14:06:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2200.codfw.wmnet with reason: host reimage [14:16:26] (03PS1) 10Ssingh: P:dns::auth: remove redundant : in SAL log [puppet] - 10https://gerrit.wikimedia.org/r/1109708 [14:18:50] (03CR) 10Ssingh: [C:03+2] P:dns::auth: remove redundant : in SAL log [puppet] - 10https://gerrit.wikimedia.org/r/1109708 (owner: 10Ssingh) [14:23:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2199.codfw.wmnet with OS bookworm [14:23:55] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2201.codfw.wmnet with OS bookworm [14:24:06] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2201 [14:24:12] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:26:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2200.codfw.wmnet with OS bookworm [14:26:39] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2202.codfw.wmnet with OS bookworm [14:26:50] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2202 [14:27:16] (03PS1) 10Kamila Součková: decom wikikube-worker10[08-10,13,14,17,18] [puppet] - 10https://gerrit.wikimedia.org/r/1109712 (https://phabricator.wikimedia.org/T375842) [14:27:35] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2201 - jelto@cumin1002" [14:27:38] 10SRE-Access-Requests: Create Kerberos identity for Jimmy Ly - https://phabricator.wikimedia.org/T381986#10448250 (10Gehel) [14:28:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 3 others: wdqs1025 fails to PXE boot, 
NIC shows "no link" in DRAC web UI - https://phabricator.wikimedia.org/T381283#10448256 (10Gehel) [14:29:18] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:30:11] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Request Kerberos identity for jsn.sherman - https://phabricator.wikimedia.org/T378786#10448279 (10Gehel) [14:30:29] 07sre-alert-triage: Alert in need of triage: ProbeDown (instance centrallog2002:6514) - https://phabricator.wikimedia.org/T377703#10448284 (10Gehel) [14:31:05] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for jebe - https://phabricator.wikimedia.org/T377490#10448288 (10Gehel) [14:32:35] (03PS2) 10Ladsgroup: mariadb: Add file tables and OAuthRateLimiter table to the catalog [puppet] - 10https://gerrit.wikimedia.org/r/1109467 (https://phabricator.wikimedia.org/T363581) [14:33:09] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2202 - jelto@cumin1002" [14:33:12] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Add file tables and OAuthRateLimiter table to the catalog [puppet] - 10https://gerrit.wikimedia.org/r/1109467 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [14:33:56] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T369829#10448327 (10Gehel) [14:35:44] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146#10448348 (10Gehel) [14:36:27] (03PS1) 10Brouberol: airflow-research: disable the airflow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1109714 (https://phabricator.wikimedia.org/T380620) [14:40:30] (03CR) 10JMeybohm: [C:03+1] prometheus: k8s instances migration to prometheus::instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108772 
(https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [14:40:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2201 - jelto@cumin1002" [14:42:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:42:36] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2201.codfw.wmnet 227.48.192.10.in-addr.arpa 7.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:42:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2201.codfw.wmnet 227.48.192.10.in-addr.arpa 7.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:42:40] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2201 [14:42:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2201 [14:42:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2201 [14:45:28] FIRING: [10x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:45:31] RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:19] !log jelto@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2202 - jelto@cumin1002" [14:46:20] !log jelto@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:46:24] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:48:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: 10Dreamrimmer) [14:48:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:48:45] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2202.codfw.wmnet 226.48.192.10.in-addr.arpa 6.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:48:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2202.codfw.wmnet 226.48.192.10.in-addr.arpa 6.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:48:49] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2202 [14:49:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2202 [14:49:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2202 [14:51:18] (03PS1) 10Federico Ceratto: Add myself (fceratto) to ops [puppet] - 10https://gerrit.wikimedia.org/r/1109716 [14:51:55] (03CR) 10CI reject: [V:04-1] Add myself (fceratto) to ops [puppet] - 10https://gerrit.wikimedia.org/r/1109716 
(owner: 10Federico Ceratto) [14:52:43] (03CR) 10Jforrester: Disable Dns Blacklist checks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108179 (https://phabricator.wikimedia.org/T382987) (owner: 10Reedy) [14:54:52] (03PS2) 10Federico Ceratto: Add myself (fceratto) to ops [puppet] - 10https://gerrit.wikimedia.org/r/1109716 [14:56:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:02:20] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2201.codfw.wmnet with reason: host reimage [15:03:32] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:06:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2201.codfw.wmnet with reason: host reimage [15:08:30] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2202.codfw.wmnet with reason: host reimage [15:08:32] RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:12:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2202.codfw.wmnet with reason: host reimage [15:14:18] (03CR) 10Reedy: Disable Dns Blacklist checks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108179 (https://phabricator.wikimedia.org/T382987) (owner: 10Reedy) [15:15:56] (03PS1) 10Marostegui: db2240: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1109721 [15:16:48] (03PS1) 10Eevans: Upgrade to v1.0.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109722 (https://phabricator.wikimedia.org/T383371) [15:17:12] 10SRE-Access-Requests: Create Kerberos identity for Jimmy Ly - https://phabricator.wikimedia.org/T381986#10448506 (10Gehel) [15:17:38] (03CR) 10Marostegui: [C:03+2] db2240: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1109721 (owner: 10Marostegui) [15:18:37] (03PS1) 10STran: ipoid: Bump activeDeadlineSeconds to 1 week [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109723 (https://phabricator.wikimedia.org/T374414) [15:18:44] (03CR) 10Elukey: Upgrade to v1.0.11 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109722 (https://phabricator.wikimedia.org/T383371) (owner: 10Eevans) [15:19:36] (03PS1) 10Marostegui: db2240: Make it candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1109724 (https://phabricator.wikimedia.org/T373579) [15:20:02] (03CR) 10Marostegui: [C:03+2] db2240: Make it candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1109724 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [15:20:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2240 to make it candidate master', diff saved to https://phabricator.wikimedia.org/P71984 and previous config saved to 
/var/cache/conftool/dbconfig/20250110-152035-marostegui.json [15:20:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2240.codfw.wmnet with reason: maintenance [15:21:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2240.codfw.wmnet with reason: maintenance [15:21:08] (03CR) 10Kosta Harlan: [C:03+1] ipoid: Bump activeDeadlineSeconds to 1 week [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109723 (https://phabricator.wikimedia.org/T374414) (owner: 10STran) [15:21:46] (03CR) 10Eevans: Upgrade to v1.0.11 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109722 (https://phabricator.wikimedia.org/T383371) (owner: 10Eevans) [15:22:12] (03PS2) 10Eevans: Upgrade date-gateway service to v1.0.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109722 (https://phabricator.wikimedia.org/T383371) [15:22:33] (03CR) 10Herron: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [15:25:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 3 others: wdqs1025 fails to PXE boot, NIC shows "no link" in DRAC web UI - https://phabricator.wikimedia.org/T381283#10448549 (10Gehel) [15:25:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db2240.codfw.wmnet with reason: maintenance [15:25:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2240.codfw.wmnet with reason: maintenance [15:25:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2201.codfw.wmnet with OS bookworm [15:30:21] 07sre-alert-triage: Alert in need of triage: ProbeDown (instance centrallog2002:6514) - https://phabricator.wikimedia.org/T377703#10448612 (10Gehel) [15:31:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Request Kerberos identity for jsn.sherman - 
https://phabricator.wikimedia.org/T378786#10448630 (10Gehel) [15:32:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2202.codfw.wmnet with OS bookworm [15:32:45] !log homer 'lsw1-d3-codfw*' commit 'T377877' [15:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:49] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [15:33:21] !log homer 'lsw1-d5-codfw*' commit 'T377877' [15:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:57] !log homer 'lsw1-d1-codfw*' commit 'T377877' [15:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:46] !log homer 'cr*codfw*' commit 'T377877' [15:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:14] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for jebe - https://phabricator.wikimedia.org/T377490#10448675 (10Gehel) [15:35:23] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 146, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:14] (03PS1) 10JMeybohm: admin_ng RBAC: Fix prometheus clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109728 (https://phabricator.wikimedia.org/T383413) [15:36:28] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2199-2202].codfw.wmnet [15:36:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2199-2202].codfw.wmnet [15:38:13] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383341#10448708 (10Jelto) [15:41:32] (03PS1) 10Cathal Mooney: DNS Template include statements for new WMCS IPv6 ranges [dns] - 10https://gerrit.wikimedia.org/r/1109732 
(https://phabricator.wikimedia.org/T379283) [15:42:39] (03CR) 10Eevans: [C:03+2] Upgrade date-gateway service to v1.0.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109722 (https://phabricator.wikimedia.org/T383371) (owner: 10Eevans) [15:42:55] (03CR) 10CI reject: [V:04-1] DNS Template include statements for new WMCS IPv6 ranges [dns] - 10https://gerrit.wikimedia.org/r/1109732 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [15:43:27] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T369829#10448754 (10Gehel) [15:43:41] (03Merged) 10jenkins-bot: Upgrade date-gateway service to v1.0.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109722 (https://phabricator.wikimedia.org/T383371) (owner: 10Eevans) [15:45:43] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [15:46:00] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [15:47:14] (03PS1) 10JMeybohm: kubelet: Use the chained certificate for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1109733 (https://phabricator.wikimedia.org/T383413) [15:47:16] (03PS1) 10JMeybohm: prometheus::k8s: Move away from kubelet readOnlyPort [puppet] - 10https://gerrit.wikimedia.org/r/1109734 (https://phabricator.wikimedia.org/T383413) [15:47:17] (03PS1) 10JMeybohm: kubelet: Disable the readOnlyPort [puppet] - 10https://gerrit.wikimedia.org/r/1109735 (https://phabricator.wikimedia.org/T383413) [15:48:15] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109735 (https://phabricator.wikimedia.org/T383413) (owner: 10JMeybohm) [15:48:17] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109734 (https://phabricator.wikimedia.org/T383413) (owner: 10JMeybohm) [15:48:18] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109733 
(https://phabricator.wikimedia.org/T383413) (owner: 10JMeybohm) [15:48:58] 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad 1 VM for Matomo - https://phabricator.wikimedia.org/T362146#10448800 (10Gehel) [15:50:36] (03CR) 10Ssingh: [C:03+1] "trust the script, Luke" [dns] - 10https://gerrit.wikimedia.org/r/1109732 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [15:54:33] (03PS1) 10Dreamrimmer: Enable abusefilter-log-detail for autoconfirmed users on en.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109736 (https://phabricator.wikimedia.org/T383332) [15:57:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109736 (https://phabricator.wikimedia.org/T383332) (owner: 10Dreamrimmer) [15:59:53] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#10448873 (10Gehel) [16:19:54] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:23:48] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:23:51] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:25:47] !log eevans@deploy2002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [16:26:20] !log eevans@deploy2002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [16:27:03] 06SRE, 10SRE-Access-Requests: Requesting shell access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241#10449030 (10Kgraessle) >>! In T383241#10446369, @Dzahn wrote: > Hello @Kgraessle > > it looks to me like you already have shell access, an SSH key and membership i... 
[16:27:46] !log eevans@deploy2002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [16:28:02] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names in newly assigned wmcs private ipv6 ranges - cmooney@cumin1002" [16:28:06] !log eevans@deploy2002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [16:28:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names in newly assigned wmcs private ipv6 ranges - cmooney@cumin1002" [16:28:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:36:37] (03CR) 10Andrew Bogott: [C:04-1] "I am no longer sure that we're ready for this. Attached task has details." [puppet] - 10https://gerrit.wikimedia.org/r/1109176 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott) [16:41:35] 10ops-codfw, 06SRE, 06DC-Ops: Fatal error detected on elastic2088 - https://phabricator.wikimedia.org/T361286#10449171 (10Gehel) [16:43:19] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:47:20] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:49:43] (03PS1) 10Marostegui: db2235: Add small note [puppet] - 10https://gerrit.wikimedia.org/r/1109748 [16:50:39] (03CR) 10Marostegui: [C:03+2] db2235: Add small note [puppet] - 10https://gerrit.wikimedia.org/r/1109748 (owner: 10Marostegui) [16:50:53] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names in newly assigned wmcs private ipv6 ranges - cmooney@cumin1002" [16:50:58] !log cmooney@cumin1002 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names in newly assigned wmcs private ipv6 ranges - cmooney@cumin1002" [16:50:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:52:19] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:53:08] (03PS2) 10Cathal Mooney: DNS Template include statements for new WMCS IPv6 ranges [dns] - 10https://gerrit.wikimedia.org/r/1109732 (https://phabricator.wikimedia.org/T379283) [16:54:52] (03CR) 10Cathal Mooney: [C:03+2] DNS Template include statements for new WMCS IPv6 ranges [dns] - 10https://gerrit.wikimedia.org/r/1109732 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [16:55:12] !log cmooney@dns2005 START - running authdns-update [16:56:37] !log cmooney@dns2005 END - running authdns-update [17:02:38] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295#10449296 (10Gehel) [17:04:01] 06SRE, 10[DEPRECATED] wdwb-tech, 10API Platform, 06cloud-services-team, and 15 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953#10449301 (10Gehel) [17:04:08] 10ops-codfw, 06SRE, 06DC-Ops: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659#10449303 (10Gehel) [17:04:30] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10Event-Platform: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296#10449307 (10Gehel) [17:04:52] 06SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645#10449311 (10Gehel) [17:04:55] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 
others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#10449312 (10Gehel) [17:06:13] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10449329 (10Jhancock.wm) [17:10:27] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [17:12:58] (03PS1) 10Eevans: ml-cache: upgrade Cassandra to 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1109750 (https://phabricator.wikimedia.org/T380420) [17:13:27] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109750 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [17:19:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10449350 (10phaultfinder) [17:34:46] (03PS4) 10Bking: cloudelastic: add cloudelastic10[12] into production [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) [17:35:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [17:44:50] (03PS1) 10Ladsgroup: Add wikitech.wikimedia.org to list of local vhosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109752 (https://phabricator.wikimedia.org/T376305) [18:02:37] (03PS1) 10CDanis: OpenTelemetry tracing to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109754 (https://phabricator.wikimedia.org/T340552) [18:03:05] (03CR) 10CDanis: [C:04-2] "Monday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109754 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [18:25:22] (03CR) 10Ryan Kemper: "We probably need entries in hieradata/role/eqiad/elasticsearch/cloudelastic.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) 
(owner: 10Bking) [18:27:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10449494 (10VRiley-WMF) [18:39:21] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10449513 (10VRiley-WMF) 05Open→03In progress Replacing now [18:42:19] PROBLEM - Host ms-be1090 is DOWN: PING CRITICAL - Packet loss = 100% [18:44:30] (03PS1) 10Ssingh: P:mediawiki::maintenance: remove obsolete include for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/1109755 [18:45:18] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4779/console" [puppet] - 10https://gerrit.wikimedia.org/r/1109755 (owner: 10Ssingh) [18:45:28] FIRING: [10x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:48:08] (03CR) 10Majavah: [C:03+1] "+1 for fixing the immediate issue with puppet runs, although I guess we should check whether anything else on these hosts depended on the " [puppet] - 10https://gerrit.wikimedia.org/r/1109755 (owner: 10Ssingh) [18:48:48] (03CR) 10Ssingh: [V:03+1 C:03+2] P:mediawiki::maintenance: remove obsolete include for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/1109755 (owner: 10Ssingh) [18:49:33] !log sudo cumin 'P:Mediawiki::Maintenance' 'run-puppet-agent': CR 1109755 [18:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:21] RECOVERY - Host ms-be1090 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [19:04:19] (03PS1) 10Bartosz Dziewoński: Add license messages for new Wikinews licenses [extensions/WikimediaMessages] 
(wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1109756 (https://phabricator.wikimedia.org/T383338) [19:04:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaMessages] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1109756 (https://phabricator.wikimedia.org/T383338) (owner: 10Bartosz Dziewoński) [19:08:32] (03CR) 10Eevans: [C:03+2] ml-cache: upgrade Cassandra to 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1109750 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [19:12:31] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10449615 (10VRiley-WMF) The drive has been replaced and the unit sees the device, but it seems as though it is not apart of the JBOD. investigating to see if there is another step. [19:23:33] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [19:23:37] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [19:26:17] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:26:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:26:36] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on eventlog1003.eqiad.wmnet with reason: Shutting down VM in preparation for decommissioning [19:26:50] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on eventlog1003.eqiad.wmnet with reason: Shutting down VM in preparation for decommissioning [19:27:07] RECOVERY - mailman archives on lists1004 
is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:27:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:31:41] (03CR) 10Btullis: "I did a `sudo systemctl reset-failed` on eventlog1003 to clear the alerts." [puppet] - 10https://gerrit.wikimedia.org/r/1109157 (owner: 10Ottomata) [19:41:15] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [19:41:19] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [19:43:29] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [20:01:14] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [20:01:17] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [20:04:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:22:43] (03PS1) 10Eevans: cassandra: rotate target_version 'dev' to '4.x' [puppet] - 10https://gerrit.wikimedia.org/r/1109767 (https://phabricator.wikimedia.org/T380420) [20:24:11] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109767 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [20:24:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes 
(http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:28:12] (03PS2) 10Eevans: cassandra: rotate target_version 'dev' to '4.x' [puppet] - 10https://gerrit.wikimedia.org/r/1109767 (https://phabricator.wikimedia.org/T380420) [20:28:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:29:03] (03PS3) 10Eevans: cassandra: rotate target_version 'dev' to '4.x' [puppet] - 10https://gerrit.wikimedia.org/r/1109767 (https://phabricator.wikimedia.org/T380420) [20:30:00] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109767 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [20:39:42] FIRING: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:43:10] (03PS1) 10Eevans: cassandra: set target_dev to 4.x (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/1109768 (https://phabricator.wikimedia.org/T380420) [20:44:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:45:14] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109768 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) 
[20:47:10] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109768 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [20:48:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:10:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:12:24] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:12:53] 06SRE, 10SRE-Access-Requests: Requesting shell access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241#10449865 (10Dzahn) Hi @Kgraessle first, I can confirm your user exists on the machine stat1008 and is also in the group in question. So it should be a matter of c... 
[21:15:31] RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:18:06] (03Abandoned) 10Pppery: Disable DeadEndPages and LonelyPages on Commons per community request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1095334 (https://phabricator.wikimedia.org/T371662) (owner: 10Pppery) [21:18:44] (03Abandoned) 10Pppery: Don't try to update Special:DeadEndPages on Commons [puppet] - 10https://gerrit.wikimedia.org/r/1095353 (https://phabricator.wikimedia.org/T371662) (owner: 10Pppery) [21:22:42] Not sure what would need reverting, but T383415 is problematic. [21:22:43] T383415: [wmf.11 - regression] Custom tags not working with UploadWizard - https://phabricator.wikimedia.org/T383415 [21:24:06] (03PS5) 10Bking: cloudelastic: add cloudelastic10[12] into production [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) [21:24:49] 06SRE, 10SRE-Access-Requests: Requesting shell access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241#10449905 (10Kgraessle) >>! In T383241#10449865, @Dzahn wrote: > Hi @Kgraessle > > first, I can confirm your user exists on the machine stat1008 and is also in the... [21:25:23] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [21:32:25] 06SRE-OnFire, 10Observability-Alerting, 10Sustainability (Incident Followup): api_appserver Average latency exceeded alert fired late when latency was declining again - https://phabricator.wikimedia.org/T334949#10449915 (10andrea.denisse) 05Open→03Declined Closing as this hasn't happened in a long time. 
[21:32:34] (03CR) 10Ryan Kemper: [C:03+1] cloudelastic: add cloudelastic10[12] into production [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [21:33:02] (03CR) 10Bking: [C:03+2] cloudelastic: add cloudelastic10[12] into production [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [21:35:38] 06SRE, 10Observability-Alerting: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869#10449928 (10andrea.denisse) p:05Triage→03Medium [21:44:48] 06SRE, 10Observability-Alerting: prometheus-icinga-am.service Fails to Start on alert2001 - https://phabricator.wikimedia.org/T358838#10449940 (10andrea.denisse) p:05Triage→03Medium [21:46:29] 07Puppet, 10MW-on-K8s, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q2): Clean up "git repo needs merge" checks - https://phabricator.wikimedia.org/T370530#10449955 (10andrea.denisse) p:05Triage→03Low [21:56:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:57:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on cloudelastic1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [22:01:38] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:07:28] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on cloudelastic1011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [22:10:09] (03PS1) 10Dzahn: Revert "add za.wikimedia.org and za.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) [22:10:17] (03CR) 10CI reject: [V:04-1] Revert "add za.wikimedia.org and za.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) (owner: 10Dzahn) [22:10:32] (03PS2) 10Dzahn: Revert "add za.wikimedia.org and za.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) [22:10:39] (03CR) 10CI reject: [V:04-1] Revert "add za.wikimedia.org and za.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) (owner: 10Dzahn) [22:12:00] (03PS1) 10Dzahn: Revert "Add uz.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1109778 (https://phabricator.wikimedia.org/T382730) [22:15:43] (03CR) 10Dzahn: "will rebase before merge, aware it needs that" [dns] - 10https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) (owner: 10Dzahn) [22:16:37] (03CR) 10Dzahn: "ack" [puppet] - 10https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: 10Dzahn) [22:45:28] FIRING: [10x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:45:42] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: 
cloudelastic1005,cloudelastic1006 for ban hosts prior to decom - bking@cumin2002 - T380937 [22:45:43] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: cloudelastic1005,cloudelastic1006 for ban hosts prior to decom - bking@cumin2002 - T380937 [22:45:45] T380937: decommission cloudelastic100[5-6] - https://phabricator.wikimedia.org/T380937 [22:45:52] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1005*,cloudelastic1006* for ban hosts prior to decom - bking@cumin2002 - T380937 [22:45:56] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1005*,cloudelastic1006* for ban hosts prior to decom - bking@cumin2002 - T380937 [22:47:17] (03PS1) 10Cwhite: logstash: ensure service is a hash at ecs pre-filter step [puppet] - 10https://gerrit.wikimedia.org/r/1109781 (https://phabricator.wikimedia.org/T382105) [22:48:15] (03PS2) 10Cwhite: logstash: ensure service is a hash at ecs pre-filter step [puppet] - 10https://gerrit.wikimedia.org/r/1109781 (https://phabricator.wikimedia.org/T382105) [22:49:24] (03PS3) 10Cwhite: logstash: ensure service is a hash at ecs pre-filter step [puppet] - 10https://gerrit.wikimedia.org/r/1109781 (https://phabricator.wikimedia.org/T382105) [22:51:31] (03CR) 10CI reject: [V:04-1] logstash: ensure service is a hash at ecs pre-filter step [puppet] - 10https://gerrit.wikimedia.org/r/1109781 (https://phabricator.wikimedia.org/T382105) (owner: 10Cwhite) [22:51:38] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:52:46] (03PS4) 10Cwhite: logstash: ensure service is a hash at ecs pre-filter 
step [puppet] - 10https://gerrit.wikimedia.org/r/1109781 (https://phabricator.wikimedia.org/T382105) [22:56:38] RESOLVED: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:58:06] (03CR) 10Cwhite: [C:03+2] logstash: ensure service is a hash at ecs pre-filter step [puppet] - 10https://gerrit.wikimedia.org/r/1109781 (https://phabricator.wikimedia.org/T382105) (owner: 10Cwhite) [23:06:33] PROBLEM - SSH on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:07:23] RECOVERY - SSH on moscovium is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:27:31] (03CR) 10BCornwall: [C:03+1] Revert "add za.wikimedia.org and za.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) (owner: 10Dzahn) [23:28:59] (03CR) 10BCornwall: [C:03+1] certificates: add wiki[m|p]edia.ro to ncredir Letsencrypt cert 7 [puppet] - 10https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: 10Dzahn) [23:29:07] (03CR) 10BCornwall: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: 10Dzahn) [23:30:12] (03CR) 10BCornwall: [C:03+1] Revert "Add uz.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1109778 (https://phabricator.wikimedia.org/T382730) (owner: 10Dzahn)