[00:00:33] RECOVERY - Check systemd state on centrallog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:27] RECOVERY - Check systemd state on centrallog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:04] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:16:47] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:36:47] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/981435 [00:38:34] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/981435 (owner: 10TrainBranchBot) [00:39:27] PROBLEM - Check systemd state on centrallog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:41:31] PROBLEM - Check systemd state on centrallog1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:51] PROBLEM - Check systemd state on 
logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:52:59] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:57:49] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/981435 (owner: 10TrainBranchBot) [01:08:00] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:15:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:51:09] (03PS1) 10Andrew Bogott: Update codfw1dev horizon version [puppet] - 10https://gerrit.wikimedia.org/r/981702 [01:51:43] (03CR) 10Andrew Bogott: [C: 03+2] Update codfw1dev horizon version [puppet] - 10https://gerrit.wikimedia.org/r/981702 (owner: 10Andrew Bogott) [02:01:19] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:16:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:24:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:24:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:25:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:26:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:39:09] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:35] (03PS1) 10RLazarus: admin_ng: Add namespace and ClusterRole for Job sidecar controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284) [02:42:51] (03PS1) 10RLazarus: admin_ng: Switch on enableJobSidecarController for mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/981704 (https://phabricator.wikimedia.org/T348284) [03:05:29] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:06:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:09:09] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:12:02] (03PS2) 10Stang: zhwiki: Remove abusefilter-view-private from rollbacker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949613 (https://phabricator.wikimedia.org/T344398) [03:33:23] (03PS1) 10Andrew Bogott: Horizon: update version in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/981706 (https://phabricator.wikimedia.org/T326818) [03:34:30] (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [03:35:54] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: update version in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/981706 (https://phabricator.wikimedia.org/T326818) (owner: 10Andrew Bogott) [03:46:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:47:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:56:53] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:01:23] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:25:03] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:26:45] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:28:15] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:34:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:37:04] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:44:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:51:01] (03PS1) 10KartikMistry: Update MinT to 2023-12-08-151348-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/981709 (https://phabricator.wikimedia.org/T352690) 
[05:34:06] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981710 (https://phabricator.wikimedia.org/T351787) [05:34:34] (03PS1) 10Marostegui: pc1011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/981711 [05:34:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: pc1 master switch T351787 [05:34:50] T351787: Upgrade pc1 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351787 [05:35:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: pc1 master switch T351787 [05:35:18] (03CR) 10Marostegui: [C: 03+2] pc1011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/981711 (owner: 10Marostegui) [05:35:31] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981710 (https://phabricator.wikimedia.org/T351787) (owner: 10Marostegui) [05:36:23] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981710 (https://phabricator.wikimedia.org/T351787) (owner: 10Marostegui) [05:37:14] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:981710|ProductionServices.php: Promote pc1014 to pc1 (T351787)]] [05:37:18] (03PS1) 10Marostegui: pc1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/981712 [05:37:50] (03CR) 10Marostegui: [C: 03+2] pc1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/981712 (owner: 10Marostegui) [05:46:43] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:981710|ProductionServices.php: Promote pc1014 to pc1 (T351787)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [05:46:47] T351787: Upgrade pc1 to 
Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351787 [05:47:26] !log marostegui@deploy2002 marostegui: Continuing with sync [05:54:09] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:981710|ProductionServices.php: Promote pc1014 to pc1 (T351787)]] (duration: 16m 54s) [05:54:13] T351787: Upgrade pc1 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351787 [05:55:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1011.eqiad.wmnet with OS bookworm [06:03:50] marostegui: OK to deploy MinT? [06:05:30] kart_: go for it! [06:06:40] Thanks! [06:07:02] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [06:07:06] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [06:07:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1011.eqiad.wmnet with reason: host reimage [06:07:25] ah. I forgot to merge the patch ;) [06:07:44] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-12-08-151348-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/981709 (https://phabricator.wikimedia.org/T352690) (owner: 10KartikMistry) [06:08:22] (03PS1) 10Marostegui: Revert "pc1014: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/981727 [06:08:50] (03Merged) 10jenkins-bot: Update MinT to 2023-12-08-151348-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/981709 (https://phabricator.wikimedia.org/T352690) (owner: 10KartikMistry) [06:09:08] (03PS1) 10Marostegui: Revert "pc1011: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/981728 [06:10:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1011.eqiad.wmnet with reason: host reimage [06:13:13] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [06:13:21] (03PS1) 10Marostegui: Revert 
"ProductionServices.php: Promote pc1014 to pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981729 [06:13:29] (03CR) 10Marostegui: [C: 04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981729 (owner: 10Marostegui) [06:16:34] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [06:19:56] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [06:24:12] (03CR) 10Marostegui: [C: 03+2] Revert "pc1014: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/981727 (owner: 10Marostegui) [06:26:47] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [06:26:55] kart_: Can I deploy? [06:27:00] Once you are done [06:27:39] marostegui: Yes. [06:27:56] marostegui: MinT deployment is little slow.. [06:28:21] kart_: no problem, let me know when I can [06:28:40] Sure [06:28:52] thank you [06:29:02] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [06:29:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1011.eqiad.wmnet with OS bookworm [06:29:54] (03CR) 10Marostegui: [C: 03+2] Revert "pc1011: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/981728 (owner: 10Marostegui) [06:30:11] (03CR) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981729 (owner: 10Marostegui) [06:30:48] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T353020 (10Marostegui) Thank you! The host looks all green in Icinga! [06:32:39] (03PS1) 10MilkyDefer: Enable action blocks for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120) [06:32:41] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120) (owner: 10MilkyDefer) [06:34:17] (03CR) 10Marostegui: [C: 03+1] "Is this still needed?" [puppet] - 10https://gerrit.wikimedia.org/r/910598 (https://phabricator.wikimedia.org/T331706) (owner: 10Ladsgroup) [06:34:20] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [06:34:36] marostegui: done. [06:34:40] kart_: thanks! [06:34:42] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981729 (owner: 10Marostegui) [06:35:00] <_joe_> !log update sirenbot to 0.3.7 [06:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:26] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981729 (owner: 10Marostegui) [06:35:43] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:981729|Revert "ProductionServices.php: Promote pc1014 to pc1"]] [06:37:01] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:981729|Revert "ProductionServices.php: Promote pc1014 to pc1"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:37:18] (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120) (owner: 10MilkyDefer) [06:37:23] !log marostegui@deploy2002 marostegui: Continuing with sync [06:44:06] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:981729|Revert "ProductionServices.php: Promote pc1014 to pc1"]] (duration: 08m 22s) [06:44:39] kart_: I am done with all my deployments [06:45:27] cool. 
I'm also :) [06:59:36] (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/981716 (https://phabricator.wikimedia.org/T351864) [07:05:30] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:07:15] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:08:17] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:11:33] (03CR) 10Arnaudb: [V: 03+1 C: 03+1] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/981716 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui) [07:12:12] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/981716 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui) [07:12:44] !log Failover m3-master from dbproxy1020 to dbproxy1026 org [07:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:50] !log Failover m3-master from dbproxy1020 to dbproxy1026 T351864 [07:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:54] T351864: Migrate dbproxy hosts to Bookworm - https://phabricator.wikimedia.org/T351864 [07:24:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2185.codfw.wmnet with reason: reboot for upgrade [07:24:24] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on db2185.codfw.wmnet with reason: reboot for upgrade [07:31:38] !log arnaudb@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db2185.codfw.wmnet with reason: reboot for upgrade [07:31:41] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2185.codfw.wmnet with reason: reboot for upgrade [07:34:31] (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:39:30] (03CR) 10Slyngshede: [C: 03+2] Keymanagement: SSH keys are in some cases not synced to LDAP. [software/bitu] - 10https://gerrit.wikimedia.org/r/978056 (https://phabricator.wikimedia.org/T351139) (owner: 10Slyngshede) [07:39:32] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Keymanagement: SSH keys are in some cases not synced to LDAP. [software/bitu] - 10https://gerrit.wikimedia.org/r/978056 (https://phabricator.wikimedia.org/T351139) (owner: 10Slyngshede) [07:41:33] (03PS1) 10Slyngshede: C:idm:deployment restart Bitu on configuration changes. 
[puppet] - 10https://gerrit.wikimedia.org/r/981942 [07:43:20] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/857/con" [puppet] - 10https://gerrit.wikimedia.org/r/981942 (owner: 10Slyngshede) [07:49:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:53:41] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: reboot for upgrade [07:53:55] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: reboot for upgrade [07:54:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:00:05] Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T0800). [08:00:05] xSavitar and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:26] hi [08:00:29] o/ [08:01:34] kostajh, do you want to deploy first? 
[08:01:39] sure [08:01:49] Okay, ping me when you're done and I'll take it from there [08:02:06] xSavitar: I can sync your patch but was wondering about the comment you left in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/976252/23#message-943e233d7e8473f64d21d5f2948873072b55d999. Is anyone calling isTest()? [08:02:59] kostajh, I don't see any public consumers for now, per code search. So I'm going to test this on a debug host (internally) to make sure it's doing the right thing. So we're good. [08:03:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan) [08:03:18] If you want, you can sync it and I'll test too. [08:03:31] That's after your own patch is done. [08:05:56] (03Merged) 10jenkins-bot: MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan) [08:06:19] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:979969|MediaModeration: Set MediaModerationDeveloperMode to false]] [08:07:56] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:979969|MediaModeration: Set MediaModerationDeveloperMode to false]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:09:05] (03PS5) 10Effie Mouzeli: [admin] Add ehughes shell account with no ssh key [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387) (owner: 10EoghanGaffney) [08:09:12] !log kharlan@deploy2002 kharlan: Continuing with sync [08:10:45] (03CR) 10JMeybohm: admin_ng: Add namespace and ClusterRole for Job sidecar controller (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [08:11:38] (03CR) 10JMeybohm: [C: 03+1] admin_ng: Switch on enableJobSidecarController for mw-script [deployment-charts] - 
10https://gerrit.wikimedia.org/r/981704 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [08:14:49] (03CR) 10Effie Mouzeli: [C: 03+2] [admin] Add ehughes shell account with no ssh key [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387) (owner: 10EoghanGaffney) [08:15:22] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: reboot for upgrade [08:15:35] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: reboot for upgrade [08:16:14] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:16:15] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:979969|MediaModeration: Set MediaModerationDeveloperMode to false]] (duration: 09m 55s) [08:19:45] xSavitar: ok, on to your patch [08:19:52] Ack [08:20:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [08:20:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981424 (https://phabricator.wikimedia.org/T304604) (owner: 10Kosta Harlan) [08:21:02] (03Merged) 10jenkins-bot: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [08:21:06] (03Merged) 10jenkins-bot: IPInfo: Add comment clarifying $wgIPInfoGeoIP2EnterprisePath [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/981424 (https://phabricator.wikimedia.org/T304604) (owner: 10Kosta Harlan) [08:21:20] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:976252|ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (T347366)]], [[gerrit:981424|IPInfo: Add comment clarifying $wgIPInfoGeoIP2EnterprisePath (T304604)]] [08:21:25] T347366: Follow-up on wmf-config "ClusterConfig::isTest" method - https://phabricator.wikimedia.org/T347366 [08:21:26] T304604: Set config for path to MaxMind files on production - https://phabricator.wikimedia.org/T304604 [08:22:30] (03PS8) 10Brouberol: An an option to configure the event log storage location for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/980859 (https://phabricator.wikimedia.org/T352849) [08:22:42] !log kharlan@deploy2002 kharlan and d3r1ck01: Backport for [[gerrit:976252|ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (T347366)]], [[gerrit:981424|IPInfo: Add comment clarifying $wgIPInfoGeoIP2EnterprisePath (T304604)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:23:44] kostajh, looks like I can test now? [08:24:09] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:25:21] xSavitar: yes, please go ahead [08:25:31] Okay, testing now... 
[08:27:50] (03PS1) 10Brouberol: Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) [08:28:19] (03CR) 10CI reject: [V: 04-1] Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [08:29:09] (JobUnavailable) resolved: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:29:22] (03PS2) 10Brouberol: Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) [08:31:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:31:16] (03PS1) 10Effie Mouzeli: admin: Add ehughes to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/981946 (https://phabricator.wikimedia.org/T351387) [08:31:39] kostajh, I've tested on a debug host and on a k8s REPL. The latter works fine, meaning the patch is doing what it's expected to do. [08:32:35] But the former case didn't work and I know why. The hostname is just "deploy2002" so the code doesn't see "debug" in the name (as in the hostnames before). Looks like they've been renamed? [08:32:53] I'll talk with Krinkle about this and see if we can improve the patch a little bit. But yes, it works. 
:) [08:33:20] (03CR) 10Effie Mouzeli: [C: 03+2] admin: Add ehughes to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/981946 (https://phabricator.wikimedia.org/T351387) (owner: 10Effie Mouzeli) [08:33:36] I remember seeing hostnames like: mwdebug1001 or something like that :) [08:34:03] kostajh, so yeah, you can sync this :), I'll leave a comment on the task in Phab [08:35:50] xSavitar: OK [08:36:11] !log kharlan@deploy2002 kharlan and d3r1ck01: Continuing with sync [08:37:05] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:40:12] !log restarted blazegraph on wdqs1006 (BlazegraphFreeAllocatorsDecreasingRapidly) [08:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:22] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:976252|ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (T347366)]], [[gerrit:981424|IPInfo: Add comment clarifying $wgIPInfoGeoIP2EnterprisePath (T304604)]] (duration: 22m 02s) [08:43:28] T347366: Follow-up on wmf-config "ClusterConfig::isTest" method - https://phabricator.wikimedia.org/T347366 [08:43:28] T304604: Set config for path to MaxMind files on production - https://phabricator.wikimedia.org/T304604 [08:43:29] !log UTC morning deploys done [08:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:49] kostajh, thank you very much for deploying my patch. 
I appreciate [08:45:36] (03PS1) 10Effie Mouzeli: admin: add mcastro-wmf to ldap_only _users [puppet] - 10https://gerrit.wikimedia.org/r/981947 (https://phabricator.wikimedia.org/T352684) [08:46:28] (03CR) 10CI reject: [V: 04-1] admin: add mcastro-wmf to ldap_only _users [puppet] - 10https://gerrit.wikimedia.org/r/981947 (https://phabricator.wikimedia.org/T352684) (owner: 10Effie Mouzeli) [08:48:20] (03PS2) 10Effie Mouzeli: admin: add mcastro-wmf to ldap_only _users [puppet] - 10https://gerrit.wikimedia.org/r/981947 (https://phabricator.wikimedia.org/T352684) [08:50:31] (03CR) 10Effie Mouzeli: [C: 03+2] admin: add mcastro-wmf to ldap_only _users [puppet] - 10https://gerrit.wikimedia.org/r/981947 (https://phabricator.wikimedia.org/T352684) (owner: 10Effie Mouzeli) [08:54:58] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for mcastro-wmf - https://phabricator.wikimedia.org/T352684 (10jijiki) @Mcastro done, please reopen if something is not right [08:55:11] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for mcastro-wmf - https://phabricator.wikimedia.org/T352684 (10jijiki) 05Open→03Resolved a:03jijiki [08:55:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10jijiki) 05Open→03Resolved done:) [08:59:21] (03PS18) 10Stevemunene: C:bigtop::hadoop switch to new topology script. 
[puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede) [09:04:40] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] kubernetes: Remove cergen certs from kubernetes secrets [labs/private] - 10https://gerrit.wikimedia.org/r/980891 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:10:58] (03PS1) 10Jelto: phabricator: add dedicated blackbox check for collab team and severity task [puppet] - 10https://gerrit.wikimedia.org/r/981951 (https://phabricator.wikimedia.org/T343517) [09:11:41] (03CR) 10CI reject: [V: 04-1] phabricator: add dedicated blackbox check for collab team and severity task [puppet] - 10https://gerrit.wikimedia.org/r/981951 (https://phabricator.wikimedia.org/T343517) (owner: 10Jelto) [09:12:55] (03PS2) 10Jelto: phabricator: add dedicated blackbox check for collab team and severity task [puppet] - 10https://gerrit.wikimedia.org/r/981951 (https://phabricator.wikimedia.org/T343517) [09:23:32] (03PS3) 10Jelto: wmf-debci: also install recommended dependencies [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (https://phabricator.wikimedia.org/T352003) (owner: 10Giuseppe Lavagetto) [09:24:25] (03PS4) 10Jelto: wmf-debci: also install recommended dependencies [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (https://phabricator.wikimedia.org/T352003) (owner: 10Giuseppe Lavagetto) [09:27:24] (03CR) 10Jelto: "rebased to latest weekly rebuild" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (https://phabricator.wikimedia.org/T352003) (owner: 10Giuseppe Lavagetto) [09:27:51] (03PS14) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) [09:29:00] (03CR) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 
(https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [09:30:37] (03CR) 10Jelto: [V: 03+2 C: 03+2] wmf-debci: also install recommended dependencies [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (https://phabricator.wikimedia.org/T352003) (owner: 10Giuseppe Lavagetto) [09:33:28] (03PS1) 10Vgutierrez: hiera: Disable rp_filter for ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/981955 (https://phabricator.wikimedia.org/T351069) [09:37:14] (03CR) 10Ayounsi: [C: 03+1] "lgtm, I also checked that the service names were the same in the "if $install_via_git" code path." [puppet] - 10https://gerrit.wikimedia.org/r/981942 (owner: 10Slyngshede) [09:37:25] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/863/con" [puppet] - 10https://gerrit.wikimedia.org/r/981955 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [09:38:03] (03PS3) 10D3r1ck01: ClusterConfig: Followup on I955168f072315e0064c69a66483e61dfc23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981954 (https://phabricator.wikimedia.org/T347366) [09:38:22] (03PS1) 10Elukey: service: update recommendation-api's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/981956 (https://phabricator.wikimedia.org/T349118) [09:39:00] (03CR) 10Filippo Giunchedi: [C: 03+2] klaxon: Ensure the klaxon user has a home directory [puppet] - 10https://gerrit.wikimedia.org/r/980921 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [09:40:04] (03CR) 10Filippo Giunchedi: [C: 03+1] openstack_apis_response: add value to the description [alerts] - 10https://gerrit.wikimedia.org/r/981450 (owner: 10David Caro) [09:40:45] (03CR) 10Elukey: [C: 03+2] service: update recommendation-api's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/981956 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey) [09:41:01] 
(03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/981407 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [09:41:14] (03PS3) 10Brouberol: [yarn] Add the option to configure the spark history server address [puppet] - 10https://gerrit.wikimedia.org/r/981948 (https://phabricator.wikimedia.org/T352863) [09:43:07] (03PS3) 10Brouberol: Configure the Spark History server host for the an-test yarn [puppet] - 10https://gerrit.wikimedia.org/r/981949 (https://phabricator.wikimedia.org/T352863) [09:43:23] (03PS3) 10Brouberol: Configure the Spark History server host for the analytics yarn [puppet] - 10https://gerrit.wikimedia.org/r/981950 (https://phabricator.wikimedia.org/T352863) [09:44:05] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/867/con" [puppet] - 10https://gerrit.wikimedia.org/r/981950 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [09:44:17] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: sync [09:44:34] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync [09:48:35] (03CR) 10Hashar: "recheck after CI config https://gerrit.wikimedia.org/r/c/integration/config/+/981464" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [09:49:15] (03CR) 10CI reject: [V: 04-1] Move Debmonitor client code to separate repository. 
[software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [09:50:15] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [09:50:43] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [09:54:19] (03PS1) 10Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/982038 (https://phabricator.wikimedia.org/T351069) [09:54:39] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 1547 [09:54:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 1547 [09:55:03] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [09:55:08] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync [09:55:33] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync [09:55:47] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/868/con" [puppet] - 10https://gerrit.wikimedia.org/r/982038 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [09:56:32] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38753 [09:57:20] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38753 [10:02:23] (03PS1) 10Vgutierrez: hiera: Enable IPIP encapsulation on text|secondary LVS in esams [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069) [10:02:31] (03CR) 10David Caro: [C: 03+2] openstack_apis_response: add value to the description [alerts] - 10https://gerrit.wikimedia.org/r/981450 (owner: 10David Caro) [10:03:12] !log removed cergen 
certs of all k8s services from private puppet in commit d36a97aa23e21824f95d22264d06e2c3bf3c6ac3 - T300033 [10:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:23] T300033: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 [10:03:55] 😬 [10:04:15] (03CR) 10David Caro: [C: 03+2] cloud: add missing codfw1dev:openstack_control_nodes [puppet] - 10https://gerrit.wikimedia.org/r/981448 (https://phabricator.wikimedia.org/T353048) (owner: 10David Caro) [10:04:16] 🤞 [10:04:31] (03Merged) 10jenkins-bot: openstack_apis_response: add value to the description [alerts] - 10https://gerrit.wikimedia.org/r/981450 (owner: 10David Caro) [10:04:51] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:06:14] (03CR) 10Btullis: "This looks good. Can we deploy manually to the test cluster first?" [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede) [10:06:21] (03CR) 10Btullis: [C: 03+1] C:bigtop::hadoop switch to new topology script. [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede) [10:07:14] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nicely done! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/981358 (https://phabricator.wikimedia.org/T163996) (owner: 10Majavah) [10:07:24] (03CR) 10Btullis: [C: 03+1] An an option to configure the event log storage location for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/980859 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [10:11:06] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:11:51] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/982038 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:12:54] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/981955 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:13:04] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP encapsulation on text|secondary LVS in esams [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:13:16] (03CR) 10Vgutierrez: [V: 03+1] hiera: Enable IPIP encapsulation on text|secondary LVS in esams [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:13:23] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Disable rp_filter for ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/981955 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:17:55] (03PS1) 10Elukey: Revert "service: update recommendation-api's docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981732 [10:20:54] (03CR) 10JMeybohm: "If this can wait another day, you could pull in the latest ingress module as well: I95662e864cd4e10cca9c5357db42deffd06ba9e9" [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey) [10:22:06] (03CR) 10David Caro: [C: 03+2] networktests: use tool network-tests instead of personal one [puppet] 
- 10https://gerrit.wikimedia.org/r/967932 (owner: 10David Caro) [10:23:40] (03CR) 10David Caro: [C: 03+2] disable_tool: use the gitlab repository [puppet] - 10https://gerrit.wikimedia.org/r/963260 (https://phabricator.wikimedia.org/T327057) (owner: 10David Caro) [10:24:18] (03CR) 10Hashar: [C: 03+1] "It is fine installing `wheel` with `pip` at least to be consistent with the other images :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry) [10:24:54] jouncebot: nowandnext [10:24:54] No deployments scheduled for the next 0 hour(s) and 35 minute(s) [10:24:54] In 0 hour(s) and 35 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1100) [10:28:16] (03PS1) 10Jelto: wmf-debci: fix templating in Dockerfile RUN command [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982041 (https://phabricator.wikimedia.org/T352003) [10:31:00] (03CR) 10Jelto: "unfortunately there is a little typo in the template, this should fix the issue (tested locally)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982041 (https://phabricator.wikimedia.org/T352003) (owner: 10Jelto) [10:32:19] (03CR) 10Vgutierrez: "you need to pool at least 4 nodes before merging this. 
You could adjust the depool_threshold too" [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [10:33:08] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (10hnowlan) [10:33:43] (03CR) 10Hnowlan: [C: 03+1] service_proxy/mesh: Bump to newer version globally [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [10:35:13] (03PS1) 10Ayounsi: Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 [10:35:54] (03CR) 10Elukey: [C: 03+2] Revert "service: update recommendation-api's docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981732 (owner: 10Elukey) [10:36:06] (03CR) 10Hashar: "The build fails cause `pip` in Debian Bookworm has been to error out when someone tries to install a package to `/usr/local/` with a more " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry) [10:37:23] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP encapsulation on ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/982038 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:37:23] !log Repooling dse-k8s-worker nodes - sudo confctl select "service=kubesvc,cluster=dse-k8s" set/pooled=yes - T352639 [10:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:27] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [10:37:27] (03CR) 10CI reject: [V: 04-1] Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (owner: 10Ayounsi) [10:37:31] (03PS2) 10Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/982038 
(https://phabricator.wikimedia.org/T351069) [10:38:38] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: sync [10:38:50] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync [10:38:55] (03CR) 10Clément Goubert: [C: 03+1] Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [10:38:59] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [10:39:14] (03PS2) 10Hashar: Provide python3-bookworm image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry) [10:39:41] (03CR) 10Vgutierrez: [C: 03+1] Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [10:42:23] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [10:42:23] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync [10:42:23] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync [10:43:06] (03CR) 10Brouberol: [C: 03+2] Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [10:45:52] !log Disabling puppet on O:lvs::balancer - T352639 [10:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:56] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [10:46:16] !log Running puppet on O:lvs::balancer - T352639 [10:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:01] (03PS1) 10Brouberol: Revert "Revert "Add 
discovery records for the k8s-ingress-dse LVS service"" [dns] - 10https://gerrit.wikimedia.org/r/981733 [10:48:11] (03PS2) 10Brouberol: Revert "Revert "Add discovery records for the k8s-ingress-dse LVS service"" [dns] - 10https://gerrit.wikimedia.org/r/981733 [10:50:16] !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[1019-1020].eqiad.wmnet} and A:lvs (T352639) [10:51:20] (03CR) 10Brouberol: "Not to be deployed until we have successfully deployed the k8s-ingress-dse LVS service in lvs_state" [dns] - 10https://gerrit.wikimedia.org/r/981733 (owner: 10Brouberol) [10:54:27] (03PS6) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639) [10:54:48] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[1019-1020].eqiad.wmnet} and A:lvs (T352639) [10:54:52] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [10:55:48] !log (properly) restarting blazegraph on wdqs1006 (BlazegraphFreeAllocatorsDecreasingRapidly) [10:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1100) [11:01:56] (03CR) 10Brouberol: "This PR can now be deployed (when +2ed)" [dns] - 10https://gerrit.wikimedia.org/r/981733 (owner: 10Brouberol) [11:02:34] (03PS3) 10Brouberol: Revert "Revert "Add discovery records for the k8s-ingress-dse LVS service"" [dns] - 10https://gerrit.wikimedia.org/r/981733 (https://phabricator.wikimedia.org/T352639) [11:03:32] (03CR) 10Clément Goubert: [C: 03+1] Revert "Revert "Add discovery records for the k8s-ingress-dse LVS service"" [dns] - 10https://gerrit.wikimedia.org/r/981733 
(https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [11:04:25] (03CR) 10Klausman: python-webapp: update mesh and base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey) [11:04:27] (03PS1) 10Brouberol: Switch state of k8s-ingress-dse LVS service to production [puppet] - 10https://gerrit.wikimedia.org/r/982045 (https://phabricator.wikimedia.org/T352639) [11:05:07] (03CR) 10Clément Goubert: [C: 03+1] Switch state of k8s-ingress-dse LVS service to production [puppet] - 10https://gerrit.wikimedia.org/r/982045 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [11:05:22] (03CR) 10Brouberol: [C: 03+2] Switch state of k8s-ingress-dse LVS service to production [puppet] - 10https://gerrit.wikimedia.org/r/982045 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [11:05:30] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:06:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:06:34] (03CR) 10Elukey: python-webapp: update mesh and base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey) [11:10:25] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 128, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:12:10] (03CR) 10Brouberol: [C: 03+2] Revert "Revert "Add discovery records for the k8s-ingress-dse LVS 
service"" [dns] - 10https://gerrit.wikimedia.org/r/981733 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [11:12:51] !log Add discovery records for the k8s-ingress-dse LVS service - T352639 [11:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:55] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [11:14:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 127, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:15:16] (03CR) 10Reedy: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [11:15:59] (03CR) 10Reedy: Make wiktionary and mw.org provide og:site_name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [11:16:22] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP encapsulation on text|secondary LVS in esams [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:16:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [11:16:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [11:16:57] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982041 (https://phabricator.wikimedia.org/T352003) (owner: 10Jelto) [11:18:29] (03PS4) 10EoghanGaffney: [apt-staging] Add script to pull artifacts from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/979912 [11:18:38] !log sudo confctl --object-type discovery select 'name=eqiad,dnsdisc=k8s-ingress-dse' set/pooled=true - T352639 [11:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:41] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [11:19:23] (03CR) 10Reedy: Make wiktionary and mw.org provide og:site_name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [11:20:03] !log rolling restart of pybal on lvs3010 and lvs3008 effectively enabling IPIP encapsulation on ncredir@esams - T351069 [11:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:08] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [11:22:08] 10SRE-swift-storage, 10Observability-Metrics, 10Grafana: Disk space thanos-be1001:9100 alert - https://phabricator.wikimedia.org/T353091 (10MatthewVernon) @fgiunchedi are you in a position to reduce some thanos disk usage/retention? Most swift drives are 93/4% full now: ` mvernon@thanos-fe1001:~$ sudo swift-... [11:24:06] 10SRE-swift-storage, 10Observability-Metrics, 10Grafana: Disk space thanos-be1001:9100 alert - https://phabricator.wikimedia.org/T353091 (10MatthewVernon) Quite significant growth in thanos disk usage over the last 6 months: https://grafana.wikimedia.org/d/NDWQoBiGk/thanos-swift?orgId=1&from=1686482606897&to... 
[11:30:31] (03PS2) 10Dreamy Jazz: Enable read new on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979986 (https://phabricator.wikimedia.org/T341829) [11:30:43] (03CR) 10Clément Goubert: [C: 03+1] Provide python3-bookworm image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry) [11:31:53] (03PS1) 10Urbanecm: Revert "Growth: Enable Welcome survey user research for ar/en/es" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981734 (https://phabricator.wikimedia.org/T351266) [11:32:00] (03PS2) 10Urbanecm: Revert "Growth: Enable Welcome survey user research for ar/en/es" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981734 (https://phabricator.wikimedia.org/T351266) [11:34:13] 10SRE, 10Observability-Metrics, 10Goal, 10Patch-For-Review: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10elukey) @colewhite hi again! I added some context to https://gerrit.wikimedia.org/r/c/mediawiki/services/recommendation-api/+/982047, now I have a better idea about... 
[11:34:31] (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:36:24] (03CR) 10Ladsgroup: mariadb: Add lists1003 grants for mailman dbs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910598 (https://phabricator.wikimedia.org/T331706) (owner: 10Ladsgroup) [11:36:29] (03Abandoned) 10Ladsgroup: mariadb: Add lists1003 grants for mailman dbs [puppet] - 10https://gerrit.wikimedia.org/r/910598 (https://phabricator.wikimedia.org/T331706) (owner: 10Ladsgroup) [11:38:07] (03CR) 10Hnowlan: changeprop: refactor templating for Kafka producer/consumer settings (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [11:45:51] (03CR) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [11:48:20] (03CR) 10Klausman: python-webapp: update mesh and base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey) [11:50:48] (03CR) 10Hashar: contint: rename jenkins-slave to jenkins-agent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [11:51:56] 10SRE, 10serviceops: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10Clement_Goubert) [11:52:23] (03PS10) 10Hashar: contint: rename jenkins-slave to jenkins-agent [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) [11:52:58] 10SRE, 10serviceops: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10Clement_Goubert) p:05Triage→03Medium 
[11:52:59] (PuppetFailure) firing: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:57:27] (03CR) 10Jelto: [V: 03+2 C: 03+2] wmf-debci: fix templating in Dockerfile RUN command [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982041 (https://phabricator.wikimedia.org/T352003) (owner: 10Jelto) [11:57:29] (03CR) 10Ilias Sarantopoulos: [C: 03+1] python-webapp: update mesh and base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey) [12:00:10] (03PS1) 10Clément Goubert: kubernetes10[59-62]: add to eqiad.k8s [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) [12:02:33] (03CR) 10Urbanecm: [C: 03+2] Revert "Growth: Enable Welcome survey user research for ar/en/es" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981734 (https://phabricator.wikimedia.org/T351266) (owner: 10Urbanecm) [12:02:59] (PuppetFailure) resolved: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:03:17] (03Merged) 10jenkins-bot: Revert "Growth: Enable Welcome survey user research for ar/en/es" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981734 (https://phabricator.wikimedia.org/T351266) (owner: 10Urbanecm) [12:03:57] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:981734|Revert "Growth: Enable Welcome survey user research for ar/en/es" (T351266)]] [12:04:02] T351266: enable the T342353 checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T351266 [12:05:24] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:981734|Revert "Growth: Enable Welcome 
survey user research for ar/en/es" (T351266)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:05:28] !log urbanecm@deploy2002 urbanecm: Continuing with sync [12:08:37] (03CR) 10Brouberol: [C: 03+2] Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [12:08:44] (03PS7) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639) [12:11:25] !log Adding spark-history(-test).svc.eqiad.wmnet CNAMEs pointing to k8s-ingress-dse.svc.eqiad.wmnet. - T352639 [12:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:40] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [12:12:17] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:981734|Revert "Growth: Enable Welcome survey user research for ar/en/es" (T351266)]] (duration: 08m 20s) [12:12:21] T351266: enable the T342353 checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T351266 [12:15:20] (03CR) 10Ayounsi: [C: 03+1] "In prevision of deploying I1c0bfa369b886c648bf1f27afd6ee581daed0625" [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert) [12:24:34] (03CR) 10Clément Goubert: kubernetes10[59-62]: add to eqiad.k8s (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert) [12:25:34] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/981439 [12:26:10] (03PS1) 10Vgutierrez: hiera: Disable rp_filter on ncredir@eqiad [puppet] - 
10https://gerrit.wikimedia.org/r/982063 (https://phabricator.wikimedia.org/T351069) [12:27:58] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/870/con" [puppet] - 10https://gerrit.wikimedia.org/r/982063 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [12:28:54] (03PS1) 10Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) [12:30:11] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/871/con" [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [12:32:12] (03PS1) 10Clément Goubert: wikikube: put kubernetes10[59-62] in production [puppet] - 10https://gerrit.wikimedia.org/r/982071 (https://phabricator.wikimedia.org/T353135) [12:32:14] (03PS1) 10Clément Goubert: wikikube: add kubernetes10[59-62] to LVS [puppet] - 10https://gerrit.wikimedia.org/r/982072 (https://phabricator.wikimedia.org/T353135) [12:37:05] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:43:39] (03PS2) 10Clément Goubert: kubernetes10[59-62]: add to devices.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) [12:47:19] (03CR) 10Hnowlan: [C: 03+1] restbase: add missing keys & certs, remove obsolete [labs/private] - 10https://gerrit.wikimedia.org/r/981601 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [12:48:07] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1005 is CRITICAL: 
CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [12:51:41] (03CR) 10Cathal Mooney: [C: 03+1] "Nice! Removing those huge dicts is very pleasing to the eye :)" [homer/public] - 10https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [12:54:05] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1005 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [12:54:17] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:31] (03CR) 10Clément Goubert: [C: 03+2] Provide python3-bookworm image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry) [12:55:24] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] Provide python3-bookworm image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry) [12:56:29] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:21] !log Rebuilding production-images for python3-build-bookworm - T352733 [12:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:25] T352733: Provide python3-build-bookworm docker image - https://phabricator.wikimedia.org/T352733 [13:01:43] 
RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:27] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:10] (03CR) 10Filippo Giunchedi: [C: 04-1] grafana: add dashboard datasource usage (graphite) exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [13:04:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [13:05:12] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [13:06:58] (03Abandoned) 10Filippo Giunchedi: Override Cumin batch sleep+size from command line [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/719470 (owner: 10Filippo Giunchedi) [13:11:21] (03CR) 10Volans: [C: 04-1] "I did a first pass. There are still a lot of references to the server part and is missing a bunch of refactoring needed for the split. See" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [13:11:56] 10SRE-swift-storage, 10Observability-Metrics, 10Grafana: Disk space thanos-be1001:9100 alert - https://phabricator.wikimedia.org/T353091 (10fgiunchedi) >>! In T353091#9395917, @MatthewVernon wrote: > @fgiunchedi are you in a position to reduce some thanos disk usage/retention? Most swift drives are 93/4% ful... 
[13:12:48] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [13:13:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [13:14:09] (PS1) Arnaudb: mariadb: decommission db1138 [puppet] - https://gerrit.wikimedia.org/r/981440 (https://phabricator.wikimedia.org/T350458) [13:14:58] (CR) Cathal Mooney: [C: +1] "LGTM - One nit in-line but all code looks good to me :)" [software/homer/deploy] - https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi) [13:17:37] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: decomission pre downtime [13:17:54] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: decomission pre downtime [13:18:04] (CR) Jelto: [C: +1] "lgtm now, thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/979912 (owner: EoghanGaffney) [13:18:33] (CR) Klausman: [C: +1] python-webapp: update mesh and base modules [deployment-charts] - https://gerrit.wikimedia.org/r/980904 (owner: Elukey) [13:20:09] (CR) Marostegui: [C: +1] mariadb: decommission db1138 [puppet] - https://gerrit.wikimedia.org/r/981440 (https://phabricator.wikimedia.org/T350458) (owner: Arnaudb) [13:20:28] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1138.eqiad.wmnet [13:20:37] (CR) Arnaudb: [C: +2] mariadb: decommission db1138 [puppet] - https://gerrit.wikimedia.org/r/981440 (https://phabricator.wikimedia.org/T350458) (owner: Arnaudb) [13:22:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'decommission db1138', diff saved to https://phabricator.wikimedia.org/P54328 and previous config saved to /var/cache/conftool/dbconfig/20231211-132250-arnaudb.json [13:25:32] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [13:25:38] ops-eqiad, decommission-hardware: decommission db1138.eqiad.wmnet - https://phabricator.wikimedia.org/T353148 (ABran-WMF) In progress→Open [13:26:57] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:26:58] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db1138.eqiad.wmnet [13:27:04] ops-eqiad, decommission-hardware: decommission db1138.eqiad.wmnet - https://phabricator.wikimedia.org/T353148 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by arnaudb@cumin1001 for hosts: `db1138.eqiad.wmnet` - db1138.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - F... 
[13:28:42] (PS1) Marostegui: report_users: Remove 10.64.48.43 [software] - https://gerrit.wikimedia.org/r/982084 [13:29:22] (CR) JMeybohm: [C: +1] wikikube: put kubernetes10[59-62] in production [puppet] - https://gerrit.wikimedia.org/r/982071 (https://phabricator.wikimedia.org/T353135) (owner: Clément Goubert) [13:29:27] (CR) JMeybohm: [C: +1] wikikube: add kubernetes10[59-62] to LVS [puppet] - https://gerrit.wikimedia.org/r/982072 (https://phabricator.wikimedia.org/T353135) (owner: Clément Goubert) [13:29:44] (CR) Marostegui: [C: +2] report_users: Remove 10.64.48.43 [software] - https://gerrit.wikimedia.org/r/982084 (owner: Marostegui) [13:30:20] (Merged) jenkins-bot: report_users: Remove 10.64.48.43 [software] - https://gerrit.wikimedia.org/r/982084 (owner: Marostegui) [13:39:28] (PS1) Filippo Giunchedi: Move to standard rsyslog-rotate shared script [puppet] - https://gerrit.wikimedia.org/r/982085 (https://phabricator.wikimedia.org/T351710) [13:41:12] (CR) Filippo Giunchedi: "The kafkatee logrotate is causing errors on centrallog where we upgraded rsyslog:" [puppet] - https://gerrit.wikimedia.org/r/982085 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi) [13:41:21] (PS1) LSobanski: Switch alerts deployment source to GitLab [puppet] - https://gerrit.wikimedia.org/r/982086 (https://phabricator.wikimedia.org/T349626) [13:42:40] (CR) LSobanski: [C: -2] "Prerequisites are not met yet so blocking for now." 
[puppet] - https://gerrit.wikimedia.org/r/982086 (https://phabricator.wikimedia.org/T349626) (owner: LSobanski) [13:45:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:45:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:48:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:49:10] (PS7) Ayounsi: Expose Netbox's BGP servers to Homer [software/homer/deploy] - https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) [13:49:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:50:58] (CR) Ayounsi: Expose Netbox's BGP servers to Homer (5 comments) [software/homer/deploy] - https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi) [13:53:22] (CR) MVernon: [C: +1] "Thanks, I think I follow how this all works!" 
[puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) (owner: 10Filippo Giunchedi) [13:54:06] (03PS9) 10Ladsgroup: Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) [13:56:53] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [13:58:43] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:58:50] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:15] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [14:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1400) [14:00:06] Dreamy_Jazz and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:09] (03PS4) 10Anzx: hewikivoyage: update vector 2022 wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981726 (https://phabricator.wikimedia.org/T351981) [14:00:20] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:28] I can deploy [14:00:33] \o [14:00:35] o/ [14:01:01] wait one [14:02:07] Dreamy_Jazz: starting with yours [14:02:18] I will be unable to test my patch as I don't have CU rights on the wikis having the change enabled (testwiki already has the change enabled). I have informed checkusers about the change on checkuser-l. 
[14:02:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979986 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:02:20] (03PS1) 10Clément Goubert: prometheus-php-fpm-exporter: fix build script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982089 [14:02:22] (03PS1) 10Clément Goubert: Fix some Build-Depends [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982090 [14:03:31] (03Merged) 10jenkins-bot: Enable read new on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979986 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:03:48] !log samtar@deploy2002 Started scap: Backport for [[gerrit:979986|Enable read new on group0 wikis (T341829)]] [14:04:00] T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829 [14:04:41] Dreamy_Jazz: did the testwiki deploy go okay? [14:04:44] Yes [14:05:02] !log samtar@deploy2002 samtar and dreamyjazz: Backport for [[gerrit:979986|Enable read new on group0 wikis (T341829)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:05:05] !log samtar@deploy2002 samtar and dreamyjazz: Continuing with sync [14:05:36] Then we'll continue and watch for issues :-) [14:05:59] Thanks. I intend to monitor logstash. [14:06:22] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:06:23] (03PS5) 10Samtar: hewikivoyage: update vector 2022 wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981726 (https://phabricator.wikimedia.org/T351981) (owner: 10Anzx) [14:10:29] Hi, I forgot again that backport window is now. Do we have some time to give https://phabricator.wikimedia.org/T350431 another try? 
:) [14:11:35] Kizule: quite probably, just got anzx's patch to do next [14:11:46] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:979986|Enable read new on group0 wikis (T341829)]] (duration: 07m 57s) [14:11:51] At least on srwikisource and such smaller projects firstly. :) [14:11:51] Dreamy_Jazz: deployed [14:11:51] T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829 [14:11:56] Thanks! [14:12:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981726 (https://phabricator.wikimedia.org/T351981) (owner: 10Anzx) [14:12:29] 721-725 and then Serbian Wikipedia if everything works out fine. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/dblists/s3.dblist#721 [14:12:37] I'll add task in Deployments page. [14:13:32] Done [14:14:03] (03PS1) 10Vgutierrez: hiera: Enable IPIP encapsulation on text|secondary LVS in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982096 (https://phabricator.wikimedia.org/T351069) [14:14:15] (03Merged) 10jenkins-bot: hewikivoyage: update vector 2022 wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981726 (https://phabricator.wikimedia.org/T351981) (owner: 10Anzx) [14:14:28] !log samtar@deploy2002 Started scap: Backport for [[gerrit:981726|hewikivoyage: update vector 2022 wordmark and tagline (T351981)]] [14:14:32] T351981: Change Hebrew Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T351981 [14:15:44] !log samtar@deploy2002 samtar and anzx: Backport for [[gerrit:981726|hewikivoyage: update vector 2022 wordmark and tagline (T351981)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:15:46] TheresNoTime: checking [14:15:54] ack [14:16:50] TheresNoTime: looks good [14:16:59] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: db1138.eqiad.wmnet - arnaudb@cumin1001" [14:17:03] !log samtar@deploy2002 samtar and anzx: Continuing with sync [14:17:06] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/982096 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:18:19] 10SRE-swift-storage: Q2 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (10MatthewVernon) [14:18:54] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1138.eqiad.wmnet - arnaudb@cumin1001" [14:18:54] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:19:35] 10SRE-swift-storage: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (10MatthewVernon) [14:20:05] TheresNoTime: mwscript namespaceDupes.php srwikibooks and so on. Firstly without --fix, so I can check output. And after that, if everything looks good, with --fix. :) [14:20:20] Kizule: ack, will do :) [14:20:37] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/dblists/s3.dblist#721 to 725, and then Serbian Wikipedia (srwiki). [14:20:39] Thanks! [14:21:36] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10MatthewVernon) [14:23:20] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [14:23:42] (03Abandoned) 10D3r1ck01: ClusterConfig: Followup on I955168f072315e0064c69a66483e61dfc23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981954 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [14:25:03] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:981726|hewikivoyage: update vector 2022 wordmark and tagline (T351981)]] (duration: 10m 35s) [14:25:05] anzx: live :) [14:25:10] T351981: Change Hebrew Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T351981 [14:25:25] Kizule: starting with `mwscript namespaceDupes.php srwikibooks` [14:25:43] Okay :) [14:25:44] `Unsafe to run at this time. See: T350443` [14:25:45] T350443: namespaceDupes.php doesn't have limit on write queries - https://phabricator.wikimedia.org/T350443 [14:26:29] (investigating) [14:26:46] Duh, it's not supposed to be there anymore. [14:27:12] It's not in master anymore at least https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/maintenance/namespaceDupes.php [14:27:12] (03PS1) 10Arnaudb: mariadb: remove db1141 db1142 db1143 [puppet] - 10https://gerrit.wikimedia.org/r/981441 (https://phabricator.wikimedia.org/T350458) [14:27:39] it's not been backported, https://phabricator.wikimedia.org/T350443#9379293 [14:27:40] (03CR) 10Marostegui: [C: 03+1] mariadb: remove db1141 db1142 db1143 [puppet] - 10https://gerrit.wikimedia.org/r/981441 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [14:29:11] Kizule: easiest way of dealing with this would be to wait until later in the week [14:30:52] TheresNoTime: Ok, let's give it a try in UTC late backport window on Thursday, after MW train. 
[14:30:58] TheresNoTime: it still displays old logo maybe ```run echo 'https://en.wikipedia.org/static/images/mobile/copyright/wikivoyage-wordmark-he.svg' | mwscript purgeList.php``` [14:31:05] anzx: ack [14:31:45] anzx: done [14:32:38] TheresNoTime: I think it should be done for tagline also [14:33:08] Never mind now it appears correct, thanks [14:33:25] Alright, I did a reschedule. :) [14:33:32] Kizule: okay, sorry! :) [14:33:44] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:33:46] No problem :) [14:36:33] (03PS1) 10Milimetric: aqs: update mw history snapshot probably last time [puppet] - 10https://gerrit.wikimedia.org/r/982097 [14:36:50] (03PS1) 10Ottomata: changeprop - bump image version to discard canary events [deployment-charts] - 10https://gerrit.wikimedia.org/r/982098 (https://phabricator.wikimedia.org/T351247) [14:37:03] !log close UTC afternoon backport window [14:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:36] (03CR) 10Herron: [C: 03+1] Move to standard rsyslog-rotate shared script [puppet] - 10https://gerrit.wikimedia.org/r/982085 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi) [14:37:41] (03PS1) 10Milimetric: edit*-analytics: update mediawiki_history snapshot version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982099 [14:39:10] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:15] (03CR) 10Btullis: [C: 03+1] aqs: update mw history snapshot probably last time [puppet] - 10https://gerrit.wikimedia.org/r/982097 (owner: 10Milimetric) [14:39:18] (03CR) 10Btullis: [C: 03+2] aqs: update mw history snapshot 
probably last time [puppet] - 10https://gerrit.wikimedia.org/r/982097 (owner: 10Milimetric) [14:40:22] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/982099 (owner: 10Milimetric) [14:42:11] (03CR) 10Ottomata: [C: 03+2] changeprop - bump image version to discard canary events [deployment-charts] - 10https://gerrit.wikimedia.org/r/982098 (https://phabricator.wikimedia.org/T351247) (owner: 10Ottomata) [14:43:07] (03Merged) 10jenkins-bot: changeprop - bump image version to discard canary events [deployment-charts] - 10https://gerrit.wikimedia.org/r/982098 (https://phabricator.wikimedia.org/T351247) (owner: 10Ottomata) [14:44:08] (03CR) 10Milimetric: [C: 03+2] edit*-analytics: update mediawiki_history snapshot version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982099 (owner: 10Milimetric) [14:45:07] (03Merged) 10jenkins-bot: edit*-analytics: update mediawiki_history snapshot version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982099 (owner: 10Milimetric) [14:45:11] !log deploying changeprop to pick up https://phabricator.wikimedia.org/T351247 [14:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:57] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [14:46:39] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [14:47:38] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore2006.codfw.wmnet with OS bullseye [14:47:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host sessionstore2006.codfw.wmnet with OS bullseye [14:48:19] (03CR) 10Arnaudb: [C: 03+1] "will be handy!" 
[puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup) [14:48:41] (03CR) 10Arnaudb: [C: 03+2] mariadb: remove db1141 db1142 db1143 [puppet] - 10https://gerrit.wikimedia.org/r/981441 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb) [14:48:48] !log milimetric@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [14:48:55] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [14:49:06] !log milimetric@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [14:49:35] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [14:50:36] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [14:50:41] !log milimetric@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [14:51:06] (03PS1) 10JMeybohm: Bump cert-manager to 1.10.1-2 (bullseye) [deployment-charts] - 10https://gerrit.wikimedia.org/r/982100 (https://phabricator.wikimedia.org/T351933) [14:51:10] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [14:51:17] !log milimetric@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [14:51:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1141.eqiad.wmnet [14:52:07] !log milimetric@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [14:52:09] (03PS1) 10JMeybohm: Revert "cert-manager: bump version in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981735 [14:52:21] !log milimetric@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [14:52:39] (03PS2) 10JMeybohm: Revert "cert-manager: bump version in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981735 (https://phabricator.wikimedia.org/T351933) [14:53:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'decommission db1141 42 and 43', diff 
saved to https://phabricator.wikimedia.org/P54330 and previous config saved to /var/cache/conftool/dbconfig/20231211-145300-arnaudb.json [14:53:04] !log milimetric@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [14:53:34] !log milimetric@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [14:54:10] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:20] (03CR) 10Ayounsi: "thx good catch on the proper servers locations." [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert) [14:56:41] !log milimetric@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [14:56:53] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [14:57:14] !log milimetric@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [14:57:23] !log milimetric@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [14:57:41] !log milimetric@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [15:01:16] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1141.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:03:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1141.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:03:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:03:26] !log arnaudb@cumin1001 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts db1141.eqiad.wmnet [15:04:32] (03CR) 10Cathal Mooney: "LGTM when CI is happy" [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (owner: 10Ayounsi) [15:04:52] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2006.codfw.wmnet with reason: host reimage [15:05:09] (03PS3) 10Ottomata: Enable canary events for all MediaWiki event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798) [15:05:30] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:06:19] (03PS1) 10Filippo Giunchedi: alertmanager: add sink notifications capability [puppet] - 10https://gerrit.wikimedia.org/r/982103 (https://phabricator.wikimedia.org/T353060) [15:06:37] (03PS4) 10Ottomata: Enable canary events for all MediaWiki event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798) [15:08:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2006.codfw.wmnet with reason: host reimage [15:08:33] (03PS3) 10Filippo Giunchedi: swift: write to local files and ban before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) [15:08:40] (03CR) 10Filippo Giunchedi: swift: write to local files and ban before centrallog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) (owner: 10Filippo Giunchedi) [15:08:50] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: write to local files and ban before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) (owner: 10Filippo Giunchedi) [15:09:44] (03CR) 10MVernon: [C: 04-1] "One thing 
that looks a bit strange to me here, but perhaps I misunderstand..." [labs/private] - 10https://gerrit.wikimedia.org/r/981601 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [15:09:57] (03CR) 10Andrew Bogott: [C: 03+1] P:mail: use wmcloud.org instead of wmflabs.org in envelopes [puppet] - 10https://gerrit.wikimedia.org/r/981635 (owner: 10Majavah) [15:10:08] (03PS1) 10Dreamy Jazz: CheckUser: Enable read new for event tables migration on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982105 (https://phabricator.wikimedia.org/T341829) [15:10:48] 10ops-eqiad, 10decommission-hardware: decommission db1141.eqiad.wmnet - https://phabricator.wikimedia.org/T353152 (10ABran-WMF) a:05ABran-WMF→03None [15:10:59] 10ops-eqiad, 10decommission-hardware: decommission db1138.eqiad.wmnet - https://phabricator.wikimedia.org/T353148 (10ABran-WMF) a:05ABran-WMF→03None [15:11:04] (03CR) 10Elukey: [C: 03+1] Revert "cert-manager: bump version in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981735 (https://phabricator.wikimedia.org/T351933) (owner: 10JMeybohm) [15:11:44] (03CR) 10Elukey: [C: 03+1] Add new istio module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981332 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [15:12:22] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1142.eqiad.wmnet [15:13:09] (03PS1) 10Brouberol: Fix: make sure to generate a TLS certificate for the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/982106 (https://phabricator.wikimedia.org/T352639) [15:13:49] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/982096 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:14:14] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:14:57] (03CR) 10Fabfur: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/982063 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:17:39] (03PS2) 10Brouberol: admin_ng: fix gateway TLS setting for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/982106 (https://phabricator.wikimedia.org/T352639) [15:17:42] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP encapsulation on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:17:46] (03CR) 10Elukey: [C: 03+1] "The new annotation looks good, but not 100% clear why we need it, buuut it is informative so I trust your judgement!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982100 (https://phabricator.wikimedia.org/T351933) (owner: 10JMeybohm) [15:17:50] (03CR) 10Vgutierrez: [V: 03+1] hiera: Enable IPIP encapsulation on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:17:57] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Disable rp_filter on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982063 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:18:49] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [15:19:28] (03CR) 10Elukey: "Thanks! Do we need a version bump since those images were built? Or it is just to allow rebuilds? 
I don't have a strong opinion, just rais" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982090 (owner: 10Clément Goubert) [15:20:32] (03PS3) 10Brouberol: admin_ng: fix gateway TLS setting for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/982106 (https://phabricator.wikimedia.org/T352639) [15:20:54] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1142.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:21:07] (03CR) 10Elukey: [C: 03+1] admin_ng: fix gateway TLS setting for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/982106 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [15:21:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 37.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:21:48] (03CR) 10JMeybohm: Bump cert-manager to 1.10.1-2 (bullseye) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/982100 (https://phabricator.wikimedia.org/T351933) (owner: 10JMeybohm) [15:21:59] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1142.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:21:59] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:21:59] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1142.eqiad.wmnet [15:22:58] 10ops-eqiad, 10decommission-hardware: decommission db1142.eqiad.wmnet - https://phabricator.wikimedia.org/T353154 (10ABran-WMF) [15:23:03] 
!log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:23:10] (CR) Brouberol: [C: +2] admin_ng: fix gateway TLS setting for dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/982106 (https://phabricator.wikimedia.org/T352639) (owner: Brouberol) [15:23:36] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/982110 (https://phabricator.wikimedia.org/T128546) [15:23:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:24:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:24:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:24:42] (CR) JMeybohm: [C: +2] Bump cert-manager to 1.10.1-2 (bullseye) [deployment-charts] - https://gerrit.wikimedia.org/r/982100 (https://phabricator.wikimedia.org/T351933) (owner: JMeybohm) [15:25:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:25:12] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1143.eqiad.wmnet [15:25:55] !log provisioning TLS certificates for the spark-history and spark-history-test namespaces in dse-k8s-eqiad - T352639 [15:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:59] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [15:26:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cephosd2002.mgmt.codfw.wmnet with reboot policy FORCED [15:27:15] (PS1) Andrew Bogott: Horizon: backport 598bfa3aabe9cf2c1d09f58d4a0745462e80b1bc to 'zed' [puppet] - https://gerrit.wikimedia.org/r/982111 (https://phabricator.wikimedia.org/T326818) [15:27:23] (PS1) Filippo Giunchedi: swift: fix double-logging of proxy-server access logs [puppet] - https://gerrit.wikimedia.org/r/982112 (https://phabricator.wikimedia.org/T352968) [15:27:55] (Merged) jenkins-bot: Bump cert-manager to 1.10.1-2 (bullseye) [deployment-charts] - https://gerrit.wikimedia.org/r/982100 (https://phabricator.wikimedia.org/T351933) (owner: JMeybohm) [15:28:04] (CR) Andrew Bogott: [C: +2] Horizon: backport 598bfa3aabe9cf2c1d09f58d4a0745462e80b1bc to 'zed' [puppet] - https://gerrit.wikimedia.org/r/982111 (https://phabricator.wikimedia.org/T326818) (owner: Andrew Bogott) [15:28:18] (CR) JMeybohm: [C: +1] "Thanks.
I don't think we technically need a version bump" [docker-images/production-images] - https://gerrit.wikimedia.org/r/982090 (owner: Clément Goubert) [15:30:31] SRE-tools, Infrastructure-Foundations: Diffscan: host off-infra - https://phabricator.wikimedia.org/T265595 (joanna_borun) p:Triage→Low [15:30:31] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [15:30:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd2002.mgmt.codfw.wmnet with reboot policy FORCED [15:30:46] (CR) Filippo Giunchedi: [C: +2] swift: fix double-logging of proxy-server access logs [puppet] - https://gerrit.wikimedia.org/r/982112 (https://phabricator.wikimedia.org/T352968) (owner: Filippo Giunchedi) [15:31:14] (CR) Elukey: "Looks good! I left a comment since I got lost in one bit of the change, be patient :)" [deployment-charts] - https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [15:32:09] jouncebot: nowandnext [15:32:09] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [15:32:09] In 0 hour(s) and 57 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1630) [15:32:32] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1143.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:32:53] (CR) Clément Goubert: [C: +2] wikikube: put kubernetes10[59-62] in production [puppet] - https://gerrit.wikimedia.org/r/982071 (https://phabricator.wikimedia.org/T353135) (owner: Clément Goubert) [15:33:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2002.codfw.wmnet with OS bullseye [15:33:20] SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934
(ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye [15:33:33] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1143.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:33:33] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:33:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1143.eqiad.wmnet [15:34:31] (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [15:34:48] ops-eqiad, decommission-hardware: decommission db1143.eqiad.wmnet - https://phabricator.wikimedia.org/T353156 (ABran-WMF) a:ABran-WMF→None [15:36:55] (CR) JMeybohm: ingress.istio: Remove trust for every SAN but the default (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [15:38:25] (CR) Elukey: [C: +1] ingress.istio: Remove trust for every SAN but the default (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [15:38:31] (CR) Elukey: [C: +1] function-orchestrator: Update to ingress.istio:1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/981336 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [15:39:48] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:39:49] !log jclark@cumin1001 END (PASS) -
Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2006.codfw.wmnet with OS bullseye [15:39:59] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host sessionstore2006.codfw.wmnet with OS bullseye completed: - sessionstore... [15:40:19] (CR) Ayounsi: [C: +1] kubernetes10[59-62]: add to devices.yaml [homer/public] - https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: Clément Goubert) [15:40:29] (CR) JMeybohm: [C: +2] function-orchestrator: Update to ingress.istio:1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/981336 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [15:40:33] (CR) JMeybohm: [C: +2] ingress.istio: Remove trust for every SAN but the default [deployment-charts] - https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [15:40:37] (CR) JMeybohm: [C: +2] Add new istio module version [deployment-charts] - https://gerrit.wikimedia.org/r/981332 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [15:41:04] (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/981735 (https://phabricator.wikimedia.org/T351933) (owner: JMeybohm) [15:41:07] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye [15:41:07] (CR) Clément Goubert: [C: +2] kubernetes10[59-62]: add to devices.yaml [homer/public] - https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: Clément Goubert) [15:41:08] SRE-tools, Infrastructure-Foundations: Introduce Spicerack.kafka module, along with the method to transfer offset state between consumer groups and clusters - https://phabricator.wikimedia.org/T291681 (joanna_borun)
Open→Resolved p:Triage→Medium [15:41:14] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host sessionstore2005.codfw.wmnet with OS bullseye [15:41:31] (Merged) jenkins-bot: Add new istio module version [deployment-charts] - https://gerrit.wikimedia.org/r/981332 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [15:41:50] (Merged) jenkins-bot: ingress.istio: Remove trust for every SAN but the default [deployment-charts] - https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [15:42:04] (PS1) Brouberol: Register ingress CNAME record for the echoserver-dse-k8s-eqiad service [dns] - https://gerrit.wikimedia.org/r/982116 [15:42:08] (Merged) jenkins-bot: function-orchestrator: Update to ingress.istio:1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/981336 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [15:42:19] PROBLEM - Check systemd state on kubernetes1026 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:47] (Merged) jenkins-bot: kubernetes10[59-62]: add to devices.yaml [homer/public] - https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: Clément Goubert) [15:42:57] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1026 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:43:09] SRE, Infrastructure-Foundations, Patch-For-Review: admin: Add validation checks for missing realname and email in data.yaml - https://phabricator.wikimedia.org/T320937 (joanna_borun) a:jhathaway
[15:44:43] RECOVERY - Check systemd state on kubernetes1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:48] (CR) Vgutierrez: [V: +1 C: +2] hiera: Enable IPIP encapsulation on ncredir@eqiad [puppet] - https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez) [15:45:54] (PS2) Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@eqiad [puppet] - https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) [15:47:45] (PS15) Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) [15:48:43] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:49:13] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:50:08] (CR) JMeybohm: [C: +2] Revert "cert-manager: bump version in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/981735 (https://phabricator.wikimedia.org/T351933) (owner: JMeybohm) [15:51:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd2002.codfw.wmnet with reason: host reimage [15:52:00] (CR) Elukey: changeprop: refactor templating for Kafka producer/consumer settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: Elukey) [15:52:45] (Merged) jenkins-bot: Revert "cert-manager: bump version in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/981735 (https://phabricator.wikimedia.org/T351933) (owner: JMeybohm) [15:53:11] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:53:28] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:53:47] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:53:55] SRE, Cloud-VPS, observability, Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi) [15:54:17] !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:54:52] SRE-swift-storage, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q2), User-fgiunchedi: Stop sending swift access logs to centrallog for non state-changing requests - https://phabricator.wikimedia.org/T352968 (fgiunchedi) Open→Resolved a:fgiunchedi This is done! W... [15:54:54] (PS1) EoghanGaffney: [apt-staging] Deploy gitlab-package-puller script [puppet] - https://gerrit.wikimedia.org/r/982119 (https://phabricator.wikimedia.org/T347004) [15:55:01] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:55:09] !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:55:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd2002.codfw.wmnet with reason: host reimage [15:55:42] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:55:50] !log homer lsw1-*eqiad* commit "Put kubernetes10[59-62] in production - T353135" [15:55:51] !log jayme@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:54] T353135: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 [15:56:48] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2005.codfw.wmnet with reason: host reimage [15:57:07] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:57:45] (CR) Filippo Giunchedi: [C: +2] Move to standard rsyslog-rotate shared script [puppet] - https://gerrit.wikimedia.org/r/982085 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi) [15:57:47] !log jayme@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:58:35] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:58:46] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:59:28] That's me ^ I think we need to fix the doc so we do the homer commit after the reimage [15:59:41] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:00:22] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:00:57] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:01:38] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:01:52] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[16:01:59] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1059.eqiad.wmnet with OS bullseye [16:02:09] SRE, serviceops, Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes1059.eqiad.wmnet with OS bullseye [16:02:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2005.codfw.wmnet with reason: host reimage [16:02:37] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1060.eqiad.wmnet with OS bullseye [16:02:47] SRE, serviceops, Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes1060.eqiad.wmnet with OS bullseye [16:03:10] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1061.eqiad.wmnet with OS bullseye [16:03:22] SRE, serviceops, Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes1061.eqiad.wmnet with OS bullseye [16:03:42] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1062.eqiad.wmnet with OS bullseye [16:03:52] SRE, serviceops, Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes1062.eqiad.wmnet with OS bullseye [16:04:30] (CR) Ottomata: [C: +2] Enable canary events for all MediaWiki event streams [mediawiki-config] - https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798) (owner: Ottomata) [16:05:03] (CR)
Vgutierrez: [V: +1 C: +2] hiera: Enable IPIP encapsulation on text|secondary LVS in eqiad [puppet] - https://gerrit.wikimedia.org/r/982096 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez) [16:05:16] XioNoX: About the BGP alerts because I messed up reimage/homer order, should I rollback homer changes or are we ok with them alerting until the cookbook runs its course? [16:05:17] (Merged) jenkins-bot: Enable canary events for all MediaWiki event streams [mediawiki-config] - https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798) (owner: Ottomata) [16:05:35] claime: it's fine, no worries [16:05:44] Awesome. I'll change the doc, thanks [16:05:46] SRE, Infrastructure-Foundations, Mail: Postfix MTA Profile - https://phabricator.wikimedia.org/T325398 (jhathaway) p:Triage→Low [16:05:57] SRE, Infrastructure-Foundations, Mail: Provision mta-inbound-infra - https://phabricator.wikimedia.org/T325401 (jhathaway) p:Triage→Low [16:06:03] SRE, Infrastructure-Foundations, Mail: Provision mta-outbound-infra - https://phabricator.wikimedia.org/T325402 (jhathaway) p:Triage→Low [16:06:14] SRE, Infrastructure-Foundations, Mail: Provision mta-outbound-wiki - https://phabricator.wikimedia.org/T325407 (jhathaway) p:Triage→Low [16:06:20] (CR) Eevans: restbase: add missing keys & certs, remove obsolete (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/981601 (https://phabricator.wikimedia.org/T352468) (owner: Eevans) [16:06:21] SRE, Infrastructure-Foundations, Mail: Provision mta-inbound-wiki - https://phabricator.wikimedia.org/T325406 (jhathaway) p:Triage→Low [16:06:27] (PS4) Eevans: restbase: add missing keys & certs, remove obsolete [labs/private] - https://gerrit.wikimedia.org/r/981601 (https://phabricator.wikimedia.org/T352468) [16:06:29] SRE, Infrastructure-Foundations, Mail: Replace Null client configs -
https://phabricator.wikimedia.org/T325408 (jhathaway) p:Triage→Low [16:06:39] SRE, Infrastructure-Foundations, Mail: Remove Exim based MTAs - https://phabricator.wikimedia.org/T325409 (jhathaway) p:Triage→Low [16:06:52] (PS1) Ottomata: Revert accidental portals submodule change [mediawiki-config] - https://gerrit.wikimedia.org/r/982122 (https://phabricator.wikimedia.org/T266798) [16:07:51] (CR) Ottomata: [C: +2] Revert accidental portals submodule change [mediawiki-config] - https://gerrit.wikimedia.org/r/982122 (https://phabricator.wikimedia.org/T266798) (owner: Ottomata) [16:09:13] (CR) Btullis: [C: +1] "Looks good, thanks." [dns] - https://gerrit.wikimedia.org/r/982116 (owner: Brouberol) [16:09:16] (Merged) jenkins-bot: Revert accidental portals submodule change [mediawiki-config] - https://gerrit.wikimedia.org/r/982122 (https://phabricator.wikimedia.org/T266798) (owner: Ottomata) [16:10:02] (CR) Brouberol: [C: +2] Register ingress CNAME record for the echoserver-dse-k8s-eqiad service [dns] - https://gerrit.wikimedia.org/r/982116 (owner: Brouberol) [16:10:34] !log enabling canary events for all mediawiki state change event streams - T266798 [16:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:42] T266798: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 [16:13:10] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1026 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:13:21] !log rolling restart of pybal on lvs1020 and lvs1017 effectively enabling IPIP encapsulation on ncredir@eqiad - T351069 [16:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:30] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [16:15:05] !log jhancock@cumin2002
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:16:02] SRE, Traffic, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez) Open→Resolved [16:16:04] SRE, Traffic, Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (Vgutierrez) [16:16:31] SRE-swift-storage, Infrastructure-Foundations: unstable device mapping of SSDs causing swift/puppet problems - example reimage - https://phabricator.wikimedia.org/T308644 (Volans) @MatthewVernon is there still anything pending from I/F on this task or can be resolved in light of the follow up work done i... [16:16:54] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1059.eqiad.wmnet with reason: host reimage [16:17:26] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1060.eqiad.wmnet with reason: host reimage [16:18:05] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1061.eqiad.wmnet with reason: host reimage [16:18:51] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1062.eqiad.wmnet with reason: host reimage [16:18:54] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:19:06] !log otto@deploy2002 Synchronized wmf-config/ext-EventStreamConfig.php: Config: [[gerrit:968344|Enable canary events for all MediaWiki event streams (T266798)]] (duration: 08m 25s) [16:19:10] T266798: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 [16:19:11] (PS1) Vgutierrez: hiera: Unify ncredir IPIP encapsulation settings [puppet] -
https://gerrit.wikimedia.org/r/982124 (https://phabricator.wikimedia.org/T351069) [16:19:53] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1059.eqiad.wmnet with reason: host reimage [16:20:06] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:20:39] SRE-tools, Infrastructure-Foundations, Wikimedia-Mailing-lists, serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (Volans) a:cmooney Assigning to Cathal as per meeting discussion. [16:21:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:21:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:21:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2005.codfw.wmnet with OS bullseye [16:21:16] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host sessionstore2005.codfw.wmnet with OS bullseye completed: - sessionstore...
[16:21:46] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/873/console" [puppet] - https://gerrit.wikimedia.org/r/982124 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez) [16:22:55] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye [16:22:57] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1062.eqiad.wmnet with reason: host reimage [16:23:04] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host sessionstore2004.codfw.wmnet with OS bullseye [16:23:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:25:33] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1061.eqiad.wmnet with reason: host reimage [16:26:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:26:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd2002.codfw.wmnet with OS bullseye [16:26:38] SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for
host cephosd2002.codfw.wmnet with OS bullseye completed: - cephosd2002 (... [16:27:26] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1060.eqiad.wmnet with reason: host reimage [16:28:55] (CR) Fabfur: [C: +1] hiera: Unify ncredir IPIP encapsulation settings [puppet] - https://gerrit.wikimedia.org/r/982124 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez) [16:29:14] (CR) Vgutierrez: [V: +1 C: +2] hiera: Unify ncredir IPIP encapsulation settings [puppet] - https://gerrit.wikimedia.org/r/982124 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez) [16:29:34] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:30:05] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1630). [16:32:34] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:36] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:35:29] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/982110 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [16:35:31] ACKNOWLEDGEMENT - MD RAID on kubernetes1060 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.131.26.
Check system logs on 10.64.131.26 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T353165 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:35:36] (CR) Majavah: [V: +1 C: +2] netops: prometheus::hosts: also probe ipv6 if available [puppet] - https://gerrit.wikimedia.org/r/981358 (https://phabricator.wikimedia.org/T163996) (owner: Majavah) [16:35:36] SRE, ops-eqiad: Degraded RAID on kubernetes1060 - https://phabricator.wikimedia.org/T353165 (ops-monitoring-bot) [16:35:42] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:36:27] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/982110 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [16:37:05] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:37:30] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:37:54] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:38:52] (PS1) Arnaudb: mariadb: add db1247 to instances [puppet] - https://gerrit.wikimedia.org/r/981443 (https://phabricator.wikimedia.org/T344036) [16:39:14] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage [16:40:14] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0,
shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:40:58] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:41:30] (CR) Majavah: [V: +1 C: +2] P:mail: use wmcloud.org instead of wmflabs.org in envelopes [puppet] - https://gerrit.wikimedia.org/r/981635 (owner: Majavah) [16:41:50] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:42:39] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage [16:43:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.05% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:43:41] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1059.eqiad.wmnet with OS bullseye [16:43:51] SRE, serviceops, Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes1059.eqiad.wmnet with OS bullseye completed: - kubernetes1059 (**WARN**) - Down...
[16:47:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:47:31] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1062.eqiad.wmnet with OS bullseye [16:47:41] 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes1062.eqiad.wmnet with OS bullseye completed: - kubernetes1062 (**WARN**) - Down... [16:49:34] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1061.eqiad.wmnet with OS bullseye [16:49:43] 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes1061.eqiad.wmnet with OS bullseye completed: - kubernetes1061 (**WARN**) - Down... [16:50:04] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1060.eqiad.wmnet with OS bullseye [16:50:14] 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes1060.eqiad.wmnet with OS bullseye completed: - kubernetes1060 (**WARN**) - Down... 
[16:52:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:52:20] (03PS1) 10Jhancock.wm: Add testhost2001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/982132 (https://phabricator.wikimedia.org/T352703) [16:52:22] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/982132 (https://phabricator.wikimedia.org/T352703) (owner: 10Jhancock.wm) [16:53:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:55:35] (03CR) 10Jhancock.wm: [C: 03+2] Add testhost2001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/982132 (https://phabricator.wikimedia.org/T352703) (owner: 10Jhancock.wm) [16:56:25] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:982110| Bumping portals to master (T128546)]] (duration: 10m 12s) [16:56:29] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:56:40] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:57:23] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.02% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:57:29] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:58:14] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10Volans) Interesting... I guess we could try to do the same test with redfish API instead and see if that works all the time and consider convertin... [17:00:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [17:00:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2004.codfw.wmnet with OS bullseye [17:00:52] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host sessionstore2004.codfw.wmnet with OS bullseye completed: - sessionstore... 
[17:03:47] (03PS2) 10Ayounsi: Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) [17:04:40] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:982110| Bumping portals to master (T128546)]] (duration: 08m 15s) [17:04:45] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [17:05:29] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) @Andrew Dell is requesting smartctl output showing what drives errors are coming from if you can se... [17:05:41] (03CR) 10CI reject: [V: 04-1] Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) (owner: 10Ayounsi) [17:13:10] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:16:34] (03PS1) 10Jhancock.wm: Add testhost2001 to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/982137 (https://phabricator.wikimedia.org/T352703) [17:18:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:19:28] (03CR) 10Jhancock.wm: [C: 03+2] Add testhost2001 to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/982137 
(https://phabricator.wikimedia.org/T352703) (owner: 10Jhancock.wm) [17:20:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [17:21:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) 05Open→03Resolved @BTullis this is completed! [17:29:37] (03CR) 10Hnowlan: [C: 03+1] changeprop: refactor templating for Kafka producer/consumer settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [17:33:10] (03CR) 10Ottomata: "see recent ideas about kafka broker round robin DNS in https://phabricator.wikimedia.org/T213561#9391755" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [17:40:57] (03PS1) 10Bking: wdqs: Try icinga-based check instead of blackbox [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) [17:42:41] !log jayme@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [17:43:38] !log jayme@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [17:43:39] !log jayme@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [17:43:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:45:41] !log jayme@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [17:45:42] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:47:05] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:47:06] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[17:48:19] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:49:35] (03PS2) 10RLazarus: admin_ng: Add namespace and ClusterRole for Job sidecar controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284) [17:49:37] (03PS2) 10RLazarus: admin_ng: Switch on enableJobSidecarController for mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/981704 (https://phabricator.wikimedia.org/T348284) [17:52:11] (03CR) 10RLazarus: admin_ng: Add namespace and ClusterRole for Job sidecar controller (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1800) [18:00:04] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1800).
[18:15:39] (03CR) 10Jdlrobson: [C: 03+1] Make wiktionary and mw.org provide og:site_name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [18:29:37] (03PS13) 10Brouberol: Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) [18:29:45] (03PS1) 10Clément Goubert: mw-web: raise replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/982145 [18:30:10] (03CR) 10Clément Goubert: [C: 03+2] mw-web: raise replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/982145 (owner: 10Clément Goubert) [18:31:14] (03Merged) 10jenkins-bot: mw-web: raise replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/982145 (owner: 10Clément Goubert) [18:31:58] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:32:02] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:32:10] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:32:22] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:32:31] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:32:41] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:34:57] !log Raised replicas for mw-web [18:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:05] (03CR) 10Volans: Add retry logic to Netbox API (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) (owner: 10Ayounsi) [18:42:51] (03CR) 10Majavah: [V: 03+1 C: 03+2] hieradata: eqiad1: permit memcached access via cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/977600 (owner: 10Majavah) [18:51:55] (03CR) 10Eevans: [C: 03+2] restbase: set production role and add config for restbase2031 [puppet] - 
10https://gerrit.wikimedia.org/r/981605 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [18:52:22] (03CR) 10Herron: [V: 03+1] grafana: add dashboard datasource usage (graphite) exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [18:57:38] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:59:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:05:31] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:18:54] (03CR) 10Marostegui: [C: 03+1] mariadb: add db1247 to instances [puppet] - 10https://gerrit.wikimedia.org/r/981443 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [19:20:07] (03PS1) 10Jforrester: api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981737 (https://phabricator.wikimedia.org/T351237) [19:24:13] (03PS2) 10Pols12: Make wiktionary and mw.org provide og:site_name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) [19:26:02] (03CR) 10Pols12: Make wiktionary and mw.org provide og:site_name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12) [19:29:56] 10SRE, 10SRE-Access-Requests: Requesting access to Releng for sandeeps - https://phabricator.wikimedia.org/T353186 (10Sandeeps) [19:34:29] 10SRE, 
10SRE-Access-Requests: Requesting access to Releng for sandeeps - https://phabricator.wikimedia.org/T353186 (10thcipriani) Approved from me! [19:34:31] (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:36:52] PROBLEM - cassandra-b CQL 10.192.32.227:9042 on restbase2031 is CRITICAL: connect to address 10.192.32.227 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [19:39:20] PROBLEM - cassandra-b SSL 10.192.32.227:7000 on restbase2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [19:41:48] PROBLEM - cassandra-b service on restbase2031 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:42:06] (03CR) 10Dzahn: [C: 03+2] "per team chat today" [puppet] - 10https://gerrit.wikimedia.org/r/981591 (https://phabricator.wikimedia.org/T347355) (owner: 10Dzahn) [19:44:03] (03CR) 10Dzahn: [C: 03+1] "lgtm, just needs to be watched because experience is there are a lot of ways it can fail unexpectedly in the first attempt" [puppet] - 10https://gerrit.wikimedia.org/r/981951 (https://phabricator.wikimedia.org/T343517) (owner: 10Jelto) [19:44:14] PROBLEM - cassandra-c CQL 10.192.32.228:9042 on restbase2031 is CRITICAL: connect to address 10.192.32.228 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [19:46:07] (03CR) 10Dzahn: "re: puppet compiler, I think you'd have to run this on the wdqs backend rather than alert1001, because exported resources are used" [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:46:40] PROBLEM - 
cassandra-c SSL 10.192.32.228:7000 on restbase2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [19:47:34] (03PS2) 10Bking: wdqs: Try icinga-based check instead of blackbox [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) [19:49:10] PROBLEM - cassandra-c service on restbase2031 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:50:10] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:59:00] (03CR) 10Dzahn: [C: 03+1] "check_https_url_for_string!query.wikidata.org!/bigdata/ldf?subject=wd%3AQ42&predicate=wdt%3AP31&object=!wd:Q42 wdt:P31 wd:Q5 ."," [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [20:08:50] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group for ArthurTaylor - https://phabricator.wikimedia.org/T352653 (10KFrancis) Hello all, I am confirming the NDA is now complete. Please proceed with the access request. Thank you! 
[20:16:03] (03PS1) 10Dzahn: Switch planet to bookworm VM backends [dns] - 10https://gerrit.wikimedia.org/r/982156 (https://phabricator.wikimedia.org/T348392) [20:17:08] (03PS1) 10Dzahn: site: remove buster VMs from planet regex [puppet] - 10https://gerrit.wikimedia.org/r/982157 (https://phabricator.wikimedia.org/T348392) [20:37:06] (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:39:46] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:41:14] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:51:41] (03PS1) 10Jdrewniak: [Zebra] Fix scrolling behavior in dropdowns [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930) [20:52:59] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:54:04] (03PS1) 10Jdrewniak: [Vector] Deploy the Zebra CSS refactor under feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982162 (https://phabricator.wikimedia.org/T353008) [20:58:14] (03PS1) 10Ottomata: varnishkafka::instance - Add ensure param [puppet] - 10https://gerrit.wikimedia.org/r/982163 (https://phabricator.wikimedia.org/T238230) [21:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: (Dis)respected human, time to deploy UTC late 
backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T2100). Please do the needful. [21:00:07] jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:40] * jan_drewniak Looks like I'm the only one with a backport today, so I can do my own deploy. [21:02:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930) (owner: 10Jdrewniak) [21:07:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:21:11] (03CR) 10CI reject: [V: 04-1] [Zebra] Fix scrolling behavior in dropdowns [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930) (owner: 10Jdrewniak) [21:26:21] (03CR) 10Jdrewniak: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979986 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [21:27:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930) (owner: 10Jdrewniak) [21:28:46] (03CR) 10Jdrewniak: "recheck" [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930) (owner: 10Jdrewniak) [21:35:34] (03CR) 10Bking: [C: 03+2] wdqs: Try icinga-based check instead of blackbox [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [21:44:57] (03Merged) 
10jenkins-bot: [Zebra] Fix scrolling behavior in dropdowns [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930) (owner: 10Jdrewniak) [21:53:58] jouncebot: nowandnext [21:53:59] For the next 0 hour(s) and 6 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T2100) [21:53:59] In 0 hour(s) and 6 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T2200) [21:54:18] (03CR) 10Ladsgroup: [C: 03+2] api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981737 (https://phabricator.wikimedia.org/T351237) (owner: 10Jforrester) [21:56:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981737 (https://phabricator.wikimedia.org/T351237) (owner: 10Jforrester) [21:59:22] Amir1: hey, the backport was delayed due to a failed test, I rechecked and the test passed (like 45min after the original +2), so I still have this to sync this after you're done https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/981740 [21:59:38] oh sure thing [21:59:43] I thought it was over [21:59:44] my bad [22:00:02] np [22:00:05] Reedy, sbassett, Maryum, and manfredi: OwO what's this, a deployment window?? Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T2200). nyaa~ [22:01:56] PROBLEM - WDQS Linked Data Fragments Endpoint on wdqs1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string wd:Q42 wdt:P31 wd:Q5 . 
not found on https://query.wikidata.org:443/bigdata/ldf?subject=wd%3AQ42&predicate=wdt%3AP31&object= - 8890 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:03:22] (03CR) 10Ottomata: "No op https://puppet-compiler.wmflabs.org/output/982163/874/cp1102.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/982163 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [22:09:40] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 18:00:00 on wdqs1015.eqiad.wmnet with reason: T347355 [22:09:56] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on wdqs1015.eqiad.wmnet with reason: T347355 [22:09:59] T347355: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 [22:11:30] (03Merged) 10jenkins-bot: api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981737 (https://phabricator.wikimedia.org/T351237) (owner: 10Jforrester) [22:12:09] jan_drewniak: I think it'll deploy both at the same time [22:12:21] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:981737|api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery (T351237)]] [22:12:35] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [22:13:59] !log ladsgroup@deploy2002 jforrester and ladsgroup: Backport for [[gerrit:981737|api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:14:13] jan_drewniak: it's live in mwdebug [22:15:40] !log ladsgroup@deploy2002 jforrester and ladsgroup: Continuing with sync [22:17:03] (03PS1) 10Bking: wdqs: Change LDF monitoring URI [puppet] - 10https://gerrit.wikimedia.org/r/982172 (https://phabricator.wikimedia.org/T347355) 
[22:19:36] Amir1: thanks, I'll check it now [22:20:44] Amir1: my patch looks good to deploy, are you going to do the sync? [22:20:53] yup [22:21:12] ok thanks, I have one more config patch to deploy after this if that's ok [22:21:38] This one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/982162/ [22:23:03] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:981737|api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery (T351237)]] (duration: 10m 42s) [22:23:08] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [22:23:14] I'm done [22:25:02] Amir1: thanks! [22:25:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982162 (https://phabricator.wikimedia.org/T353008) (owner: 10Jdrewniak) [22:26:41] (03Merged) 10jenkins-bot: [Vector] Deploy the Zebra CSS refactor under feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982162 (https://phabricator.wikimedia.org/T353008) (owner: 10Jdrewniak) [22:26:55] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:982162|[Vector] Deploy the Zebra CSS refactor under feature flag (T353008)]] [22:27:00] T353008: Deploy Zebra everywhere - https://phabricator.wikimedia.org/T353008 [22:28:26] !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:982162|[Vector] Deploy the Zebra CSS refactor under feature flag (T353008)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:32:00] !log jdrewniak@deploy2002 jdrewniak: Continuing with sync [22:39:10] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:982162|[Vector] Deploy the Zebra CSS refactor under feature flag (T353008)]] (duration: 12m 14s) [22:39:14] T353008: Deploy Zebra everywhere - https://phabricator.wikimedia.org/T353008 [22:42:34] (KubernetesCalicoDown) firing: (4) kubernetes2007.codfw.wmnet is not running 
calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:43:45] (03PS2) 10Hashar: Add a banner for the 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109) [22:44:42] PROBLEM - SSH on kubemaster2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:46:34] PROBLEM - SSH on kubemaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:47:32] (03CR) 10Hashar: "PS2 adds a `[DISMISS]` button next to the link. On click that:" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109) (owner: 10Hashar) [22:47:34] (KubernetesCalicoDown) firing: (60) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:50:16] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2031:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2031 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:50:40] (KubernetesAPINotScrapable) firing: k8s@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:52:34] (KubernetesCalicoDown) firing: (67) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:53:22] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[22:53:40] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:54:18] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:54:22] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:55:16] (KubernetesRsyslogDown) firing: (17) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:55:31] (KubernetesRsyslogDown) firing: (17) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:58:08] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:58:50] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:58:54] RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:59:24] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:00:16] (KubernetesRsyslogDown) firing: (24) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:01:27] (03CR) 10Dwisehaupt: "I just realized I draft commented this on Thursday but never sent it. 
I also learned about the puppet request window so I'm happy to add i" [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [23:04:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:05:18] (KubernetesRsyslogDown) firing: (34) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:05:24] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:05:31] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:05:40] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:05:42] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:06:16] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:06:34] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:06:54] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:07:02] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:10:16] (KubernetesRsyslogDown) firing: (40) rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:14:33] (CR) Dzahn: "ARG3 is a string to be found in the content." [puppet] - https://gerrit.wikimedia.org/r/982172 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [23:15:16] (KubernetesRsyslogDown) firing: (40) rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:15:42] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: OpenSent - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:16:03] (CR) Dzahn: [C: -1] "changing ARG3 won't change that it's a 500. 
You can skip it entirely and still:" [puppet] - https://gerrit.wikimedia.org/r/982172 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [23:19:18] (CR) Dzahn: [C: -1] "the $ARG2 (-u) is what makes it turn from 200 into 500:" [puppet] - https://gerrit.wikimedia.org/r/982172 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [23:19:36] RECOVERY - SSH on kubemaster2001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:20:12] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:20:16] (KubernetesRsyslogDown) firing: (42) rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:20:20] PROBLEM - Check systemd state on kubemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:22] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:20:42] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:24:12] PROBLEM - SSH on kubemaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:25:16] (KubernetesRsyslogDown) firing: (42) rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:26:08] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, 
interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:26:26] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:26:44] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:27:40] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:30:16] (KubernetesRsyslogDown) firing: (37) rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:31:00] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:31:18] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:32:48] RECOVERY - SSH on kubemaster2002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:33:34] PROBLEM - Check systemd state on kubemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:32] (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:35:16] (KubernetesRsyslogDown) firing: (35) rsyslog on kubernetes2006:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:37:26] PROBLEM - SSH on kubemaster2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:40:16] (KubernetesRsyslogDown) firing: (38) rsyslog on kubernetes2006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:42:08] RECOVERY - SSH on kubemaster2001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:42:58] PROBLEM - Check systemd state on kubemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:16] (KubernetesRsyslogDown) firing: (43) rsyslog on kubernetes2006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:45:54] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv [23:45:54] e - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - 
kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IP [23:45:54] ve - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:47:34] (KubernetesCalicoDown) firing: (67) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:47:49] (KubernetesCalicoDown) firing: (67) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:48:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:48:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:50:16] (KubernetesRsyslogDown) firing: (43) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:50:31] (KubernetesRsyslogDown) firing: (43) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:50:50] RECOVERY - SSH on kubemaster2002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:51:32] PROBLEM - Check systemd state on kubemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:52:34] (KubernetesCalicoDown) firing: (43) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:52:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51008 bytes in 0.230 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:53:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:53:28] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv [23:53:28] e - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IP [23:53:28] ve - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitor [23:53:28] P_status [23:55:18] (KubernetesRsyslogDown) resolved: (39) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:55:40] (KubernetesAPINotScrapable) resolved: k8s@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:55:42] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv [23:55:42] e - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IP [23:55:42] ve - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/I [23:55:42] ive - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - 
kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/ [23:55:42] tive - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602 [23:55:43] ctive - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS6460 [23:55:43] Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS646 [23:55:44] Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - 
kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64 [23:55:44] : Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS6 [23:55:45] 6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:56:54] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:57:34] (KubernetesCalicoDown) firing: (53) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:58:26] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:58:44] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2056.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2054.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2052.codfw.wmnet, kubernetes2048.codfw.wmnet, kubernetes2059.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2047.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2029.codfw.wmnet, kubernetes2033.codfw. 
[23:58:44] ubernetes2008.codfw.wmnet, kubernetes2055.codfw.wmnet, kubernetes2044.codfw.wmnet are marked down but pooled: linkrecommendation-external_4006: Servers kubernetes2046.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2058.codfw.wmnet, kubernetes2025.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2054.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes [23:58:44] fw.wmnet, kubernetes2049.codfw.wmnet, kubernetes2043.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2055.codfw.wmnet, kubernetes2027.codfw.wmnet are marked down but pooled: push-not https://wikitech.wikimedia.org/wiki/PyBal [23:59:44] (ProbeDown) firing: Service miscweb2003:30443 has failed probes (http_transparency_archive_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:59:51] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate